feat(gcp): add GPU guest accelerator support for GCP nodepools #1952
base: master
Conversation
Add support for attaching NVIDIA GPUs to GCP compute instances via the guest_accelerator block. Unlike other providers, where GPU-enabled instance types automatically include GPUs, GCP requires explicit GPU type and count configuration.
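For context, a rough Terraform sketch of what a GCP instance with an attached GPU needs; the resource name and values here are hypothetical, and the actual template changes live in berops/claudie-config#22:

```hcl
# Hypothetical sketch of a GCP instance with an attached GPU; the real
# templates are in berops/claudie-config#22.
resource "google_compute_instance" "gpu_node" {
  name         = "gpu-node"        # hypothetical name
  machine_type = "n1-standard-4"   # attached GPUs require N1 machine types
  zone         = "us-central1-a"

  # The block this PR's manifest fields map onto: explicit type and count.
  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  # Instances with attached GPUs cannot live-migrate, so host maintenance
  # must terminate the instance.
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2204-lts"
    }
  }

  network_interface {
    network = "default"
  }
}
```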
Walkthrough
This PR enables GPU acceleration support for cloud providers where instance types don't automatically include GPUs. The changes introduce GPU count and type fields to the MachineSpec schema, add GCP-specific GPU configuration validation, update documentation, and refactor existing GPU field references throughout the codebase to support the new naming convention while maintaining backward compatibility.
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/input-manifest/providers/gcp.md`:
- Around line 98-115: Fix the hyphenation in the `nvidia-tesla-v100` description
to "high-performance training" and update the machine-type guidance: remove any
mention of N2/N2D as GPU-capable, state that attached GPUs should use
`n1-standard-*` or `n1-highmem-*` machine types (i.e., N1 supports GPUs), and
add a clarified note that GPU-optimized families such as A2, G2, A3, A4, G4 come
with pre-attached GPU configurations rather than supporting arbitrary attached
GPUs; keep the `nvidia-tesla-*` GPU type list as-is but ensure the "GPU
Availability" and "GPU Instance Limitations" warnings reflect this distinction.
🧹 Nitpick comments (3)
Makefile (1)
95-102: Fail fast in kind-deploy and scope rollout to updated deployments.
Lines 97–102: as written, a failed `kubectl set image` inside the loop can be masked, and the final `kubectl rollout status deployment -n $(KIND_NAMESPACE)` can block on unrelated deployments in the namespace. Consider failing fast and rolling out per deployment with a timeout.
♻️ Suggested update
```diff
 kind-deploy: kind-load-images
 	@echo " --- updating deployments in $(KIND_NAMESPACE) namespace --- "
-	@for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
+	@set -e; \
+	for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
 		echo " --- updating $$svc deployment --- "; \
 		kubectl set image deployment/$$svc $$svc=ghcr.io/berops/claudie/$$svc:$(REV) -n $(KIND_NAMESPACE); \
+		kubectl rollout status deployment/$$svc -n $(KIND_NAMESPACE) --timeout=5m; \
 	done
-	@echo " --- waiting for rollouts to complete --- "
-	@kubectl rollout status deployment -n $(KIND_NAMESPACE)
+	@echo " --- rollouts completed --- "
```
internal/api/manifest/validate_node_pool.go (1)
145-170: Clarify error text for deprecated GPU count usage.
If users still set `nvidiaGpu` (deprecated), the current message may be confusing. Consider naming both fields.
🛠️ Suggested tweak
```diff
-	return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
+	return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount (or deprecated nvidiaGpu) > 0")
```
internal/api/manifest/validate_test.go (1)
332-440: Consider adding a test case for GCP with the deprecated `NvidiaGpu` field.
The test covers the new `NvidiaGpuCount` field well for GCP and validates backward compatibility for non-GCP providers using the deprecated `NvidiaGpu` field. However, there's a gap: what happens when a GCP nodepool uses the deprecated `NvidiaGpu` field without specifying `NvidiaGpuType`? If the validation logic correctly considers both fields when determining GPU presence, this scenario should also fail for GCP. Adding this test case would ensure the deprecated field path is also validated consistently for GCP.
Suggested additional test case
```diff
 	r.NoError(hetznerNodepoolDeprecatedGpu.Validate(hetznerManifest), "Non-GCP nodepool with deprecated nvidiaGpu but no type should pass validation")
+
+	// Test case 6: GCP nodepool with deprecated nvidiaGpu field but no type - should fail (GCP requires type regardless of which field is used)
+	gcpNodepoolDeprecatedGpuNoType := &DynamicNodePool{
+		Name:       "gpu-np-dep",
+		ServerType: "n1-standard-4",
+		Image:      "ubuntu-2204",
+		Count:      1,
+		ProviderSpec: ProviderSpec{
+			Name:   "gcp-1",
+			Region: "us-central1",
+			Zone:   "us-central1-a",
+		},
+		MachineSpec: &MachineSpec{
+			NvidiaGpu: 1, // Using deprecated field
+		},
+	}
+	r.Error(gcpNodepoolDeprecatedGpuNoType.Validate(gcpManifest), "GCP nodepool with deprecated nvidiaGpu but no type should fail validation")
 }
```
Summary
Closes #1845
This PR adds GPU guest accelerator support for GCP nodepools, allowing users to attach NVIDIA GPUs to GCP compute instances via Claudie's InputManifest.
Changes
- Added `nvidiaGpuCount` (renamed from `nvidiaGpu`) and `nvidiaGpuType` fields to `MachineSpec`
- Updated the `MachineSpec` struct with the new GPU fields and backward compatibility for `nvidiaGpu`
- Added `validateGCPGpuConfig()` to ensure GCP nodepools with GPUs have the required type specified
- Documented the `nvidiaGpuCount` and `nvidiaGpuType` fields
- Refactored existing references to the `NvidiaGpuCount` field name
- Added a `kind-deploy` target for easier local testing
Example Usage
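A minimal sketch of the new fields in an InputManifest; only the `machineSpec` GPU fields come from this PR, while the surrounding values (provider name, region, image) are illustrative and assume the usual Claudie dynamic-nodepool schema:

```yaml
# Illustrative InputManifest snippet; values outside machineSpec are assumptions.
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: gpu-example
spec:
  nodePools:
    dynamic:
      - name: gpu-pool
        providerSpec:
          name: gcp-1
          region: us-central1
          zone: us-central1-a
        count: 1
        serverType: n1-standard-4   # attached GPUs require N1 machine types
        image: ubuntu-2204
        machineSpec:
          nvidiaGpuCount: 1                # renamed from nvidiaGpu
          nvidiaGpuType: nvidia-tesla-t4   # required by GCP when count > 0
```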
Backward Compatibility
The `nvidiaGpu` field is preserved as a deprecated alias for `nvidiaGpuCount`.
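A minimal sketch of how the dual-field validation can work, assuming the `DynamicNodePool` and `MachineSpec` shapes seen in the test diff above; the actual implementation lives in `internal/api/manifest/validate_node_pool.go`:

```go
package manifest

import "fmt"

// validateGCPGpuConfig is a sketch only: it mirrors the behavior described in
// this PR (type required whenever a GPU is requested), not the literal code.
func validateGCPGpuConfig(np *DynamicNodePool) error {
	if np.MachineSpec == nil {
		return nil
	}
	// Treat the deprecated nvidiaGpu field as an alias for nvidiaGpuCount so
	// older manifests keep working, but still enforce the GCP type requirement.
	count := np.MachineSpec.NvidiaGpuCount
	if count == 0 {
		count = np.MachineSpec.NvidiaGpu // deprecated alias
	}
	if count > 0 && np.MachineSpec.NvidiaGpuType == "" {
		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount (or deprecated nvidiaGpu) > 0")
	}
	return nil
}
```

Considering both fields in one place also covers the review's concern that a GCP nodepool using only the deprecated `nvidiaGpu` field must still fail validation when `nvidiaGpuType` is missing.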
Template Changes
GCP template changes are implemented in berops/claudie-config#22.
Summary by CodeRabbit
Release Notes
New Features
- GPU (guest accelerator) support for GCP nodepools via the new `nvidiaGpuCount` and `nvidiaGpuType` fields
Documentation
- Updated GCP provider documentation with GPU type and machine-type guidance