@samuelstolicny samuelstolicny commented Jan 27, 2026

Summary

Closes #1845

This PR adds GPU guest accelerator support for GCP nodepools, allowing users to attach NVIDIA GPUs to GCP compute instances via Claudie's InputManifest.

Changes

  • Proto: Added nvidiaGpuCount (renamed from nvidiaGpu) and nvidiaGpuType fields to MachineSpec
  • Manifest: Updated MachineSpec struct with new GPU fields and backward compatibility for nvidiaGpu
  • Validation: Added validateGCPGpuConfig() to ensure GCP nodepools with GPUs have the required type specified
  • CRD: Regenerated with new nvidiaGpuCount and nvidiaGpuType fields
  • Autoscaler: Updated to use NvidiaGpuCount field name
  • Documentation: Added GPU configuration docs for GCP provider
  • Makefile: Added kind-deploy target for easier local testing
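The validation rule from the list above can be sketched as follows. This is a minimal standalone sketch: the real `validateGCPGpuConfig()` in `validate_node_pool.go` operates on the full nodepool and provider context, so the signature and error wording here are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
)

// MachineSpec mirrors only the GPU-related fields added in this PR;
// all other fields are omitted for brevity.
type MachineSpec struct {
	NvidiaGpu      int32 // deprecated alias for NvidiaGpuCount
	NvidiaGpuCount int32
	NvidiaGpuType  string
}

// validateGCPGpuConfig enforces the rule described above: a GCP
// nodepool that requests GPUs (via either the new or the deprecated
// count field) must also name the GPU type.
func validateGCPGpuConfig(m *MachineSpec) error {
	if m == nil {
		return nil
	}
	count := m.NvidiaGpuCount
	if count == 0 {
		count = m.NvidiaGpu // fall back to the deprecated field
	}
	if count > 0 && m.NvidiaGpuType == "" {
		return errors.New("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
	}
	return nil
}

func main() {
	// Missing type: rejected for GCP.
	fmt.Println(validateGCPGpuConfig(&MachineSpec{NvidiaGpuCount: 1}))
	// Count plus type: accepted.
	fmt.Println(validateGCPGpuConfig(&MachineSpec{NvidiaGpuCount: 1, NvidiaGpuType: "nvidia-tesla-t4"}))
}
```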

Example Usage

nodePools:
  dynamic:
    - name: gpu-workers
      providerSpec:
        name: gcp-provider
        region: europe-west1
        zone: europe-west1-b
      count: 1
      serverType: n1-standard-4
      image: ubuntu-2204-lts
      machineSpec:
        nvidiaGpuCount: 1
        nvidiaGpuType: nvidia-tesla-t4

Backward Compatibility

  • The old nvidiaGpu field is preserved as a deprecated alias for nvidiaGpuCount
  • Non-GCP providers can still use GPU count without specifying type
  • Proto field numbers maintained for wire compatibility
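For example, a manifest fragment that still uses the deprecated field remains valid and is resolved to the new name internally (illustrative sketch, following the Example Usage above):

```yaml
machineSpec:
  nvidiaGpu: 1   # deprecated alias, resolved as nvidiaGpuCount: 1
```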

Template Changes

GCP template changes implemented in berops/claudie-config#22

Summary by CodeRabbit

Release Notes

  • New Features

    • GCP now supports NVIDIA GPUs with configurable GPU types
    • Added comprehensive GPU Operator deployment guide with step-by-step instructions
  • Documentation

    • Updated GPU configuration examples for AWS and GCP with detailed machineSpec specifications
    • Added GCP GPU support documentation including available GPU types and deployment considerations
    • Updated provider support matrix reflecting full GCP feature support


Samuel STOLICNY (contractor) added 2 commits January 27, 2026 16:24
Add support for attaching NVIDIA GPUs to GCP compute instances via the guest_accelerator block. Unlike other providers, where GPU-enabled instance types automatically include GPUs, GCP requires explicit GPU type and count configuration.
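The corresponding Terraform change (implemented in berops/claudie-config#22) centers on GCP's guest_accelerator block. A minimal sketch of what such a block looks like, with values matching the Example Usage above (the resource name and surrounding attributes are illustrative, not taken from the actual template):

```hcl
resource "google_compute_instance" "gpu_node" {
  name         = "gpu-workers-1"   # illustrative
  machine_type = "n1-standard-4"
  zone         = "europe-west1-b"

  guest_accelerator {
    type  = "nvidia-tesla-t4"      # maps to nvidiaGpuType
    count = 1                      # maps to nvidiaGpuCount
  }

  # GCP cannot live-migrate VMs with attached GPUs, so host
  # maintenance must terminate the instance.
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  boot_disk {
    initialize_params {
      image = "ubuntu-2204-lts"
    }
  }

  network_interface {
    network = "default"
  }
}
```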

coderabbitai bot commented Jan 27, 2026

Walkthrough

This PR enables GPU acceleration support for cloud providers where instance types don't automatically include GPUs. The changes introduce GPU count and type fields to the MachineSpec schema, add GCP-specific GPU configuration validation, update documentation, and refactor existing GPU field references throughout the codebase to support the new naming convention while maintaining backward compatibility.

Changes

  • Build Configuration (Makefile): Added kind-deploy PHONY target with KIND_CLUSTER and KIND_NAMESPACE variables; updated kind-load-images to use a configurable cluster name.
  • API Schema & Protobuf (internal/api/manifest/manifest.go, proto/spec/nodepool.proto, manifests/claudie/crd/claudie.io_inputmanifests.yaml): Renamed NvidiaGpu to NvidiaGpuCount and added a new NvidiaGpuType field for GPU accelerator type specification; deprecated the original field for backward compatibility; updated Kubernetes version constraint to v1.12.1.
  • Validation Logic (internal/api/manifest/validate_node_pool.go, internal/api/manifest/validate_test.go): Added a validateGCPGpuConfig function enforcing that GCP GPU configurations specify both NvidiaGpuCount and NvidiaGpuType; introduced comprehensive test coverage for GPU validation across providers.
  • Manifest Processing (internal/api/manifest/utils.go): Implemented backward-compatible GPU field resolution with a fallback from NvidiaGpuCount to the deprecated NvidiaGpu; updated the MachineSpec mapping to include NvidiaGpuType.
  • Autoscaler Integration (services/autoscaler-adapter/node_manager/node_manager.go, services/autoscaler-adapter/node_manager/utils.go): Updated the GPU count source from NvidiaGpu to NvidiaGpuCount; updated inline comment references.
  • Documentation (README.md, docs/input-manifest/api-reference.md, docs/input-manifest/gpu-example.md, docs/input-manifest/providers/gcp.md): Added GCP GPU support documentation with field requirements and available GPU types; restructured GPU examples into AWS and GCP sections; introduced a GPU Operator deployment guide; updated the provider support table to indicate full GCP support.

Possibly related PRs

  • #1854: Modifies MachineSpec GPU fields, protobuf schema, and autoscaler node-manager logic with overlapping changes to GPU configuration handling.
  • #1943: Updates README provider support table GPU entries, creating potential conflict with GCP GPU support status changes.

Suggested labels

test-set-autoscaling, documentation

🚥 Pre-merge checks: ✅ 5 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'feat(gcp): add GPU guest accelerator support for GCP nodepools' directly and clearly summarizes the main change.
  • Linked Issues check: ✅ Passed. The PR implements all core objectives from #1845: it adds GPU configuration fields (nvidiaGpuCount, nvidiaGpuType) to MachineSpec, enforces the GPU type requirement for GCP via validation, updates the autoscaler to use the new fields, extends documentation with GCP GPU examples, and maintains backward compatibility.
  • Out of Scope Changes check: ✅ Passed. The Makefile kind-deploy target and the README GCP provider update are tangential to the core GPU feature but support development/testing and provider support accuracy, respectively.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.



@samuelstolicny samuelstolicny marked this pull request as ready for review January 28, 2026 08:39
@samuelstolicny samuelstolicny added the labels feature (New feature) and refresh-docs (Trigger automatic update of the latest docs version; a /refresh-docs comment is also a trigger) on Jan 28, 2026
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/input-manifest/providers/gcp.md`:
- Around line 98-115: Fix the hyphenation in the `nvidia-tesla-v100` description
to "high-performance training" and update the machine-type guidance: remove any
mention of N2/N2D as GPU-capable, state that attached GPUs should use
`n1-standard-*` or `n1-highmem-*` machine types (i.e., N1 supports GPUs), and
add a clarified note that GPU-optimized families such as A2, G2, A3, A4, G4 come
with pre-attached GPU configurations rather than supporting arbitrary attached
GPUs; keep the `nvidia-tesla-*` GPU type list as-is but ensure the "GPU
Availability" and "GPU Instance Limitations" warnings reflect this distinction.
🧹 Nitpick comments (3)
Makefile (1)

95-102: Fail fast in kind-deploy and scope rollout to updated deployments.

Line 97–102: as written, a failed kubectl set image inside the loop can be masked, and the final kubectl rollout status deployment -n $(KIND_NAMESPACE) can block on unrelated deployments in the namespace. Consider failing fast and rolling out per deployment with a timeout.

♻️ Suggested update
```diff
 kind-deploy: kind-load-images
 	@echo " --- updating deployments in $(KIND_NAMESPACE) namespace --- "
-	@for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
+	@set -e; \
+	for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
 		echo " --- updating $$svc deployment --- "; \
 		kubectl set image deployment/$$svc $$svc=ghcr.io/berops/claudie/$$svc:$(REV) -n $(KIND_NAMESPACE); \
+		kubectl rollout status deployment/$$svc -n $(KIND_NAMESPACE) --timeout=5m; \
 	done
-	@echo " --- waiting for rollouts to complete --- "
-	@kubectl rollout status deployment -n $(KIND_NAMESPACE)
+	@echo " --- rollouts completed --- "
```
internal/api/manifest/validate_node_pool.go (1)

145-170: Clarify error text for deprecated GPU count usage.

If users still set nvidiaGpu (deprecated), the current message may be confusing. Consider naming both fields.

🛠️ Suggested tweak
-		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
+		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount (or deprecated nvidiaGpu) > 0")
internal/api/manifest/validate_test.go (1)

332-440: Consider adding test case for GCP with deprecated NvidiaGpu field.

The test covers the new NvidiaGpuCount field well for GCP and validates backward compatibility for non-GCP providers using the deprecated NvidiaGpu field. However, there's a gap: what happens when a GCP nodepool uses the deprecated NvidiaGpu field without specifying NvidiaGpuType?

If the validation logic correctly considers both fields when determining GPU presence, this scenario should also fail for GCP. Adding this test case would ensure the deprecated field path is also validated consistently for GCP.

Suggested additional test case
 	r.NoError(hetznerNodepoolDeprecatedGpu.Validate(hetznerManifest), "Non-GCP nodepool with deprecated nvidiaGpu but no type should pass validation")
+
+	// Test case 6: GCP nodepool with deprecated nvidiaGpu field but no type - should fail (GCP requires type regardless of which field is used)
+	gcpNodepoolDeprecatedGpuNoType := &DynamicNodePool{
+		Name:       "gpu-np-dep",
+		ServerType: "n1-standard-4",
+		Image:      "ubuntu-2204",
+		Count:      1,
+		ProviderSpec: ProviderSpec{
+			Name:   "gcp-1",
+			Region: "us-central1",
+			Zone:   "us-central1-a",
+		},
+		MachineSpec: &MachineSpec{
+			NvidiaGpu: 1, // Using deprecated field
+		},
+	}
+	r.Error(gcpNodepoolDeprecatedGpuNoType.Validate(gcpManifest), "GCP nodepool with deprecated nvidiaGpu but no type should fail validation")
 }



Development

Successfully merging this pull request may close these issues.

Feature: Enable GPU acceleration for instances where instance-type does not automatically enable the GPU's
