@samuelstolicny samuelstolicny commented Jan 27, 2026

Summary

Closes #1845

This PR adds GPU guest accelerator support for GCP nodepools, allowing users to attach NVIDIA GPUs to GCP compute instances via Claudie's InputManifest.

Changes

  • Proto: Added nvidiaGpuCount (renamed from nvidiaGpu) and nvidiaGpuType fields to MachineSpec
  • Manifest: Updated MachineSpec struct with new GPU fields and backward compatibility for nvidiaGpu
  • Validation: Added validateGCPGpuConfig() to ensure GCP nodepools with GPUs have the required type specified
  • CRD: Regenerated with new nvidiaGpuCount and nvidiaGpuType fields
  • Autoscaler: Updated to use NvidiaGpuCount field name
  • Documentation: Added GPU configuration docs for GCP provider
  • Makefile: Added kind-deploy target for easier local testing
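The validation rule from the list above can be sketched as follows. This is a minimal standalone sketch: the real `validateGCPGpuConfig()` in `validate_node_pool.go` operates on the full nodepool and provider context, so the signature and error wording here are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
)

// MachineSpec mirrors only the GPU-related fields added in this PR;
// all other fields are omitted for brevity.
type MachineSpec struct {
	NvidiaGpu      int32 // deprecated alias for NvidiaGpuCount
	NvidiaGpuCount int32
	NvidiaGpuType  string
}

// validateGCPGpuConfig enforces the rule described above: a GCP
// nodepool that requests GPUs (via either the new or the deprecated
// count field) must also name the GPU type.
func validateGCPGpuConfig(m *MachineSpec) error {
	if m == nil {
		return nil
	}
	count := m.NvidiaGpuCount
	if count == 0 {
		count = m.NvidiaGpu // fall back to the deprecated field
	}
	if count > 0 && m.NvidiaGpuType == "" {
		return errors.New("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
	}
	return nil
}

func main() {
	// Missing type: rejected for GCP.
	fmt.Println(validateGCPGpuConfig(&MachineSpec{NvidiaGpuCount: 1}))
	// Count plus type: accepted.
	fmt.Println(validateGCPGpuConfig(&MachineSpec{NvidiaGpuCount: 1, NvidiaGpuType: "nvidia-tesla-t4"}))
}
```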

Example Usage

nodePools:
  dynamic:
    - name: gpu-workers
      providerSpec:
        name: gcp-provider
        region: europe-west1
        zone: europe-west1-b
      count: 1
      serverType: n1-standard-4
      image: ubuntu-2204-lts
      machineSpec:
        nvidiaGpuCount: 1
        nvidiaGpuType: nvidia-tesla-t4

Backward Compatibility

  • The old nvidiaGpu field is preserved as a deprecated alias for nvidiaGpuCount
  • Non-GCP providers can still use GPU count without specifying type
  • Proto field numbers maintained for wire compatibility
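For example, a manifest fragment that still uses the deprecated field remains valid and is resolved to the new name internally (illustrative sketch, following the Example Usage above):

```yaml
machineSpec:
  nvidiaGpu: 1   # deprecated alias, resolved as nvidiaGpuCount: 1
```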

Template Changes

GCP template changes implemented in berops/claudie-config#22

Summary by CodeRabbit

Release Notes

  • New Features

    • GCP now supports NVIDIA GPUs with configurable GPU types
    • Added comprehensive GPU Operator deployment guide with step-by-step instructions
  • Documentation

    • Updated GPU configuration examples for AWS and GCP with detailed machineSpec specifications
    • Added GCP GPU support documentation including available GPU types and deployment considerations
    • Updated provider support matrix reflecting full GCP feature support


Samuel STOLICNY (contractor) added 2 commits January 27, 2026 16:24
Add support for attaching NVIDIA GPUs to GCP compute instances via the guest_accelerator block. Unlike other providers, where GPU-enabled instance types automatically include GPUs, GCP requires explicit GPU type and count configuration.
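The corresponding Terraform change (implemented in berops/claudie-config#22) centers on GCP's guest_accelerator block. A minimal sketch of what such a block looks like, with values matching the Example Usage above (the resource name and surrounding attributes are illustrative, not taken from the actual template):

```hcl
resource "google_compute_instance" "gpu_node" {
  name         = "gpu-workers-1"   # illustrative
  machine_type = "n1-standard-4"
  zone         = "europe-west1-b"

  guest_accelerator {
    type  = "nvidia-tesla-t4"      # maps to nvidiaGpuType
    count = 1                      # maps to nvidiaGpuCount
  }

  # GCP cannot live-migrate VMs with attached GPUs, so host
  # maintenance must terminate the instance.
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  boot_disk {
    initialize_params {
      image = "ubuntu-2204-lts"
    }
  }

  network_interface {
    network = "default"
  }
}
```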

coderabbitai bot commented Jan 27, 2026

Walkthrough

This PR enables GPU acceleration support for cloud providers where instance types don't automatically include GPUs. The changes introduce GPU count and type fields to the MachineSpec schema, add GCP-specific GPU configuration validation, update documentation, and refactor existing GPU field references throughout the codebase to support the new naming convention while maintaining backward compatibility.

Changes

  • Build Configuration (Makefile): Added kind-deploy PHONY target with KIND_CLUSTER and KIND_NAMESPACE variables; updated kind-load-images to use a configurable cluster name.
  • API Schema & Protobuf (internal/api/manifest/manifest.go, proto/spec/nodepool.proto, manifests/claudie/crd/claudie.io_inputmanifests.yaml): Renamed NvidiaGpu to NvidiaGpuCount and added a new NvidiaGpuType field for GPU accelerator type specification; deprecated the original field for backward compatibility; updated Kubernetes version constraint to v1.12.1.
  • Validation Logic (internal/api/manifest/validate_node_pool.go, internal/api/manifest/validate_test.go): Added a validateGCPGpuConfig function enforcing that GCP GPU configurations specify both NvidiaGpuCount and NvidiaGpuType; introduced comprehensive test coverage for GPU validation across providers.
  • Manifest Processing (internal/api/manifest/utils.go): Implemented backward-compatible GPU field resolution with a fallback from NvidiaGpuCount to the deprecated NvidiaGpu; updated the MachineSpec mapping to include NvidiaGpuType.
  • Autoscaler Integration (services/autoscaler-adapter/node_manager/node_manager.go, services/autoscaler-adapter/node_manager/utils.go): Updated the GPU count source from NvidiaGpu to NvidiaGpuCount; updated inline comment references.
  • Documentation (README.md, docs/input-manifest/api-reference.md, docs/input-manifest/gpu-example.md, docs/input-manifest/providers/gcp.md): Added GCP GPU support documentation with field requirements and available GPU types; restructured GPU examples into AWS and GCP sections; introduced a GPU Operator deployment guide; updated the provider support table to indicate full GCP support.

Possibly related PRs

  • #1854: Modifies MachineSpec GPU fields, protobuf schema, and autoscaler node-manager logic with overlapping changes to GPU configuration handling.
  • #1943: Updates README provider support table GPU entries, creating potential conflict with GCP GPU support status changes.

Suggested labels

test-set-autoscaling, documentation

🚥 Pre-merge checks: ✅ 5 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'feat(gcp): add GPU guest accelerator support for GCP nodepools' directly and clearly summarizes the main change.
  • Linked Issues check: ✅ Passed. The PR implements all core objectives from #1845: it adds GPU configuration fields (nvidiaGpuCount, nvidiaGpuType) to MachineSpec, enforces the GPU type requirement for GCP via validation, updates the autoscaler to use the new fields, extends documentation with GCP GPU examples, and maintains backward compatibility.
  • Out of Scope Changes check: ✅ Passed. The Makefile kind-deploy target and the README GCP provider update are tangential to the core GPU feature but support development/testing and provider support accuracy, respectively.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.



@samuelstolicny samuelstolicny marked this pull request as ready for review January 28, 2026 08:39
@samuelstolicny samuelstolicny added the labels feature (New feature) and refresh-docs (Trigger automatic update of the latest docs version; a /refresh-docs comment is also a trigger) on Jan 28, 2026
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/input-manifest/providers/gcp.md`:
- Around line 98-115: Fix the hyphenation in the `nvidia-tesla-v100` description
to "high-performance training" and update the machine-type guidance: remove any
mention of N2/N2D as GPU-capable, state that attached GPUs should use
`n1-standard-*` or `n1-highmem-*` machine types (i.e., N1 supports GPUs), and
add a clarified note that GPU-optimized families such as A2, G2, A3, A4, G4 come
with pre-attached GPU configurations rather than supporting arbitrary attached
GPUs; keep the `nvidia-tesla-*` GPU type list as-is but ensure the "GPU
Availability" and "GPU Instance Limitations" warnings reflect this distinction.
🧹 Nitpick comments (3)
Makefile (1)

95-102: Fail fast in kind-deploy and scope rollout to updated deployments.

Line 97–102: as written, a failed kubectl set image inside the loop can be masked, and the final kubectl rollout status deployment -n $(KIND_NAMESPACE) can block on unrelated deployments in the namespace. Consider failing fast and rolling out per deployment with a timeout.

♻️ Suggested update
```diff
 kind-deploy: kind-load-images
 	@echo " --- updating deployments in $(KIND_NAMESPACE) namespace --- "
-	@for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
+	@set -e; \
+	for svc in ansibler builder claudie-operator kube-eleven kuber manager terraformer; do \
 		echo " --- updating $$svc deployment --- "; \
 		kubectl set image deployment/$$svc $$svc=ghcr.io/berops/claudie/$$svc:$(REV) -n $(KIND_NAMESPACE); \
+		kubectl rollout status deployment/$$svc -n $(KIND_NAMESPACE) --timeout=5m; \
 	done
-	@echo " --- waiting for rollouts to complete --- "
-	@kubectl rollout status deployment -n $(KIND_NAMESPACE)
+	@echo " --- rollouts completed --- "
```
internal/api/manifest/validate_node_pool.go (1)

145-170: Clarify error text for deprecated GPU count usage.

If users still set nvidiaGpu (deprecated), the current message may be confusing. Consider naming both fields.

🛠️ Suggested tweak
-		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount > 0")
+		return fmt.Errorf("nvidiaGpuType is required for GCP when nvidiaGpuCount (or deprecated nvidiaGpu) > 0")
internal/api/manifest/validate_test.go (1)

332-440: Consider adding test case for GCP with deprecated NvidiaGpu field.

The test covers the new NvidiaGpuCount field well for GCP and validates backward compatibility for non-GCP providers using the deprecated NvidiaGpu field. However, there's a gap: what happens when a GCP nodepool uses the deprecated NvidiaGpu field without specifying NvidiaGpuType?

If the validation logic correctly considers both fields when determining GPU presence, this scenario should also fail for GCP. Adding this test case would ensure the deprecated field path is also validated consistently for GCP.

Suggested additional test case
 	r.NoError(hetznerNodepoolDeprecatedGpu.Validate(hetznerManifest), "Non-GCP nodepool with deprecated nvidiaGpu but no type should pass validation")
+
+	// Test case 6: GCP nodepool with deprecated nvidiaGpu field but no type - should fail (GCP requires type regardless of which field is used)
+	gcpNodepoolDeprecatedGpuNoType := &DynamicNodePool{
+		Name:       "gpu-np-dep",
+		ServerType: "n1-standard-4",
+		Image:      "ubuntu-2204",
+		Count:      1,
+		ProviderSpec: ProviderSpec{
+			Name:   "gcp-1",
+			Region: "us-central1",
+			Zone:   "us-central1-a",
+		},
+		MachineSpec: &MachineSpec{
+			NvidiaGpu: 1, // Using deprecated field
+		},
+	}
+	r.Error(gcpNodepoolDeprecatedGpuNoType.Validate(gcpManifest), "GCP nodepool with deprecated nvidiaGpu but no type should fail validation")
 }



Development

Successfully merging this pull request may close these issues.

Feature: Enable GPU acceleration for instances where instance-type does not automatically enable the GPU's
