feat: status-updater component controller (RUN-38194) #186
Open
BuildGpuOperatorValues generates Helm values that scope GPU Operator DaemonSets to mock pool nodes via aggregated node affinity selectors, disables driver/toolkit, and adds managed-by labels.
Add HelmManager interface for GPU Operator lifecycle, extend ReconcileParams with NodePoolLabelKey/GpuOperatorChartVersion/HelmManager, and wire Sync into the reconciler after K8s resource reconciliation. Chart version from topology CM overrides the default.
Enable componentController in profiles E2E values and add tests that verify per-pool kwok-gpu-device-plugin and kwok-status-exporter deployments are created with managed-by labels.
…UN-38194) When statusUpdater.componentController.enabled is true, skip static Helm templates for kwok-gpu-device-plugin, kwok-status-exporter, and kwok-dra-plugin deployments/services — the controller manages them.
- Fix setup.sh: load kwok-gpu-device-plugin image into kind, pass fallbackImageTag to helm install, wait for controller-managed deployments
- Expand E2E tests: verify per-pool deployments/services with correct labels and images, mock pool exclusion, static template gating, and pod health for all managed resources
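The fallback tag passed at install time suggests an image lookup that prefers a per-component override and otherwise falls back to a default registry and tag. A minimal sketch of that idea follows; `ComponentImage`, `resolveImage`, and the registry and tag values are hypothetical names chosen for illustration, and the real types in `internal/common/topology/types.go` may look different.

```go
package main

import "fmt"

// ComponentImage is a hypothetical per-component override.
// It is NOT the project's actual ComponentsConfig type.
type ComponentImage struct {
	Repository string // full repository override; empty means "use the default"
	Tag        string // tag override; empty means "use the fallback tag"
}

// resolveImage returns "repository:tag". Each field of the override wins
// when it is non-empty; otherwise the default registry plus the component
// name and the fallback tag are used.
func resolveImage(override *ComponentImage, defaultRegistry, component, fallbackTag string) string {
	repo := defaultRegistry + "/" + component
	tag := fallbackTag
	if override != nil {
		if override.Repository != "" {
			repo = override.Repository
		}
		if override.Tag != "" {
			tag = override.Tag
		}
	}
	return fmt.Sprintf("%s:%s", repo, tag)
}

func main() {
	// No override: both registry and tag come from the defaults.
	fmt.Println(resolveImage(nil, "registry.example.com/fgo", "kwok-gpu-device-plugin", "0.0.1"))
	// Tag-only override: the repository still comes from the defaults.
	fmt.Println(resolveImage(&ComponentImage{Tag: "dev"}, "registry.example.com/fgo", "kwok-status-exporter", "0.0.1"))
}
```

Resolving repository and tag independently lets a topology entry pin only a tag without having to repeat the registry.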
- Make ImagePullPolicy configurable via IMAGE_PULL_POLICY env var so kind clusters can use Never instead of Always
- Update 5 kwok-gpu-device-plugin RBAC templates to remain active when component controller is enabled (using `or` condition)
- Add robust wait_for_pods helper in setup.sh for retry-based pod waits
- Conditionally skip static kwok-dra-plugin wait when controller manages it
- Add kwok-gpu-device-plugin to kind image loading
- Detect component controller mode in E2E and skip static deployment tests
Add constants, types, image resolution, resource diffing, desired state computation, controller tests, and RBAC for the component controller. Includes design doc.
Drop the runtime Helm-managed path for the GPU Operator (helm.go, HelmManager interface, BuildGpuOperatorValues, CollectMockPools, and the reconciler.Sync wiring) along with the cross-pool coexistence E2E that exercised it. Mock-pool support will be delivered in Phase 5 (RUN-38195) via the GPU Operator as a Helm subchart dependency, not runtime install/upgrade from inside status-updater. The subchart approach avoids pulling the Helm SDK into the controller binary, keeps RBAC narrow, and matches standard Helm dependency patterns. The component controller's fake-pool reconciliation and BackendMock constant remain — only the GPU Operator runtime-Helm scaffolding is removed.
Summary
Evolves the status-updater into a central controller that programmatically manages per-pool KWOK component deployments (device-plugin, status-exporter, DRA plugin), replacing static Helm templates.
- Reconciles `fake` backend pools, while `mock` pools are left to the GPU Operator Helm release
- Resolves component images from `ComponentsConfig` in the topology CM, falling back to registry + tag defaults
- Gated by `statusUpdater.componentController.enabled`: static KWOK templates are gated off when enabled; RBAC templates use `or` to work in both modes
Key files
- `internal/status-updater/controllers/component/`: controller, reconciler, resource builders, diff, image resolution
- `internal/common/topology/types.go`: `ComponentsConfig` types
- `internal/common/constants/constants.go`: backend types, managed-resource labels
- `deploy/fake-gpu-operator/templates/`: feature-gated static templates, RBAC updates
Test plan