feat: status-updater component controller (RUN-38194) #186

Open
eliranw wants to merge 10 commits into main from eliranw/RUN-38194-status-updater-controller

eliranw commented Apr 23, 2026

Summary

Evolves the status-updater into a central controller that programmatically manages per-pool KWOK component deployments (device-plugin, status-exporter, DRA plugin), replacing static Helm templates.

  • Component controller watches the topology ConfigMap and reconciles per-pool Deployments and Services for fake backend pools, while mock pools are left to the GPU Operator Helm release
  • Image resolution with per-component overrides via ComponentsConfig in the topology CM, falling back to registry + tag defaults
  • Resource diffing creates/updates/deletes deployments and services to match desired state
  • Feature-flagged via `statusUpdater.componentController.enabled`: when enabled, the static KWOK templates are gated off, and the RBAC templates use an `or` condition so they work in both modes
  • ImagePullPolicy propagation from Helm values through to controller-created deployments (needed for kind clusters using preloaded images)
  • HelmManager interface abstracts GPU Operator Helm lifecycle for mock pools, with NoopHelmManager as safe default
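
The resource-diffing bullet above can be illustrated with a minimal sketch. It is not the PR's implementation: the `Resource`, `Plan`, and `Diff` names, the string `Spec` fingerprint, and the example resource names are all hypothetical stand-ins for whatever the controller compares on real Deployments and Services.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical stand-in for a managed Deployment or Service.
// Spec is a flattened fingerprint of the fields the controller cares about.
type Resource struct {
	Name string
	Spec string
}

// Plan lists the actions needed to move the cluster to the desired state.
type Plan struct {
	Create []Resource
	Update []Resource
	Delete []string
}

// Diff compares desired and existing resources by name.
func Diff(desired, existing []Resource) Plan {
	current := make(map[string]Resource, len(existing))
	for _, r := range existing {
		current[r.Name] = r
	}

	var plan Plan
	for _, want := range desired {
		have, found := current[want.Name]
		switch {
		case !found:
			plan.Create = append(plan.Create, want)
		case have.Spec != want.Spec:
			plan.Update = append(plan.Update, want)
		}
		// Whatever is left in current afterwards is no longer desired.
		delete(current, want.Name)
	}
	for name := range current {
		plan.Delete = append(plan.Delete, name)
	}
	sort.Strings(plan.Delete) // map iteration order is random; sort for stable output
	return plan
}

func main() {
	// Resource names below are illustrative, not the controller's naming scheme.
	desired := []Resource{
		{Name: "kwok-gpu-device-plugin-pool-a", Spec: "image=v2"},
		{Name: "kwok-status-exporter-pool-a", Spec: "image=v1"},
	}
	existing := []Resource{
		{Name: "kwok-gpu-device-plugin-pool-a", Spec: "image=v1"},
		{Name: "kwok-status-exporter-pool-b", Spec: "image=v1"},
	}
	plan := Diff(desired, existing)
	fmt.Printf("create=%d update=%d delete=%v\n", len(plan.Create), len(plan.Update), plan.Delete)
}
```

Keying the comparison by name keeps the three outcomes (create, update, delete) explicit, which is presumably what lets a removed pool have its deployments and services cleaned up.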

Key files

  • internal/status-updater/controllers/component/: controller, reconciler, resource builders, diff, image resolution
  • internal/common/topology/types.go: ComponentsConfig types
  • internal/common/constants/constants.go: backend types, managed-resource labels
  • deploy/fake-gpu-operator/templates/: feature-gated static templates, RBAC updates
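
For reviewers, here is a rough sketch of the image-resolution rule described in the summary (per-component override first, then registry + tag defaults). The field names on `ComponentsConfig`, the `Defaults` type, the `ResolveImage` function, and the `registry/component:tag` layout are assumptions made for illustration; the actual types live in `types.go` and may differ.

```go
package main

import "fmt"

// ComponentsConfig is a hypothetical shape for the per-component overrides
// carried in the topology ConfigMap: component name -> full image reference.
type ComponentsConfig struct {
	Images map[string]string
}

// Defaults holds the fallback registry and tag (assumed to come from Helm values).
type Defaults struct {
	Registry string
	Tag      string
}

// ResolveImage returns the override for a component when one is set,
// otherwise it builds "<registry>/<component>:<tag>" from the defaults.
func ResolveImage(component string, cfg *ComponentsConfig, d Defaults) string {
	if cfg != nil {
		// Reading from a nil map is safe in Go and yields the empty string.
		if img := cfg.Images[component]; img != "" {
			return img
		}
	}
	return fmt.Sprintf("%s/%s:%s", d.Registry, component, d.Tag)
}

func main() {
	// Registry, tag, and override values are placeholders.
	d := Defaults{Registry: "registry.example.com/fake-gpu-operator", Tag: "0.1.0"}
	cfg := &ComponentsConfig{Images: map[string]string{
		"kwok-dra-plugin": "registry.example.com/custom/kwok-dra-plugin:dev",
	}}

	fmt.Println(ResolveImage("kwok-dra-plugin", cfg, d))      // override wins
	fmt.Println(ResolveImage("kwok-status-exporter", cfg, d)) // falls back to defaults
	fmt.Println(ResolveImage("kwok-status-exporter", nil, d)) // no config at all
}
```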

Test plan

  • 32 unit tests covering resource builders, diffing, image resolution, reconciler, controller, and Helm manager
  • 10 E2E tests verifying controller-managed deployments, services, images, pod availability, mock pool exclusion, and static template gating
  • E2E results: 31/37 passed, 4 failed (pre-existing DRA plugin issues), 2 skipped (static tests correctly skipped in controller mode)

eliranw added 8 commits April 23, 2026 15:11
BuildGpuOperatorValues generates Helm values that scope GPU Operator
DaemonSets to mock pool nodes via aggregated node affinity selectors,
disables driver/toolkit, and adds managed-by labels.

Add HelmManager interface for GPU Operator lifecycle, extend
ReconcileParams with NodePoolLabelKey/GpuOperatorChartVersion/HelmManager,
and wire Sync into the reconciler after K8s resource reconciliation.
Chart version from topology CM overrides the default.

Enable componentController in profiles E2E values and add tests that
verify per-pool kwok-gpu-device-plugin and kwok-status-exporter
deployments are created with managed-by labels.

…UN-38194)

When statusUpdater.componentController.enabled is true, skip static
Helm templates for kwok-gpu-device-plugin, kwok-status-exporter, and
kwok-dra-plugin deployments/services, since the controller manages them.

- Fix setup.sh: load kwok-gpu-device-plugin image into kind, pass
  fallbackImageTag to helm install, wait for controller-managed deployments
- Expand E2E tests: verify per-pool deployments/services with correct
  labels and images, mock pool exclusion, static template gating,
  and pod health for all managed resources
- Make ImagePullPolicy configurable via IMAGE_PULL_POLICY env var so
  kind clusters can use Never instead of Always
- Update 5 kwok-gpu-device-plugin RBAC templates to remain active when
  component controller is enabled (using `or` condition)
- Add robust wait_for_pods helper in setup.sh for retry-based pod waits
- Conditionally skip static kwok-dra-plugin wait when controller manages it
- Add kwok-gpu-device-plugin to kind image loading
- Detect component controller mode in E2E and skip static deployment tests

Add constants, types, image resolution, resource diffing, desired state
computation, controller tests, and RBAC for the component controller.
Includes design doc.
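
The `IMAGE_PULL_POLICY` item in the commit notes above could look roughly like the sketch below. The helper name, the injected lookup function, and the choice to fall back to `Always` on a missing or unrecognized value are assumptions; the commit only states that kind clusters can use `Never` instead of `Always`.

```go
package main

import (
	"fmt"
	"os"
)

// defaultPullPolicy is assumed from the commit note ("Never instead of Always").
const defaultPullPolicy = "Always"

// pullPolicyFromEnv reads IMAGE_PULL_POLICY through the supplied lookup
// function (os.LookupEnv in production, a fake in tests) and accepts only
// the three pull policies Kubernetes defines.
func pullPolicyFromEnv(lookup func(string) (string, bool)) string {
	value, ok := lookup("IMAGE_PULL_POLICY")
	if !ok {
		return defaultPullPolicy
	}
	switch value {
	case "Always", "IfNotPresent", "Never":
		return value
	default:
		// Unknown values fall back rather than producing an invalid pod spec.
		return defaultPullPolicy
	}
}

func main() {
	fmt.Println(pullPolicyFromEnv(os.LookupEnv))
}
```

Passing the lookup function in keeps the helper testable without mutating the process environment.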
eliranw requested a review from a team as a code owner April 23, 2026 13:28
eliranw added 2 commits April 27, 2026 09:27
Drop the runtime Helm-managed path for the GPU Operator (helm.go,
HelmManager interface, BuildGpuOperatorValues, CollectMockPools, and
the reconciler.Sync wiring) along with the cross-pool coexistence E2E
that exercised it. Mock-pool support will be delivered in Phase 5
(RUN-38195) via the GPU Operator as a Helm subchart dependency, not
runtime install/upgrade from inside status-updater. The subchart
approach avoids pulling the Helm SDK into the controller binary,
keeps RBAC narrow, and matches standard Helm dependency patterns.

The component controller's fake-pool reconciliation and BackendMock
constant remain — only the GPU Operator runtime-Helm scaffolding is
removed.