feat: Switch (NVSwitch) state machine by vinodchitraliNVIDIA · Pull Request #624 · NVIDIA/ncx-infra-controller-core

vinodchitraliNVIDIA · 2026-03-18T20:45:38Z

Adds a state machine for switches: creation → configuration → validation → BOM validation → ready, with optional reprovisioning and deletion.

States and flow

Initializing – Switch created in Carbide; controller does initial setup.
Configuring – Single sub-state RotateOsPassword; then move to Validating.
Validating – Sub-state ValidateComplete; then move to BomValidating.
BomValidating – Sub-state BomValidateComplete; then move to Ready.
Ready – Switch usable; can be deleted or sent to ReProvisioning.
ReProvisioning – Sub-states Start → WaitFirmwareUpdateCompletion; completion via firmware_upgrade_status (Completed → Ready, Failed → Error).
Error – Can transition to Deleting if marked for deletion.
Deleting – Removal; terminal state.
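
The flow above can be sketched as a pure transition function. This is a minimal illustration, not the handler code from this PR: `Inputs`, `next_state`, and the `cause` field on `Error` are hypothetical names, and the single sub-states (`RotateOsPassword`, `ValidateComplete`, `BomValidateComplete`) are folded into their parent states for brevity.

```rust
/// Mirrors the `firmware_upgrade_status` values named in the description.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FirmwareUpgradeStatus {
    InProgress,
    Completed,
    Failed { cause: String },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReProvisioningState {
    Start,
    WaitFirmwareUpdateCompletion,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SwitchState {
    Initializing,
    Configuring,   // single sub-state: RotateOsPassword
    Validating,    // single sub-state: ValidateComplete
    BomValidating, // single sub-state: BomValidateComplete
    Ready,
    ReProvisioning(ReProvisioningState),
    Error { cause: String }, // `cause` is an assumption for this sketch
    Deleting,
}

/// Hypothetical bundle of the signals the transitions depend on.
#[derive(Debug, Clone, Default)]
pub struct Inputs {
    pub deleted: bool,
    pub reprovision_requested: bool,
    pub firmware_upgrade_status: Option<FirmwareUpgradeStatus>,
}

/// Returns the state after one controller iteration.
pub fn next_state(state: &SwitchState, inputs: &Inputs) -> SwitchState {
    use SwitchState::*;
    match state {
        Initializing => Configuring,
        Configuring => Validating,
        Validating => BomValidating,
        BomValidating => Ready,
        Ready if inputs.deleted => Deleting,
        Ready if inputs.reprovision_requested => ReProvisioning(ReProvisioningState::Start),
        Ready => Ready,
        ReProvisioning(ReProvisioningState::Start) => {
            ReProvisioning(ReProvisioningState::WaitFirmwareUpdateCompletion)
        }
        ReProvisioning(ReProvisioningState::WaitFirmwareUpdateCompletion) => {
            match &inputs.firmware_upgrade_status {
                Some(FirmwareUpgradeStatus::Completed) => Ready,
                Some(FirmwareUpgradeStatus::Failed { cause }) => Error { cause: cause.clone() },
                // Still in progress or not reported yet: keep waiting.
                _ => state.clone(),
            }
        }
        Error { .. } if inputs.deleted => Deleting,
        Error { .. } => state.clone(),
        Deleting => Deleting, // terminal
    }
}
```

Keeping the transition logic free of I/O like this makes each step easy to check in isolation; the real handlers also perform the work of each state, which the sketch leaves out.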

Description

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

github-actions · 2026-03-18T20:48:14Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-18 20:48:13 UTC | Commit: a9b5e23

github-actions · 2026-03-18T20:48:22Z

🛡️ Vulnerability Scan

🚨 Found 72 vulnerability(ies)
📊 vs main: 72 (no change)

Severity Breakdown:

🔴 Critical/High: 72
🟡 Medium: 0
🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-18 20:48:21 UTC | Commit: a9b5e23

Introduce a state machine for switches: creation → configuration → validation → BOM validation → ready, with optional reprovisioning and deletion.

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>

ajf · 2026-03-19T16:33:50Z

book/src/architecture/state_machines/switch.md

+| **BomValidating** | BOM (Bill of Materials) validation. Sub-state: `BomValidateComplete`. |
+| **Ready** | Switch is ready for use. From here it can be deleted, or reprovisioning can be requested. |
+| **ReProvisioning** | Reprovisioning (e.g. firmware update) in progress. Sub-states: `Start`, `WaitFirmwareUpdateCompletion`. Completion is driven by `firmware_upgrade_status` (Completed → Ready, Failed → Error). |
+| **Error** | Switch is in error (e.g. firmware upgrade failed). Can transition to Deleting if marked for deletion; otherwise waits for manual intervention. |


And then what? What is manual intervention and which state does it go to (i.e. how do you get out of the error state)?

agreed. We need to provide the tooling to fix the problem. We can't expect anyone to manually scp something to switches or something like that.

The bare minimum would probably be a mechanism to start another installation of the switch OS - which then however requires another NMX-C cluster configuration. The tricky part here seems to be that the switch installation is done via rack state machine.

A failed switch state will also be reflected in the rack state. Once the switch issue is fixed, an on-demand rack firmware update will be triggered. This makes sure the delta trays move to the ready state, and hence the rack does too.

ajf · 2026-03-19T16:34:18Z

book/src/architecture/state_machines/switch.md

+|-------|-------------|
+| **Initializing** | Switch is created in Carbide; controller performs initial setup. |
+| **Configuring** | Switch is being configured (rotate OS password). Sub-state: `RotateOsPassword`. |
+| **Validating** | Switch is being validated. Sub-state: `ValidateComplete`. |


Be more specific about what's being validated.

@narasimhan321 is working on a set of test cases. Will add them shortly.

ajf · 2026-03-19T16:35:10Z

book/src/architecture/state_machines/switch.md

+| ReProvisioning (Start) | ReProvisioning (WaitFirmwareUpdateCompletion) | Reprovision triggered |
+| ReProvisioning (WaitFirmwareUpdateCompletion) | Ready | `firmware_upgrade_status == Completed` |
+| ReProvisioning (WaitFirmwareUpdateCompletion) | Error | `firmware_upgrade_status == Failed { cause }` |
+| Error | Deleting | `deleted` set (marked for deletion) |


You should be able to get out of Error without deleting the switch.

Noted, will change.
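
One possible shape for a non-destructive exit from Error, sketched under assumptions: the `reprovision_requested` flag and the choice of ReProvisioning as the target state are hypothetical and only illustrate the review comment. They are not what this PR implements.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum State {
    Ready,
    ReProvisioning,
    Error { cause: String },
    Deleting,
}

/// Hypothetical operator-controlled flags on the switch.
pub struct Flags {
    pub deleted: bool,
    pub reprovision_requested: bool,
}

/// Decides where a switch in Error goes next; other states are returned unchanged.
pub fn leave_error(state: &State, flags: &Flags) -> State {
    match state {
        // Deletion takes priority over recovery.
        State::Error { .. } if flags.deleted => State::Deleting,
        // Assumed recovery path: a new reprovisioning request retries the update.
        State::Error { .. } if flags.reprovision_requested => State::ReProvisioning,
        other => other.clone(),
    }
}
```

With neither flag set, the switch stays in Error, which matches the "waits for manual intervention" wording in the doc table.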

ajf · 2026-03-19T16:36:14Z

crates/admin-cli/src/expected_switch/update/args.rs

 "switch_serial_number",
+"nvos_mac_address",
+"nvos_username",
+"nvos_password",


Is this safe to keep in PG unencrypted? (i.e. are they well known defaults?)

They are well-known defaults, similar to the BMC creds. nvos_password will be rotated during ingestion.

ajf · 2026-03-19T16:37:04Z

crates/api-db/migrations/20260316120002_expected_switches_nvos_mac_address.sql

@@ -0,0 +1,3 @@
+-- Add nvos_mac_address column to expected_switches table (NVOS host MAC, similar to bmc_mac_address).
+ALTER TABLE expected_switches
+    ADD COLUMN nvos_mac_address macaddr;


Why do we need both nvos_mac_address and bmc_mac_address? You really only need expected_switches to know how to log in to its BMC.

There is a bug in the NVSwitch Redfish implementation: the out-of-band Redfish data has no information about the cabled eth port on NVOS. Without the NVOS eth0/1 MAC address we can't associate the explored switch, and furthermore we can't upgrade firmware.

And we can't install anything on NVOS to collect the mapping info.

should we then add both addresses into expected switches?
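
To make the association problem concrete, here is a minimal sketch of matching an explored switch to an expected switch by NVOS MAC. Everything in it is assumed for illustration: the `ExpectedSwitch` fields, the use of plain strings for MACs, and the helper names. It stores a list of NVOS MACs so that either management port can match.

```rust
/// Hypothetical, simplified view of a row in expected_switches.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExpectedSwitch {
    pub serial_number: String,
    pub bmc_mac: String,
    pub nvos_macs: Vec<String>, // one entry per management port
}

/// Lowercases and normalizes separators so "AA-BB-.." and "aa:bb:.." compare equal.
pub fn normalize_mac(mac: &str) -> String {
    mac.trim().to_ascii_lowercase().replace('-', ":")
}

/// Finds the expected switch that owns any of the MACs seen during exploration.
pub fn find_expected_switch<'a>(
    expected: &'a [ExpectedSwitch],
    explored_nvos_macs: &[String],
) -> Option<&'a ExpectedSwitch> {
    expected.iter().find(|sw| {
        sw.nvos_macs.iter().any(|known| {
            explored_nvos_macs
                .iter()
                .any(|seen| normalize_mac(seen) == normalize_mac(known))
        })
    })
}
```

Matching on any of the stored MACs means it does not matter which of the two management ports happens to be cabled.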

ajf · 2026-03-19T16:40:26Z

crates/api/src/site_explorer/mod.rs

            .await
            .map_err(|e| DatabaseError::new("end find_all_preingestion_complete data", e))?;
+        let mut managed_switches = Vec::new();
+        for ep in explored_endpoints.into_iter() {


The .filter_map() was better. But I just hate for loops.
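
For reference, a small sketch of the two styles side by side. The types and the `to_managed_switch` helper are hypothetical stand-ins, not the definitions in site_explorer; the point is only that both functions produce the same result.

```rust
/// Hypothetical, simplified endpoint record.
#[derive(Debug, Clone, PartialEq)]
pub struct ExploredEndpoint {
    pub address: String,
    pub is_switch: bool,
}

#[derive(Debug, PartialEq)]
pub struct ManagedSwitch {
    pub bmc_address: String,
}

/// Converts an endpoint into a managed switch, or None if it is not a switch.
pub fn to_managed_switch(ep: ExploredEndpoint) -> Option<ManagedSwitch> {
    if ep.is_switch {
        Some(ManagedSwitch { bmc_address: ep.address })
    } else {
        None
    }
}

/// Loop style, as in the diff above.
pub fn collect_with_loop(endpoints: Vec<ExploredEndpoint>) -> Vec<ManagedSwitch> {
    let mut managed_switches = Vec::new();
    for ep in endpoints {
        if let Some(sw) = to_managed_switch(ep) {
            managed_switches.push(sw);
        }
    }
    managed_switches
}

/// Iterator style: filter and convert in a single pass, no mutable accumulator.
pub fn collect_with_filter_map(endpoints: Vec<ExploredEndpoint>) -> Vec<ManagedSwitch> {
    endpoints.into_iter().filter_map(to_managed_switch).collect()
}
```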

ajf · 2026-03-19T16:40:47Z

crates/api/src/site_explorer/switch_creator.rs

+    // ) -> CarbideResult<()> {
+    //     //TODO Add this later when and if required
+    //     Ok(())
+    // }


Delete the commented-out code.

ajf · 2026-03-19T16:41:52Z

crates/api/src/state_controller/switch/deleting.rs

+    _state: &mut Switch,
+    ctx: &mut StateHandlerContext<'_, SwitchStateHandlerContextObjects>,
+) -> Result<StateHandlerOutcome<SwitchControllerState>, StateHandlerError> {
+    tracing::info!("Deleting Switch");


Make sure this log message actually says which switch is getting deleted (and what caused it). The tracing macro may annotate the log properly, but I can't tell from here.
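
A minimal sketch of what a more informative message could carry. The helper name and the `cause` parameter are hypothetical; the sketch uses only the standard library so it stands alone, and the comment shows how the same data could be attached as structured fields with the tracing macro.

```rust
/// Hypothetical helper: builds the message a handler could log when deleting a switch.
pub fn deleting_log_line(switch_id: &str, cause: &str) -> String {
    // With the tracing crate, the same data could be recorded as structured fields:
    //   tracing::info!(switch_id = %switch_id, cause = %cause, "Deleting switch");
    format!("Deleting switch: switch_id={switch_id} cause={cause}")
}
```

If the state controller already opens a span that carries the switch ID, the extra field may be redundant; that is worth confirming before changing every call site.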

ajf · 2026-03-19T16:42:10Z

crates/api/src/state_controller/switch/error_state.rs

+    state: &mut Switch,
+    _ctx: &mut StateHandlerContext<'_, SwitchStateHandlerContextObjects>,
+) -> Result<StateHandlerOutcome<SwitchControllerState>, StateHandlerError> {
+    tracing::info!("Switch is in error state");


Same comment about more informational log entries (and anywhere else).

ajf · 2026-03-19T16:43:18Z

crates/api/src/state_controller/switch/ready.rs

+        ));
+    }
+
+    if is_switch_reprovisioning_requested(state) {


What system is responsible for knowing that there are no users using any of the connected machines?

the rack state machine will perform the actual update, and it would wait until all hosts would move out of Assigned/Ready into a guard state.

So this seems ok

Matthias247 · 2026-03-19T16:50:50Z

crates/rpc/proto/forge.proto

  optional string nvos_password = 8;
  // Unique identifier for the expected switch. When omitted, server generates one.
  optional common.UUID expected_switch_id = 9;
+  optional string nvos_mac_address = 10;


Why do we need it? Can't it be discovered dynamically? The only reason we would need it is if we wanted to validate the MAC address is the right one.

I also think there's 2 management ports on the switch, so if we go for it, it should be 2.

explained earlier

Matthias247 · 2026-03-19T16:59:49Z

book/src/architecture/state_machines/switch.md

+| **BomValidating** | BOM (Bill of Materials) validation. Sub-state: `BomValidateComplete`. |
+| **Ready** | Switch is ready for use. From here it can be deleted, or reprovisioning can be requested. |
+| **ReProvisioning** | Reprovisioning (e.g. firmware update) in progress. Sub-states: `Start`, `WaitFirmwareUpdateCompletion`. Completion is driven by `firmware_upgrade_status` (Completed → Ready, Failed → Error). |
+| **Error** | Switch is in error (e.g. firmware upgrade failed). Can transition to Deleting if marked for deletion; otherwise waits for manual intervention. |


agreed. We need to provide the tooling to fix the problem. We can't expect anyone to manually scp something to switches or something like that.

The bare minimum would probably be a mechanism to start another installation of the switch OS - which then however requires another NMX-C cluster configuration. The tricky part here seems to be that the switch installation is done via rack state machine.

Matthias247 · 2026-03-19T17:01:35Z

crates/api-model/src/site_explorer/mod.rs

    }
 }

+/// A combination of DPU and host that was discovered via Site Exploration


Needs to be updated

Matthias247 · 2026-03-19T17:01:53Z

crates/api-model/src/site_explorer/mod.rs

+    /// The Switch's BMC IP
+    pub bmc_ip: IpAddr,
+    // Host mac address
+    pub nv_os_mac_addresses: Vec<MacAddress>,


switches have 2 management ports and thereby 2 MACs.
I'm however wondering if we need this field here, or whether it's already part of ExplorationReport

Matthias247 · 2026-03-19T17:05:22Z

crates/api/src/site_explorer/switch_creator.rs

+        if !created {
+            txn.commit()
+                .await
+                .map_err(|e| DatabaseError::new("commit create_managed_switch", e))?;
+            return Ok(false);
+        }
+
+        txn.commit()
+            .await
+            .map_err(|e| DatabaseError::new("commit create_managed_switch", e))?;
+
+        Ok(true)


This seems equivalent to

txn.commit()
    .await
    .map_err(|e| DatabaseError::new("commit create_managed_switch", e))?;
Ok(created)

Matthias247 · 2026-03-19T17:09:56Z

crates/api/src/site_explorer/switch_creator.rs

+        let existing_switch = db::switch::find_by_id(txn, &switch_id).await?;
+
+        if let Some(_existing_switch) = existing_switch {
+            //Possibly multiple eth ports are connected?


That seems to be desirable. The log makes it sound like a problem.

In any case, it seems like the check is somewhat duplicated. Just having either the MAC address or switch ID check seems good enough. Switch ID seems best to me.

I see that.

This code will never be hit.

Matthias247 · 2026-03-19T17:11:25Z

crates/api/src/site_explorer/switch_creator.rs

+            name,
+            enable_nmxc: false,
+            fabric_manager_config: None,
+            location: Some("US/CA/DC/San Jose/1000 N Mathilda Ave".to_string()),


that should be Metadata on the switch?

Location will come from rack placement. Keeping the code migrated from mods.rs.

Matthias247 · 2026-03-19T17:11:58Z

crates/api/src/site_explorer/switch_creator.rs

+        };
+        let config = model::switch::SwitchConfig {
+            name,
+            enable_nmxc: false,


These fields look like something that is set in carbide configuration and could be changed at runtime (as opposed to ingestion time)?

Will check. Just wondering if we want to have per-rack enable/disable of nmc.

we might need it for migrating some existing deployments. But maybe for these we also just wouldn't have Rack + Switch support enabled?

Matthias247 · 2026-03-19T17:14:16Z

crates/api/src/state_controller/switch/ready.rs

+        ));
+    }
+
+    if is_switch_reprovisioning_requested(state) {


the rack state machine will perform the actual update, and it would wait until all hosts would move out of Assigned/Ready into a guard state.

So this seems ok

Matthias247 · 2026-03-19T17:16:17Z

crates/api/src/tests/switch_state_controller/mod.rs

+        );
+    }
+
+    let switch_handler = Arc::new(SwitchStateHandler::default());


Please use the run_single_iteration() flows that are used in all other places, e.g. for host tests. In these we can check things step by step much more easily since there are no timing constraints. And you also won't need to construct all these handlers and state controllers another time. They are already part of TestEnv.

vinodchitraliNVIDIA force-pushed the vc/switch branch from a9b5e23 to 333cc28 Compare March 19, 2026 16:32

vinodchitraliNVIDIA requested a review from a team as a code owner March 19, 2026 16:32

ajf reviewed Mar 19, 2026

ajf requested review from Matthias247 and chet March 19, 2026 16:44

Matthias247 reviewed Mar 19, 2026
