Skip to content

feat: Switch (NVSwitch) state machine#624

Open
vinodchitraliNVIDIA wants to merge 1 commit intoNVIDIA:mainfrom
vinodchitraliNVIDIA:vc/switch
Open

feat: Switch (NVSwitch) state machine#624
vinodchitraliNVIDIA wants to merge 1 commit intoNVIDIA:mainfrom
vinodchitraliNVIDIA:vc/switch

Conversation

@vinodchitraliNVIDIA
Copy link

@vinodchitraliNVIDIA vinodchitraliNVIDIA commented Mar 18, 2026

A state machine for switch : creation → configuration → validation → BOM validation → ready, with optional reprovisioning and deletion.

States and flow

  • Initializing – Switch created in Carbide; controller does initial setup.
  • Configuring – Single sub-state RotateOsPassword; then move to Validating.
  • Validating – Sub-state ValidateComplete; then move to BomValidating.
  • BomValidating – Sub-state BomValidateComplete; then move to Ready.
  • Ready – Switch usable; can be deleted or sent to ReProvisioning.
  • ReProvisioning – Sub-states Start → WaitFirmwareUpdateCompletion; completion via firmware_upgrade_status (Completed → Ready, Failed → Error).
  • Error – Can transition to Deleting if marked for deletion.
  • Deleting – Removal; terminal state.

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@github-actions
Copy link

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-18 20:48:13 UTC | Commit: a9b5e23

@github-actions
Copy link

🛡️ Vulnerability Scan

🚨 Found 72 vulnerability(ies)
📊 vs main: 72 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 72
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-18 20:48:21 UTC | Commit: a9b5e23

Introduce a state machine for switch : creation → configuration → validation → BOM validation → ready, with optional reprovisioning and deletion.

States and flow

Initializing – Switch created in Carbide; controller does initial setup.
Configuring – Single sub-state RotateOsPassword; then move to Validating.
Validating – Sub-state ValidateComplete; then move to BomValidating.
BomValidating – Sub-state BomValidateComplete; then move to Ready.
Ready – Switch usable; can be deleted or sent to ReProvisioning.
ReProvisioning – Sub-states Start → WaitFirmwareUpdateCompletion; completion via firmware_upgrade_status (Completed → Ready, Failed → Error).
Error – Can transition to Deleting if marked for deletion.
Deleting – Removal; terminal state.

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
| **BomValidating** | BOM (Bill of Materials) validation. Sub-state: `BomValidateComplete`. |
| **Ready** | Switch is ready for use. From here it can be deleted, or reprovisioning can be requested. |
| **ReProvisioning** | Reprovisioning (e.g. firmware update) in progress. Sub-states: `Start`, `WaitFirmwareUpdateCompletion`. Completion is driven by `firmware_upgrade_status` (Completed → Ready, Failed → Error). |
| **Error** | Switch is in error (e.g. firmware upgrade failed). Can transition to Deleting if marked for deletion; otherwise waits for manual intervention. |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then what? What is manual intervention and which state does it go to (i.e. how do you get out of the error state)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. We need to provide the tooling to fix the problem. We can't expect anyone to manually scp something to switches or something like that.

The bare minimum would probably be a mechanism to start another installation of the switch OS - which then however requires another NMX-C cluster configuration. The tricky part here seems to be that the switch installation is done via rack state machine.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed switch state will reflect in Rack sate also. Once switch issue is fixed, on-demand rack firmware update will be triggered. This will make sure delta trays moves to ready state and hence rack too

|-------|-------------|
| **Initializing** | Switch is created in Carbide; controller performs initial setup. |
| **Configuring** | Switch is being configured (rotate OS password). Sub-state: `RotateOsPassword`. |
| **Validating** | Switch is being validated. Sub-state: `ValidateComplete`. |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be more specifically what's being validated.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@narasimhan321 is working on set of test case. Will add them shortly

| ReProvisioning (Start) | ReProvisioning (WaitFirmwareUpdateCompletion) | Reprovision triggered |
| ReProvisioning (WaitFirmwareUpdateCompletion) | Ready | `firmware_upgrade_status == Completed` |
| ReProvisioning (WaitFirmwareUpdateCompletion) | Error | `firmware_upgrade_status == Failed { cause }` |
| Error | Deleting | `deleted` set (marked for deletion) |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to get out of Error without deleting the switch.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted .. will change

"switch_serial_number",
"nvos_mac_address",
"nvos_username",
"nvos_password",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe to keep in PG unencrypted? (i.e. are they well known defaults?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are well know default simillar to BMC creds. nvos_password will be rotated during the ingestion

@@ -0,0 +1,3 @@
-- Add nvos_mac_address column to expected_switches table (NVOS host MAC, similar to bmc_mac_address).
ALTER TABLE expected_switches
ADD COLUMN nvos_mac_address macaddr;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need both nvos_mac_address and bmc_mac_address. You really only need expected_switches to know how to login to it's BMC.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is bug in nv switch redfish. In redfish oob there is no info available on cabled eth port on nvos. Without nvos eth0/1 mac address we cant assoicate explored Switch. Further more we cant upgrade firmware

And we cant install anything on nvos to collect mapping info

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we then add both addresses into expected switches?

.await
.map_err(|e| DatabaseError::new("end find_all_preingestion_complete data", e))?;
let mut managed_switches = Vec::new();
for ep in explored_endpoints.into_iter() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .filter_map() was better. But I just hate for loops.

// ) -> CarbideResult<()> {
// //TODO Add this later when and if required
// Ok(())
// }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete commented Code.

_state: &mut Switch,
ctx: &mut StateHandlerContext<'_, SwitchStateHandlerContextObjects>,
) -> Result<StateHandlerOutcome<SwitchControllerState>, StateHandlerError> {
tracing::info!("Deleting Switch");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure this log message actually says which switch is getting deleted (and what caused it). The tracing macro may annotate the log properly, but I can't tell from here.

state: &mut Switch,
_ctx: &mut StateHandlerContext<'_, SwitchStateHandlerContextObjects>,
) -> Result<StateHandlerOutcome<SwitchControllerState>, StateHandlerError> {
tracing::info!("Switch is in error state");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about more informational log entries (and anywhere else).

));
}

if is_switch_reprovisioning_requested(state) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What system is responsible for knowing that there's no users using any of the connected machines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rack state machine will perform the actual update, and it would wait until all hosts would move out of Assigned/Ready into a guard state.

So this seems ok

@ajf ajf requested review from Matthias247 and chet March 19, 2026 16:44
optional string nvos_password = 8;
// Unique identifier for the expected switch. When omitted, server generates one.
optional common.UUID expected_switch_id = 9;
optional string nvos_mac_address = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need it? Can't it be discovered dynamically? The only reason we would need it is if we wanted to validate the MAC address is the right one.

I also think there's 2 management ports on the switch, so if we go for it, it should be 2.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explained earlier

| **BomValidating** | BOM (Bill of Materials) validation. Sub-state: `BomValidateComplete`. |
| **Ready** | Switch is ready for use. From here it can be deleted, or reprovisioning can be requested. |
| **ReProvisioning** | Reprovisioning (e.g. firmware update) in progress. Sub-states: `Start`, `WaitFirmwareUpdateCompletion`. Completion is driven by `firmware_upgrade_status` (Completed → Ready, Failed → Error). |
| **Error** | Switch is in error (e.g. firmware upgrade failed). Can transition to Deleting if marked for deletion; otherwise waits for manual intervention. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. We need to provide the tooling to fix the problem. We can't expect anyone to manually scp something to switches or something like that.

The bare minimum would probably be a mechanism to start another installation of the switch OS - which then however requires another NMX-C cluster configuration. The tricky part here seems to be that the switch installation is done via rack state machine.

}
}

/// A combination of DPU and host that was discovered via Site Exploration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated

/// The Switch's BMC IP
pub bmc_ip: IpAddr,
// Host mac address
pub nv_os_mac_addresses: Vec<MacAddress>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switches have 2 management ports and thereby 2 MACs.
I'm however wondering if we need this field here, or whether it's already part of ExplorationReport

Comment on lines +101 to +112
if !created {
txn.commit()
.await
.map_err(|e| DatabaseError::new("commit create_managed_switch", e))?;
return Ok(false);
}

txn.commit()
.await
.map_err(|e| DatabaseError::new("commit create_managed_switch", e))?;

Ok(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems equivalent to

txn.commit()
    .await
    .map_err(|e| DatabaseError::new("commit create_managed_switch", e))?;
Ok(created)

let existing_switch = db::switch::find_by_id(txn, &switch_id).await?;

if let Some(_existing_switch) = existing_switch {
//Possibly multiple eth ports are connected?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems to be desirable. The log makes it sounds like a problem.

In any case, it seems like the check is somewhat duplicated. Just having either the MAC address or switch ID check seems good enough. Switch ID seems best to me.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see that

code will never hit

name,
enable_nmxc: false,
fabric_manager_config: None,
location: Some("US/CA/DC/San Jose/1000 N Mathilda Ave".to_string()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that should be Metadata on the switch?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

location to come from rack placement. Keeping migrated code from mods.rs

};
let config = model::switch::SwitchConfig {
name,
enable_nmxc: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields look like something that is set in carbide configuration and could be changed at runtime (opposed to ingestion time)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will check. Just wodering if we want to have per rack enable/disable nmc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might need it for migrating some existing deployments. But maybe for these we also just wouldn't have Rack + Switch support enabled?

));
}

if is_switch_reprovisioning_requested(state) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rack state machine will perform the actual update, and it would wait until all hosts would move out of Assigned/Ready into a guard state.

So this seems ok

);
}

let switch_handler = Arc::new(SwitchStateHandler::default());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use the run_single_iteration() flows that are used in all other places - e..g. for host tests. In these we can check things step by step much easier since there's no timing constraints. And you also won't need to construct all these handlers and state controllers another time. They are already part of TestEnv.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants