feat: Switch (NVSwitch) state machine #624
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| # Switch State Diagram | ||
|
|
||
| This document describes the Finite State Machine (FSM) for Switches in Carbide: lifecycle from creation through configuration, validation, ready, optional reprovisioning, and deletion. | ||
|
|
||
| ## High-Level Overview | ||
|
|
||
| The main flow shows the primary states and transitions: | ||
|
|
||
| <div style="width: 180%; background: white; margin-left: -40%;"> | ||
|
|
||
| ```plantuml | ||
| @startuml | ||
| skinparam state { | ||
| BackgroundColor White | ||
| } | ||
|
|
||
| state "Initializing" as Initializing | ||
| state "Configuring\n(RotateOsPassword)" as Configuring | ||
| state "Validating" as Validating | ||
| state "BomValidating" as BomValidating | ||
| state "Ready" as Ready | ||
| state "ReProvisioning\n(Start → WaitFirmware)" as ReProvisioning | ||
| state "Error" as Error | ||
| state "Deleting" as Deleting | ||
|
|
||
| [*] --> Initializing : Switch created | ||
|
|
||
| Initializing --> Configuring : init complete | ||
| Configuring --> Validating : rotate password done | ||
| Validating --> BomValidating : validation complete | ||
| BomValidating --> Ready : BOM validation complete | ||
|
|
||
| Ready --> Deleting : marked for deletion | ||
| Ready --> ReProvisioning : reprovision requested | ||
|
|
||
| ReProvisioning --> Ready : firmware upgrade Completed | ||
| ReProvisioning --> Error : firmware upgrade Failed | ||
|
|
||
| Error --> Deleting : marked for deletion | ||
|
|
||
| Deleting --> [*] : final delete | ||
| @enduml | ||
| ``` | ||
|
|
||
| </div> | ||
|
|
||
| ## States | ||
|
|
||
| | State | Description | | ||
| |-------|-------------| | ||
| | **Initializing** | Switch is created in Carbide; controller performs initial setup. | | ||
| | **Configuring** | Switch is being configured (rotate OS password). Sub-state: `RotateOsPassword`. | | ||
| | **Validating** | Switch is being validated. Sub-state: `ValidateComplete`. | | ||
| | **BomValidating** | BOM (Bill of Materials) validation. Sub-state: `BomValidateComplete`. | | ||
| | **Ready** | Switch is ready for use. From here it can be deleted, or reprovisioning can be requested. | | ||
| | **ReProvisioning** | Reprovisioning (e.g. firmware update) in progress. Sub-states: `Start`, `WaitFirmwareUpdateCompletion`. Completion is driven by `firmware_upgrade_status` (Completed → Ready, Failed → Error). | | ||
| | **Error** | Switch is in error (e.g. firmware upgrade failed). Can transition to Deleting if marked for deletion; otherwise waits for manual intervention. | | ||
|
Collaborator
And then what? What is manual intervention, and which state does the switch go to afterwards (i.e. how do you get out of the error state)?
Contributor
Agreed. We need to provide the tooling to fix the problem. We can't expect anyone to manually scp something to switches or anything like that. The bare minimum would probably be a mechanism to start another installation of the switch OS, which, however, then requires another NMX-C cluster configuration. The tricky part here seems to be that the switch installation is done via the rack state machine.
Author
The failed switch state will also be reflected in the Rack state. Once the switch issue is fixed, |
||
| | **Deleting** | Switch is being removed; ends in final delete (terminal). | | ||
|
|
||
| ## Transitions (by trigger) | ||
|
|
||
| | From | To | Trigger / Condition | | ||
| |------|-----|----------------------| | ||
| | *(create)* | Initializing | Switch created | | ||
| | Initializing | Configuring (RotateOsPassword) | Initialization complete | | ||
| | Configuring (RotateOsPassword) | Validating (ValidateComplete) | OS password rotated | | ||
| | Validating (ValidateComplete) | BomValidating (BomValidateComplete) | Validation complete | | ||
| | BomValidating (BomValidateComplete) | Ready | BOM validation complete | | ||
| | Ready | Deleting | `deleted` set (marked for deletion) | | ||
| | Ready | ReProvisioning (Start) | `switch_reprovisioning_requested` is set | | ||
| | ReProvisioning (Start) | ReProvisioning (WaitFirmwareUpdateCompletion) | Reprovision triggered | | ||
| | ReProvisioning (WaitFirmwareUpdateCompletion) | Ready | `firmware_upgrade_status == Completed` | | ||
| | ReProvisioning (WaitFirmwareUpdateCompletion) | Error | `firmware_upgrade_status == Failed { cause }` | | ||
| | Error | Deleting | `deleted` set (marked for deletion) | | ||
|
Collaborator
You should be able to get out of Error without deleting the switch.
Author
Noted, will change. |
||
| | Deleting | *(end)* | Final delete committed | | ||
|
|
||
| ## Implementation | ||
|
|
||
| - **State type**: `SwitchControllerState` in `crates/api-model/src/switch/mod.rs`. | ||
| - **Handlers**: `crates/api/src/state_controller/switch/` — one module per top-level state (`initializing`, `configuring`, `validating`, `bom_validating`, `ready`, `reprovisioning`, `error_state`, `deleting`). | ||
| - **Orchestration**: `SwitchStateHandler` in `handler.rs` delegates to the handler for the current `controller_state`. | ||
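The transition table in the document above can be read as a single step function. The following is a minimal sketch under stated assumptions, not the PR's implementation: `SwitchState`, `SwitchInputs`, `step_complete`, and `next_state` are hypothetical names introduced for illustration, and the real `SwitchControllerState` may be shaped differently. The precedence of `deleted` over a reprovision request in `Ready` is also an assumption; the document does not specify it.

```rust
// Hypothetical mirror of the documented states (illustration only).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum ReProvisioningStep {
    Start,
    WaitFirmwareUpdateCompletion,
}

// Status values listed in the migration comment: Started, InProgress, Completed, Failed.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum FirmwareUpgradeStatus {
    Started,
    InProgress,
    Completed,
    Failed { cause: String },
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum SwitchState {
    Initializing,
    Configuring,   // sub-state: RotateOsPassword
    Validating,    // sub-state: ValidateComplete
    BomValidating, // sub-state: BomValidateComplete
    Ready,
    ReProvisioning(ReProvisioningStep),
    Error { cause: String },
    Deleting,
}

// Inputs a handler would read from the switch record (field names assumed).
#[derive(Debug, Clone, Default)]
struct SwitchInputs {
    step_complete: bool, // the current provisioning step has finished
    deleted: bool,
    reprovisioning_requested: bool,
    firmware_upgrade_status: Option<FirmwareUpgradeStatus>,
}

// Returns the next state; if no documented transition applies, the state is unchanged.
fn next_state(state: &SwitchState, inputs: &SwitchInputs) -> SwitchState {
    use ReProvisioningStep::*;
    use SwitchState::*;
    match state {
        Initializing if inputs.step_complete => Configuring,
        Configuring if inputs.step_complete => Validating,
        Validating if inputs.step_complete => BomValidating,
        BomValidating if inputs.step_complete => Ready,
        // Assumption: deletion wins over a pending reprovision request.
        Ready if inputs.deleted => Deleting,
        Ready if inputs.reprovisioning_requested => ReProvisioning(Start),
        ReProvisioning(Start) => ReProvisioning(WaitFirmwareUpdateCompletion),
        ReProvisioning(WaitFirmwareUpdateCompletion) => match &inputs.firmware_upgrade_status {
            Some(FirmwareUpgradeStatus::Completed) => Ready,
            Some(FirmwareUpgradeStatus::Failed { cause }) => Error { cause: cause.clone() },
            _ => state.clone(), // Started, InProgress, or not yet set: keep waiting
        },
        // As documented, Error only leaves via deletion.
        Error { .. } if inputs.deleted => Deleting,
        other => other.clone(),
    }
}
```

Writing the table this way makes the gap raised in review visible: the `Error` arm has exactly one exit, so any recovery path would need a new arm and a new trigger.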
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -27,6 +27,9 @@ use uuid::Uuid; | |
| "bmc_username", | ||
| "bmc_password", | ||
| "switch_serial_number", | ||
| "nvos_mac_address", | ||
| "nvos_username", | ||
| "nvos_password", | ||
|
Collaborator
Is this safe to keep in PG unencrypted? (i.e. are they well-known defaults?)
Author
They are well-known defaults, similar to the BMC credentials. |
||
| ])))] | ||
| pub struct Args { | ||
| #[clap(short = 'a', long, help = "BMC MAC Address of the expected switch")] | ||
|
|
@@ -59,6 +62,12 @@ pub struct Args { | |
| )] | ||
| pub switch_serial_number: Option<String>, | ||
|
|
||
| #[clap( | ||
| long, | ||
| group = "group", | ||
| help = "NVOS MAC address of the expected switch" | ||
| )] | ||
| pub nvos_mac_address: Option<MacAddress>, | ||
| #[clap(long, group = "group", help = "NVOS username of the expected switch")] | ||
| pub nvos_username: Option<String>, | ||
| #[clap(long, group = "group", help = "NVOS password of the expected switch")] | ||
|
|
@@ -140,6 +149,7 @@ impl TryFrom<Args> for rpc::forge::ExpectedSwitch { | |
| labels: crate::metadata::parse_rpc_labels(args.labels.unwrap_or_default()), | ||
| }), | ||
| rack_id: args.rack_id, | ||
| nvos_mac_address: args.nvos_mac_address.map(|m| m.to_string()), | ||
| }) | ||
| } | ||
| } | ||
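The `nvos_mac_address` line in the hunk above turns an optional typed MAC address into an optional string for the RPC message. A minimal, self-contained sketch of that pattern follows. The `MacAddress` newtype, its uppercase colon-separated formatting, and the trimmed-down `Args` and `ExpectedSwitch` structs are stand-ins invented for illustration; the PR uses its own `MacAddress` type and a `TryFrom` conversion into `rpc::forge::ExpectedSwitch`.

```rust
use std::fmt;

// Hypothetical stand-in for the MAC address type used by the CLI arguments.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MacAddress([u8; 6]);

impl fmt::Display for MacAddress {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let b = self.0;
        // Formatting choice is an assumption; the real type may print differently.
        write!(
            f,
            "{:02X}:{:02X}:{:02X}:{:02X}:{:02X}:{:02X}",
            b[0], b[1], b[2], b[3], b[4], b[5]
        )
    }
}

// Reduced versions of the CLI args and the RPC message (only the new field).
struct Args {
    nvos_mac_address: Option<MacAddress>,
}

struct ExpectedSwitch {
    nvos_mac_address: Option<String>,
}

impl From<Args> for ExpectedSwitch {
    fn from(args: Args) -> Self {
        ExpectedSwitch {
            // `None` stays `None`; `Some(mac)` becomes its string form.
            nvos_mac_address: args.nvos_mac_address.map(|m| m.to_string()),
        }
    }
}
```

Because the field is optional end to end, omitting `--nvos-mac-address` on the command line yields `None` in the message instead of an empty string.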
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| -- Add switch_reprovisioning_requested and firmware_upgrade_status columns to switches table. | ||
| -- switch_reprovisioning_requested: when set by an external entity, the state controller (when switch is Ready) transitions to ReProvisioning::Start. | ||
| -- firmware_upgrade_status: used during ReProvisioning (WaitFirmwareUpdateCompletion): Started, InProgress, Completed, Failed. | ||
| ALTER TABLE switches | ||
| ADD COLUMN switch_reprovisioning_requested JSONB, | ||
| ADD COLUMN firmware_upgrade_status JSONB; |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| -- Add nvos_mac_address column to expected_switches table (NVOS host MAC, similar to bmc_mac_address). | ||
| ALTER TABLE expected_switches | ||
| ADD COLUMN nvos_mac_address macaddr; | ||
|
Collaborator
Why do we need both `nvos_mac_address` and `bmc_mac_address`? You really only need `expected_switches` to know how to log in to its BMC.
Author
There is a bug in the NVSwitch Redfish implementation: out-of-band Redfish exposes no information about the cabled Ethernet port on NVOS. Without the NVOS eth0/eth1 MAC address we can't associate the explored switch. Furthermore, we can't upgrade firmware, and we can't install anything on NVOS to collect the mapping information.
Contributor
Should we then add both addresses to expected switches? |
||
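To illustrate the association problem described in this thread, here is a rough sketch of matching an explored switch to its expected record by NVOS MAC address. Everything in it is hypothetical: the reduced `ExpectedSwitch` struct, `index_by_nvos_mac`, `find_expected`, and the case-insensitive string comparison are assumptions made for the example, not code from the PR.

```rust
use std::collections::HashMap;

// Reduced, hypothetical view of an expected_switches row.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ExpectedSwitch {
    bmc_mac_address: String,
    nvos_mac_address: Option<String>,
}

// Builds a lookup from normalized NVOS MAC to the expected switch.
// Rows without an NVOS MAC cannot be matched this way and are skipped.
fn index_by_nvos_mac(expected: &[ExpectedSwitch]) -> HashMap<String, &ExpectedSwitch> {
    expected
        .iter()
        .filter_map(|s| {
            s.nvos_mac_address
                .as_ref()
                .map(|mac| (mac.to_ascii_lowercase(), s))
        })
        .collect()
}

// Finds the expected switch for a MAC seen on the explored NVOS host, if any.
fn find_expected<'a>(
    index: &HashMap<String, &'a ExpectedSwitch>,
    explored_nvos_mac: &str,
) -> Option<&'a ExpectedSwitch> {
    index.get(&explored_nvos_mac.to_ascii_lowercase()).copied()
}
```

With only `bmc_mac_address` recorded, the index would be empty and an NVOS host discovered on the network could not be tied back to its BMC, which is the limitation the author describes.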
Be more specific about what's being validated.
@narasimhan321 is working on a set of test cases. We will add them shortly.