1 change: 1 addition & 0 deletions book/src/SUMMARY.md
@@ -21,6 +21,7 @@
- [NIC and Port selection](architecture/infiniband/nic_selection.md)
- [State Machines]()
- [Managed Host](architecture/state_machines/managedhost.md)
- [Switch](architecture/state_machines/switch.md)

# Manuals

81 changes: 81 additions & 0 deletions book/src/architecture/state_machines/switch.md
@@ -0,0 +1,81 @@
# Switch State Diagram

This document describes the Finite State Machine (FSM) for Switches in Carbide. It covers the switch lifecycle from creation through configuration, validation, BOM validation, and the Ready state, plus optional reprovisioning and deletion.

## High-Level Overview

The main flow shows the primary states and transitions:

<div style="width: 180%; background: white; margin-left: -40%;">

```plantuml
@startuml
skinparam state {
BackgroundColor White
}

state "Initializing" as Initializing
state "Configuring\n(RotateOsPassword)" as Configuring
state "Validating" as Validating
state "BomValidating" as BomValidating
state "Ready" as Ready
state "ReProvisioning\n(Start → WaitFirmware)" as ReProvisioning
state "Error" as Error
state "Deleting" as Deleting

[*] --> Initializing : Switch created

Initializing --> Configuring : init complete
Configuring --> Validating : rotate password done
Validating --> BomValidating : validation complete
BomValidating --> Ready : BOM validation complete

Ready --> Deleting : marked for deletion
Ready --> ReProvisioning : reprovision requested

ReProvisioning --> Ready : firmware upgrade Completed
ReProvisioning --> Error : firmware upgrade Failed

Error --> Deleting : marked for deletion

Deleting --> [*] : final delete
@enduml
```

</div>

## States

| State | Description |
|-------|-------------|
| **Initializing** | Switch is created in Carbide; controller performs initial setup. |
| **Configuring** | Switch is being configured (rotate OS password). Sub-state: `RotateOsPassword`. |
| **Validating** | Switch is being validated. Sub-state: `ValidateComplete`. |
Collaborator

Be more specific about what's being validated.

@narasimhan321 is working on a set of test cases. Will add them shortly.

| **BomValidating** | BOM (Bill of Materials) validation. Sub-state: `BomValidateComplete`. |
| **Ready** | Switch is ready for use. From here it can be deleted, or reprovisioning can be requested. |
| **ReProvisioning** | Reprovisioning (e.g. firmware update) in progress. Sub-states: `Start`, `WaitFirmwareUpdateCompletion`. Completion is driven by `firmware_upgrade_status` (Completed → Ready, Failed → Error). |
| **Error** | Switch is in error (e.g. firmware upgrade failed). Can transition to Deleting if marked for deletion; otherwise waits for manual intervention. |
Collaborator

And then what? What is manual intervention, and which state does the switch go to afterwards (i.e. how do you get out of the Error state)?

Contributor

Agreed. We need to provide the tooling to fix the problem. We can't expect anyone to manually scp something to switches or anything like that.

The bare minimum would probably be a mechanism to start another installation of the switch OS, which would then require another NMX-C cluster configuration. The tricky part here seems to be that the switch installation is done via the rack state machine.

A failed switch state will be reflected in the Rack state as well. Once the switch issue is fixed, an on-demand rack firmware update will be triggered. This ensures that the delta trays move to the Ready state, and hence the rack too.

| **Deleting** | Switch is being removed; ends in final delete (terminal). |

## Transitions (by trigger)

| From | To | Trigger / Condition |
|------|-----|----------------------|
| *(create)* | Initializing | Switch created |
| Initializing | Configuring (RotateOsPassword) | Initialization complete |
| Configuring (RotateOsPassword) | Validating (ValidateComplete) | OS password rotated |
| Validating (ValidateComplete) | BomValidating (BomValidateComplete) | Validation complete |
| BomValidating (BomValidateComplete) | Ready | BOM validation complete |
| Ready | Deleting | `deleted` set (marked for deletion) |
| Ready | ReProvisioning (Start) | `switch_reprovisioning_requested` is set |
| ReProvisioning (Start) | ReProvisioning (WaitFirmwareUpdateCompletion) | Reprovision triggered |
| ReProvisioning (WaitFirmwareUpdateCompletion) | Ready | `firmware_upgrade_status == Completed` |
| ReProvisioning (WaitFirmwareUpdateCompletion) | Error | `firmware_upgrade_status == Failed { cause }` |
| Error | Deleting | `deleted` set (marked for deletion) |
Collaborator

You should be able to get out of Error without deleting the switch.

Noted, will change.

| Deleting | *(end)* | Final delete committed |
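
The transition table above can be read as a pure function from the current state and the observed inputs to the next state. The following is a minimal sketch under stated assumptions: `SwitchState`, `Observed`, `step_complete`, and `next_state` are hypothetical names invented for illustration, the `FirmwareUpgradeStatus` enum is a simplified stand-in for the real model type, and giving deletion priority over reprovisioning in `Ready` is an assumption that the table does not specify.

```rust
/// Sub-states of ReProvisioning, as listed in the States table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReProvisioningState {
    Start,
    WaitFirmwareUpdateCompletion,
}

/// Simplified stand-in for the real firmware upgrade status type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FirmwareUpgradeStatus {
    Started,
    InProgress,
    Completed,
    Failed { cause: String },
}

/// Hypothetical, flattened view of the top-level switch states.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SwitchState {
    Initializing,
    Configuring,   // sub-state: RotateOsPassword
    Validating,    // sub-state: ValidateComplete
    BomValidating, // sub-state: BomValidateComplete
    Ready,
    ReProvisioning(ReProvisioningState),
    Error { cause: String },
    Deleting,
}

/// Inputs the controller observes on one iteration (assumed field names).
#[derive(Debug, Clone, Default)]
pub struct Observed {
    /// The work of the current linear step (init, rotate, validate, BOM) is done.
    pub step_complete: bool,
    /// The switch is marked for deletion.
    pub deleted: bool,
    /// `switch_reprovisioning_requested` is set.
    pub reprovisioning_requested: bool,
    pub firmware_upgrade_status: Option<FirmwareUpgradeStatus>,
}

/// Returns the next state, or `None` when the switch stays where it is.
pub fn next_state(state: &SwitchState, obs: &Observed) -> Option<SwitchState> {
    use ReProvisioningState as R;
    use SwitchState as S;
    match state {
        S::Initializing if obs.step_complete => Some(S::Configuring),
        S::Configuring if obs.step_complete => Some(S::Validating),
        S::Validating if obs.step_complete => Some(S::BomValidating),
        S::BomValidating if obs.step_complete => Some(S::Ready),
        // Assumption: deletion is checked before a reprovisioning request.
        S::Ready if obs.deleted => Some(S::Deleting),
        S::Ready if obs.reprovisioning_requested => Some(S::ReProvisioning(R::Start)),
        S::ReProvisioning(R::Start) => {
            Some(S::ReProvisioning(R::WaitFirmwareUpdateCompletion))
        }
        S::ReProvisioning(R::WaitFirmwareUpdateCompletion) => {
            match &obs.firmware_upgrade_status {
                Some(FirmwareUpgradeStatus::Completed) => Some(S::Ready),
                Some(FirmwareUpgradeStatus::Failed { cause }) => {
                    Some(S::Error { cause: cause.clone() })
                }
                // Started, InProgress, or no status yet: keep waiting.
                _ => None,
            }
        }
        S::Error { .. } if obs.deleted => Some(S::Deleting),
        // Deleting ends with the final delete; all other cases wait.
        _ => None,
    }
}
```

In this sketch `Error` has a single exit, to `Deleting`, which mirrors the table as written; a recovery path out of `Error` would be one more match arm.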

## Implementation

- **State type**: `SwitchControllerState` in `crates/api-model/src/switch/mod.rs`.
- **Handlers**: `crates/api/src/state_controller/switch/` — one module per top-level state (`initializing`, `configuring`, `validating`, `bom_validating`, `ready`, `reprovisioning`, `error_state`, `deleting`).
- **Orchestration**: `SwitchStateHandler` in `handler.rs` delegates to the handler for the current `controller_state`.
3 changes: 3 additions & 0 deletions crates/admin-cli/src/expected_switch/add/args.rs
@@ -37,6 +37,8 @@ pub struct Args {
)]
pub switch_serial_number: String,

#[clap(long, help = "NVOS MAC address of the expected switch")]
pub nvos_mac_address: Option<MacAddress>,
#[clap(long, help = "NVOS username of the expected switch")]
pub nvos_username: Option<String>,
#[clap(long, help = "NVOS password of the expected switch")]
@@ -89,6 +91,7 @@ impl From<Args> for rpc::forge::ExpectedSwitch {
switch_serial_number: value.switch_serial_number,
metadata: Some(metadata),
rack_id: value.rack_id,
nvos_mac_address: value.nvos_mac_address.map(|m| m.to_string()),
nvos_username: value.nvos_username,
nvos_password: value.nvos_password,
}
1 change: 1 addition & 0 deletions crates/admin-cli/src/expected_switch/common.rs
@@ -25,6 +25,7 @@ pub struct ExpectedSwitchJson {
pub bmc_username: String,
pub bmc_password: String,
pub switch_serial_number: String,
pub nvos_mac_address: Option<MacAddress>,
pub nvos_username: Option<String>,
pub nvos_password: Option<String>,
#[serde(default)]
2 changes: 2 additions & 0 deletions crates/admin-cli/src/expected_switch/show/cmd.rs
@@ -100,6 +100,7 @@ fn convert_and_print_into_nice_table(
table.set_titles(row![
"Serial Number",
"BMC Mac",
"NVOS Mac",
"Interface IP",
"Associated Machine",
"Name",
@@ -141,6 +142,7 @@ fn convert_and_print_into_nice_table(
table.add_row(row![
expected_switch.switch_serial_number,
expected_switch.bmc_mac_address,
expected_switch.nvos_mac_address.as_deref().unwrap_or_default(),
machine_interface
.map(|x| x.address.join("\n"))
.unwrap_or("Undiscovered".to_string()),
10 changes: 10 additions & 0 deletions crates/admin-cli/src/expected_switch/update/args.rs
@@ -27,6 +27,9 @@ use uuid::Uuid;
"bmc_username",
"bmc_password",
"switch_serial_number",
"nvos_mac_address",
"nvos_username",
"nvos_password",
Collaborator

Is this safe to keep in PG unencrypted? (i.e. are they well-known defaults?)

They are well-known defaults, similar to the BMC creds. nvos_password will be rotated during ingestion.

])))]
pub struct Args {
#[clap(short = 'a', long, help = "BMC MAC Address of the expected switch")]
@@ -59,6 +62,12 @@ pub struct Args {
)]
pub switch_serial_number: Option<String>,

#[clap(
long,
group = "group",
help = "NVOS MAC address of the expected switch"
)]
pub nvos_mac_address: Option<MacAddress>,
#[clap(long, group = "group", help = "NVOS username of the expected switch")]
pub nvos_username: Option<String>,
#[clap(long, group = "group", help = "NVOS password of the expected switch")]
@@ -140,6 +149,7 @@ impl TryFrom<Args> for rpc::forge::ExpectedSwitch {
labels: crate::metadata::parse_rpc_labels(args.labels.unwrap_or_default()),
}),
rack_id: args.rack_id,
nvos_mac_address: args.nvos_mac_address.map(|m| m.to_string()),
})
}
}
2 changes: 1 addition & 1 deletion crates/admin-cli/src/rpc.rs
@@ -564,7 +564,6 @@ impl ApiClient {

Ok(self.0.update_expected_machine(request).await?)
}

pub async fn replace_all_expected_machines(
&self,
expected_machine_list: Vec<ExpectedMachineJson>,
@@ -635,6 +634,7 @@ impl ApiClient {
bmc_username: switch.bmc_username,
bmc_password: switch.bmc_password,
switch_serial_number: switch.switch_serial_number,
nvos_mac_address: switch.nvos_mac_address.map(|m| m.to_string()),
nvos_username: switch.nvos_username,
nvos_password: switch.nvos_password,
metadata: switch.metadata,
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
-- Add switch_reprovisioning_requested and firmware_upgrade_status columns to switches table.
-- switch_reprovisioning_requested: when set by an external entity, the state controller (when switch is Ready) transitions to ReProvisioning::Start.
-- firmware_upgrade_status: used during ReProvisioning (WaitFirmwareUpdateCompletion): Started, InProgress, Completed, Failed.
ALTER TABLE switches
ADD COLUMN switch_reprovisioning_requested JSONB,
ADD COLUMN firmware_upgrade_status JSONB;
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
-- Add nvos_mac_address column to expected_switches table (NVOS host MAC, similar to bmc_mac_address).
ALTER TABLE expected_switches
ADD COLUMN nvos_mac_address macaddr;
Collaborator

Why do we need both nvos_mac_address and bmc_mac_address? expected_switches really only needs to know how to log in to the switch's BMC.

There is a bug in the NV switch Redfish implementation: over Redfish OOB, no information is available about the cabled eth port on NVOS. Without the NVOS eth0/eth1 MAC address we can't associate the explored switch. Furthermore, we can't upgrade firmware.

And we can't install anything on NVOS to collect the mapping info.

Contributor

should we then add both addresses into expected switches?
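
To illustrate the association problem described in this thread, here is a minimal sketch of matching an explored host interface to an expected switch by its NVOS MAC. Everything in it is an assumption made for illustration: `ExpectedSwitchEntry` and `find_expected_by_nvos_mac` are hypothetical names, MAC addresses are modeled as plain `[u8; 6]` arrays, and the lookup runs over an in-memory slice rather than the database.

```rust
/// Hypothetical, simplified stand-in for a row in `expected_switches`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExpectedSwitchEntry {
    pub serial_number: String,
    pub bmc_mac_address: [u8; 6],
    /// Optional, mirroring the nullable `nvos_mac_address` column.
    pub nvos_mac_address: Option<[u8; 6]>,
}

/// Returns the expected switch whose NVOS MAC equals the MAC seen on an
/// explored host interface, if any. Entries without an NVOS MAC never match.
pub fn find_expected_by_nvos_mac(
    entries: &[ExpectedSwitchEntry],
    explored_mac: [u8; 6],
) -> Option<&ExpectedSwitchEntry> {
    entries
        .iter()
        .find(|entry| entry.nvos_mac_address == Some(explored_mac))
}
```

Entries that only carry a BMC MAC are skipped by this lookup, which is the gap the NVOS MAC column is meant to close.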

5 changes: 3 additions & 2 deletions crates/api-db/src/expected_switch.rs
@@ -165,9 +165,9 @@ pub async fn create(
) -> DatabaseResult<ExpectedSwitch> {
let id = switch.expected_switch_id.unwrap_or_else(Uuid::new_v4);
let query = "INSERT INTO expected_switches
(expected_switch_id, bmc_mac_address, bmc_username, bmc_password, serial_number, metadata_name, metadata_description, rack_id, metadata_labels, nvos_username, nvos_password)
(expected_switch_id, bmc_mac_address, bmc_username, bmc_password, serial_number, metadata_name, metadata_description, rack_id, metadata_labels, nvos_username, nvos_password, nvos_mac_address)
VALUES
($1::uuid, $2::macaddr, $3::varchar, $4::varchar, $5::varchar, $6::varchar, $7::varchar, $8::varchar, $9::jsonb, $10::varchar, $11::varchar) RETURNING *";
($1::uuid, $2::macaddr, $3::varchar, $4::varchar, $5::varchar, $6::varchar, $7::varchar, $8::varchar, $9::jsonb, $10::varchar, $11::varchar, $12::macaddr) RETURNING *";

sqlx::query_as(query)
.bind(id)
@@ -181,6 +181,7 @@ pub async fn create(
.bind(sqlx::types::Json(&switch.metadata.labels))
.bind(&switch.nvos_username)
.bind(&switch.nvos_password)
.bind(switch.nvos_mac_address)
.fetch_one(txn)
.await
.map_err(|err: sqlx::Error| match err {
74 changes: 73 additions & 1 deletion crates/api-db/src/switch.rs
@@ -22,7 +22,9 @@ use chrono::prelude::*;
use config_version::{ConfigVersion, Versioned};
use futures::StreamExt;
use model::controller_outcome::PersistentStateHandlerOutcome;
use model::switch::{NewSwitch, Switch, SwitchControllerState};
use model::switch::{
FirmwareUpgradeStatus, NewSwitch, Switch, SwitchControllerState, SwitchReprovisionRequest,
};
use sqlx::PgConnection;

use crate::{
@@ -81,6 +83,8 @@ pub async fn create(txn: &mut PgConnection, new_switch: &NewSwitch) -> DatabaseR
version,
},
controller_state_outcome: None,
switch_reprovisioning_requested: None,
firmware_upgrade_status: None,
})
}

@@ -128,6 +132,18 @@ pub async fn find_by_id(txn: &mut PgConnection, id: &SwitchId) -> DatabaseResult
}
}

pub async fn find_by_host_mac_address(
txn: &mut PgConnection,
host_mac_address: &MacAddress,
) -> DatabaseResult<Option<Switch>> {
let query = sqlx::query_as::<_, Switch>("SELECT * FROM switches WHERE host_mac_address = $1");
query
.bind(host_mac_address)
.fetch_optional(txn)
.await
.map_err(|e| DatabaseError::new("find_by_host_mac_address", e))
}

pub async fn find_all(txn: &mut PgConnection) -> DatabaseResult<Vec<SwitchId>> {
let query = sqlx::query_as::<_, SwitchId>("SELECT id FROM switches WHERE deleted IS NULL");

@@ -207,6 +223,62 @@ pub async fn update_controller_state_outcome(
Ok(())
}

/// Sets switch_reprovisioning_requested on the switch. Can be called from any state machine or
/// service. When the switch is in Ready state, the switch state controller will observe the flag
/// and transition to ReProvisioning::Start.
pub async fn set_switch_reprovisioning_requested(
txn: &mut PgConnection,
switch_id: SwitchId,
initiator: &str,
) -> DatabaseResult<()> {
let req = SwitchReprovisionRequest {
requested_at: Utc::now(),
initiator: initiator.to_string(),
};
let query =
"UPDATE switches SET switch_reprovisioning_requested = $1 WHERE id = $2 RETURNING id";
sqlx::query_as::<_, SwitchId>(query)
.bind(sqlx::types::Json(req))
.bind(switch_id)
.fetch_optional(txn)
.await
.map_err(|e| DatabaseError::new("set_switch_reprovisioning_requested", e))?;
Ok(())
}

/// Clears switch_reprovisioning_requested. Typically called when reprovisioning completes or is
/// cancelled.
pub async fn clear_switch_reprovisioning_requested(
txn: &mut PgConnection,
switch_id: SwitchId,
) -> DatabaseResult<()> {
let query =
"UPDATE switches SET switch_reprovisioning_requested = NULL WHERE id = $1 RETURNING id";
sqlx::query_as::<_, SwitchId>(query)
.bind(switch_id)
.fetch_optional(txn)
.await
.map_err(|e| DatabaseError::new("clear_switch_reprovisioning_requested", e))?;
Ok(())
}

/// Sets firmware_upgrade_status on the switch. Call from any state machine or service to report
/// upgrade progress. WaitFirmwareUpdateCompletion reads this: Completed → Ready, Failed → Error.
pub async fn update_firmware_upgrade_status(
txn: &mut PgConnection,
switch_id: SwitchId,
status: Option<&FirmwareUpgradeStatus>,
) -> DatabaseResult<()> {
let query = "UPDATE switches SET firmware_upgrade_status = $1 WHERE id = $2 RETURNING id";
sqlx::query_as::<_, SwitchId>(query)
.bind(status.map(|s| sqlx::types::Json(s.clone())))
.bind(switch_id)
.fetch_optional(txn)
.await
.map_err(|e| DatabaseError::new("update_firmware_upgrade_status", e))?;
Ok(())
}

pub async fn mark_as_deleted<'a>(
switch: &'a mut Switch,
txn: &mut PgConnection,
18 changes: 17 additions & 1 deletion crates/api-model/src/expected_switch.rs
@@ -28,11 +28,13 @@ use uuid::Uuid;

use crate::metadata::{Metadata, default_metadata_for_deserializer};

#[derive(Debug, Clone, Default, Deserialize)]
#[derive(Default, Debug, Clone, Deserialize)]
#[serde(default)]
pub struct ExpectedSwitch {
#[serde(default)]
pub expected_switch_id: Option<Uuid>,
pub bmc_mac_address: MacAddress,
pub nvos_mac_address: Option<MacAddress>,
pub bmc_username: String,
pub serial_number: String,
pub bmc_password: String,
@@ -52,9 +54,12 @@ impl<'r> FromRow<'r, PgRow> for ExpectedSwitch {
labels: labels.0,
};

let nvos_mac_address: Option<MacAddress> = row.try_get("nvos_mac_address").ok();

Ok(ExpectedSwitch {
expected_switch_id: row.try_get("expected_switch_id")?,
bmc_mac_address: row.try_get("bmc_mac_address")?,
nvos_mac_address,
bmc_username: row.try_get("bmc_username")?,
serial_number: row.try_get("serial_number")?,
bmc_password: row.try_get("bmc_password")?,
@@ -75,6 +80,7 @@ impl From<ExpectedSwitch> for rpc::forge::ExpectedSwitch {
value: u.to_string(),
}),
bmc_mac_address: expected_switch.bmc_mac_address.to_string(),
nvos_mac_address: expected_switch.nvos_mac_address.map(|m| m.to_string()),
bmc_username: expected_switch.bmc_username,
bmc_password: expected_switch.bmc_password,
switch_serial_number: expected_switch.serial_number,
@@ -92,6 +98,15 @@ impl TryFrom<rpc::forge::ExpectedSwitch> for ExpectedSwitch {
fn try_from(rpc: rpc::forge::ExpectedSwitch) -> Result<Self, Self::Error> {
let bmc_mac_address = MacAddress::try_from(rpc.bmc_mac_address.as_str())
.map_err(|_| RpcDataConversionError::InvalidMacAddress(rpc.bmc_mac_address.clone()))?;
// Convert the optional MAC string; `transpose` turns Option<Result<..>> into Result<Option<..>>.
let nvos_mac_address = rpc
    .nvos_mac_address
    .map(|mac_address| {
        MacAddress::try_from(mac_address.as_str())
            .map_err(|_| RpcDataConversionError::InvalidMacAddress(mac_address))
    })
    .transpose()?;
let expected_switch_id = rpc
.expected_switch_id
.map(|u| {
@@ -111,6 +126,7 @@ impl TryFrom<rpc::forge::ExpectedSwitch> for ExpectedSwitch {
nvos_password: rpc.nvos_password,
metadata,
rack_id: rpc.rack_id,
nvos_mac_address,
})
}
}
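
The optional NVOS MAC conversion in this `TryFrom` impl is an instance of a common pattern: an `Option<String>` that must become a `Result<Option<T>, E>`, so that a missing value is fine but a malformed one is an error. One idiomatic way to express this is `Option::map` followed by `Option::transpose`. The sketch below uses only the standard library; `Mac`, `InvalidMacAddress`, and `parse_optional_mac` are hypothetical stand-ins for the real `MacAddress` type and `RpcDataConversionError::InvalidMacAddress`, and the parser accepts only the colon-separated form.

```rust
use std::str::FromStr;

/// Hypothetical minimal MAC type used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Mac(pub [u8; 6]);

/// Hypothetical error carrying the rejected input.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InvalidMacAddress(pub String);

impl FromStr for Mac {
    type Err = InvalidMacAddress;

    /// Parses exactly six colon-separated, two-digit hex octets.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let invalid = || InvalidMacAddress(s.to_string());
        let mut bytes = [0u8; 6];
        let mut parts = s.split(':');
        for byte in bytes.iter_mut() {
            let part = parts.next().ok_or_else(invalid)?;
            if part.len() != 2 || !part.bytes().all(|b| b.is_ascii_hexdigit()) {
                return Err(invalid());
            }
            *byte = u8::from_str_radix(part, 16).map_err(|_| invalid())?;
        }
        // Reject trailing octets such as a seventh group.
        if parts.next().is_some() {
            return Err(invalid());
        }
        Ok(Mac(bytes))
    }
}

/// `None` stays `None`; `Some(text)` must parse, otherwise the error propagates.
pub fn parse_optional_mac(raw: Option<String>) -> Result<Option<Mac>, InvalidMacAddress> {
    raw.map(|text| text.parse::<Mac>()).transpose()
}
```

With `transpose`, the caller can apply `?` once and keep the `Option`, instead of branching on `is_none()` and calling `unwrap()`.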