feat: (Doc) Rack State Machine interaction with Machine, Switch #623
vinodchitraliNVIDIA wants to merge 1 commit into NVIDIA:main
Matthias247 left a comment:
There seems to be some duplication in the doc. Can you check?
Otherwise I'm mostly adding the same comments as in the last review meeting. We can definitely resolve them during implementation.
[*] --> R_Created : site-op enters expected rack {rack-id and rack type} and site explore creates rack.
R_Created --> R_Initializing : machine or switch created with some rack ID\n(and expected rack type)
R_Initializing --> R_Discovering : any one machines, nvswitches discovered
R_Discovering --> R_Maintenance : when all machines (M_Ready) and switches (S_Ready)\nrack sends S_ReProvisioning, M_HostReprovision\nIssue Provision to compute, switch to BKG
Since Maintenance is very overloaded, I'd like to use a different name here (Setup, SoftwareDeployment, etc).
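For reference during implementation, the four transitions quoted above can be written as a lookup table. This is only a minimal sketch: the `RackEvent` names and the `next_state` helper are hypothetical and do not come from the doc or the codebase, and `MAINTENANCE` keeps the current name until the rename suggested above is decided.

```python
from enum import Enum, auto


class RackState(Enum):
    CREATED = auto()
    INITIALIZING = auto()
    DISCOVERING = auto()
    MAINTENANCE = auto()  # name under discussion (Setup, SoftwareDeployment, ...)


class RackEvent(Enum):
    # Hypothetical event names, chosen only for this illustration.
    COMPONENT_ASSIGNED = auto()      # machine or switch created with this rack ID
    COMPONENT_DISCOVERED = auto()    # at least one machine or switch discovered
    ALL_COMPONENTS_READY = auto()    # all machines M_Ready and all switches S_Ready


# (current state, event) -> next state, mirroring the quoted diagram lines.
TRANSITIONS = {
    (RackState.CREATED, RackEvent.COMPONENT_ASSIGNED): RackState.INITIALIZING,
    (RackState.INITIALIZING, RackEvent.COMPONENT_DISCOVERED): RackState.DISCOVERING,
    (RackState.DISCOVERING, RackEvent.ALL_COMPONENTS_READY): RackState.MAINTENANCE,
}


def next_state(state: RackState, event: RackEvent) -> RackState:
    """Return the next rack state, or the unchanged state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Events that do not apply in the current state leave the rack where it is, which keeps the handler safe to call repeatedly.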
state R_Maintenance {
[*] --> R_Maintenance_RMS_Firmware_Updates
state "RMS:Firmware Updates" as R_Maintenance_RMS_Firmware_Updates
state "RMS:Configure NMX Cluster" as R_Maintenance_RMS_Configure_NMX_Cluster
We should consider keeping it separate, so that the "setup cluster" step could be executed from "Ready" without having to reinstall all software on all components.
@amit-pabalkar, is there any dependency on the firmware update? Otherwise we can separate it.
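To make the separation concrete, here is a rough sketch of how the step list could depend on where the rack is coming from. It assumes the cluster configuration has no dependency on the firmware update, which is exactly the open question above. The step names, the string state names, and `plan_steps` are hypothetical.

```python
# Hypothetical step names for the two RMS sub-states shown in the diagram.
FIRMWARE_UPDATES = "rms_firmware_updates"
CONFIGURE_NMX_CLUSTER = "rms_configure_nmx_cluster"


def plan_steps(rack_state: str, cluster_setup_only: bool = False) -> list[str]:
    """Return the ordered steps to run for a rack in the given state.

    Assumption: configuring the NMX cluster does not require a preceding
    firmware update, so it can be requested on its own from "Ready".
    """
    if rack_state == "Discovering":
        # Initial bring-up: run everything, in order.
        return [FIRMWARE_UPDATES, CONFIGURE_NMX_CLUSTER]
    if rack_state == "Ready" and cluster_setup_only:
        # Re-run only the cluster setup, without reinstalling software.
        return [CONFIGURE_NMX_CLUSTER]
    return []
```

If it turns out the cluster setup does depend on the firmware level, the `Ready` branch would need a version check before skipping the update.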
---
## Expected Rack API Design
It's all already in code and the gRPC schema. We can probably minimize things we want to repeat. Maybe the APIs, but for parameters viewers can just look at the gRPC schema.
Will remove. @chet was a little bit ahead of time :)
Rack topology is used when transitioning from **R_Initializing** to **R_Discovering** (matching discovered machines and switches to this rack by rack ID and optionally by topology) and when validating that the rack is complete (all expected slots or counts satisfied).
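As an illustration of the matching and completeness check described above, a minimal count-based sketch follows. The `ExpectedRack` and `Component` types and their field names are assumptions made for this example, not the gRPC schema, and the optional topology (slot) matching is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRack:
    # Hypothetical shape of an expected rack; the real fields live in the gRPC schema.
    rack_id: str
    expected_machines: int
    expected_switches: int


@dataclass(frozen=True)
class Component:
    component_id: str
    kind: str      # "machine" or "switch"
    rack_id: str   # link to the parent rack


def components_for_rack(rack: ExpectedRack, discovered: list[Component]) -> list[Component]:
    """Match discovered components to the rack by rack ID only."""
    return [c for c in discovered if c.rack_id == rack.rack_id]


def rack_is_complete(rack: ExpectedRack, discovered: list[Component]) -> bool:
    """True once the expected machine and switch counts are both satisfied."""
    members = components_for_rack(rack, discovered)
    machines = sum(1 for c in members if c.kind == "machine")
    switches = sum(1 for c in members if c.kind == "switch")
    return machines >= rack.expected_machines and switches >= rack.expected_switches
```

Components that carry a different rack ID are ignored, so a rack cannot be completed by hardware that belongs elsewhere.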
**DB design**
Not very state machine related :)
I'm fine keeping it here for now. But I think once the code is integrated, we can just remove it from here. Or you just describe it in simple words "All components inside the rack link to their parent rack via a rack_id field in their database tables".
Sure, will clean this up.
635ee6c to 9fba703
9fba703 to abc97d8
Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
abc97d8 to 640a8eb
Matthias247 left a comment:
I'm fine with merging this and updating according to the implementation
Description
Combines the state machines for Machine (the lifecycle of each compute tray / managed host), Switch (each switch), and Rack (a collection of machines, switches, and a power shelf). The proposal shows all three, along with the transitions between the Rack state machine and the Machine and Switch state machines.