feat: (Doc) Rack State Machine interaction with Machine, Switch #623

Open
vinodchitraliNVIDIA wants to merge 1 commit into NVIDIA:main from vinodchitraliNVIDIA:vc/state

@vinodchitraliNVIDIA

Description

Combined state machines for Machine (the lifecycle of each compute tray / managed host), Switch (each switch), and Rack (a collection of machines, switches, and a power shelf). The proposal shows all three state machines and the transitions between the Rack state machine and the Machine/Switch state machines.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@github-actions

🛡️ Vulnerability Scan

🚨 Found 72 vulnerability(ies)
📊 vs main: 72 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 72
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-18 20:16:49 UTC | Commit: 635ee6c

@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-18 20:18:44 UTC | Commit: 635ee6c

@Matthias247 left a comment

There seems to be some duplication in the doc. Can you check?

Otherwise I'm mostly adding the same comments as in the last review meeting. We can definitely resolve them during implementation.

[*] --> R_Created : site-op enters expected rack {rack-id and rack type} and site explore creates rack.
R_Created --> R_Initializing : machine or switch created with some rack ID\n(and expected rack type)
R_Initializing --> R_Discovering : any one machines, nvswitches discovered
R_Discovering --> R_Maintenance : when all machines (M_Ready) and switches (S_Ready)\nrack sends S_ReProvisioning, M_HostReprovision\nIssue Provision to compute, switch to BKG

Since Maintenance is very overloaded, I'd like to use a different name here (Setup, SoftwareDeployment, etc).
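
For illustration only, the rack transitions quoted above can be sketched as a lookup table. This is a minimal Python sketch, not the project's implementation: the enum and event names are hypothetical, and `MAINTENANCE` simply mirrors the current `R_Maintenance` name that this comment proposes to rename.

```python
from enum import Enum, auto


class RackState(Enum):
    CREATED = auto()        # R_Created
    INITIALIZING = auto()   # R_Initializing
    DISCOVERING = auto()    # R_Discovering
    MAINTENANCE = auto()    # R_Maintenance (name under discussion in this review)


class RackEvent(Enum):
    # Hypothetical event names, derived from the transition labels in the diagram.
    COMPONENT_CREATED = auto()      # machine or switch created with this rack ID
    COMPONENT_DISCOVERED = auto()   # any one machine or NVSwitch discovered
    ALL_COMPONENTS_READY = auto()   # all machines M_Ready and all switches S_Ready


TRANSITIONS = {
    (RackState.CREATED, RackEvent.COMPONENT_CREATED): RackState.INITIALIZING,
    (RackState.INITIALIZING, RackEvent.COMPONENT_DISCOVERED): RackState.DISCOVERING,
    (RackState.DISCOVERING, RackEvent.ALL_COMPONENTS_READY): RackState.MAINTENANCE,
}


def next_state(state, event):
    """Return the next rack state, or the current state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

With a table like this, renaming the state is a one-line change to the enum, and events that do not apply in the current state leave the rack where it is.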

state R_Maintenance {
[*] --> R_Maintenance_RMS_Firmware_Updates
state "RMS:Firmware Updates" as R_Maintenance_RMS_Firmware_Updates
state "RMS:Configure NMX Cluster" as R_Maintenance_RMS_Configure_NMX_Cluster

We should consider keeping it separate, so that the "setup cluster" step could be executed from "Ready" without having to reinstall all software on all components.

@amit-pabalkar, is there any dependency on the firmware update? Otherwise we can separate it.
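
One way to picture the separation being discussed is a step planner that skips firmware updates when the rack is already set up. This is a hypothetical Python sketch: the `R_Ready` state name, the `reinstall_software` flag, and the ordering of the two steps are assumptions, and it presumes the open question above is answered with "no dependency on the firmware update".

```python
# Step labels taken from the quoted diagram.
FIRMWARE_UPDATES = "RMS:Firmware Updates"
CONFIGURE_NMX_CLUSTER = "RMS:Configure NMX Cluster"


def steps_for(entry_state, reinstall_software=True):
    """Return the ordered steps to run when entering the setup workflow.

    entry_state is the rack state the workflow is entered from. "R_Ready" is
    an assumed name for the state the reviewer refers to as "Ready".
    """
    if entry_state == "R_Discovering":
        # First-time setup: install software, then configure the cluster.
        return [FIRMWARE_UPDATES, CONFIGURE_NMX_CLUSTER]
    if entry_state == "R_Ready":
        # Already set up: cluster configuration can run on its own.
        if reinstall_software:
            return [FIRMWARE_UPDATES, CONFIGURE_NMX_CLUSTER]
        return [CONFIGURE_NMX_CLUSTER]
    raise ValueError(f"setup cannot be entered from {entry_state!r}")
```

If cluster configuration does turn out to depend on a firmware update, the `R_Ready` branch would have to check firmware versions before skipping that step.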


---

## Expected Rack API Design

It's all already in the code and the gRPC schema, so we can probably minimize what we repeat here. Maybe keep the APIs, but for parameters readers can just look at the gRPC schema.

Will remove. @chet was a little bit ahead of time :)


Rack topology is used when transitioning from **R_Initializing** to **R_Discovering** (matching discovered machines and switches to this rack by rack ID and optionally by topology) and when validating that the rack is complete (all expected slots or counts satisfied).
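
As a rough illustration of the completeness check described above, the following Python sketch matches discovered components to a rack by rack ID and compares per-kind counts against the expected topology. All names (`Component`, `components_for_rack`, `is_rack_complete`, the `expected_counts` mapping) are hypothetical and not taken from the gRPC schema; slot-level topology matching is omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    component_id: str
    kind: str      # "machine" or "switch"
    rack_id: str


def components_for_rack(rack_id, discovered):
    """Select the discovered components that reference the given rack ID."""
    return [c for c in discovered if c.rack_id == rack_id]


def is_rack_complete(rack_id, expected_counts, discovered):
    """True when every expected component kind has at least the expected count."""
    found = Counter(c.kind for c in components_for_rack(rack_id, discovered))
    return all(found[kind] >= count for kind, count in expected_counts.items())
```

For example, a rack expecting two machines and one switch stays incomplete until both machines and the switch have been discovered with that rack ID.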

**DB design**

Not very state machine related :)

I'm fine keeping it here for now. But I think once the code is integrated, we can just remove it from here. Or you could just describe it in simple words: "All components inside the rack link to their parent rack via a rack_id field in their database tables".

Sure, will clean this up.
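
The one-sentence description suggested above ("all components link to their parent rack via a rack_id field") can be illustrated with an in-memory SQLite sketch. The table and column names other than `rack_id` are assumptions for the example and do not reflect the project's actual schema; the power shelf table is left out for brevity.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical rack, machine, and switch tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE racks (id TEXT PRIMARY KEY, rack_type TEXT);
        CREATE TABLE machines (id TEXT PRIMARY KEY, rack_id TEXT REFERENCES racks(id));
        CREATE TABLE switches (id TEXT PRIMARY KEY, rack_id TEXT REFERENCES racks(id));
        """
    )
    conn.execute("INSERT INTO racks VALUES (?, ?)", ("rack-1", "example-type"))
    conn.executemany(
        "INSERT INTO machines VALUES (?, ?)",
        [("m-1", "rack-1"), ("m-2", "rack-1")],
    )
    conn.execute("INSERT INTO switches VALUES (?, ?)", ("s-1", "rack-1"))
    return conn


def components_in_rack(conn, rack_id):
    """Return (kind, id) rows for every component whose rack_id matches."""
    return conn.execute(
        "SELECT 'machine', id FROM machines WHERE rack_id = ? "
        "UNION ALL "
        "SELECT 'switch', id FROM switches WHERE rack_id = ? "
        "ORDER BY 1, 2",
        (rack_id, rack_id),
    ).fetchall()
```

Because each component table carries its own `rack_id`, collecting a rack's members is a simple filter per table with no separate membership table.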

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
@Matthias247 left a comment

I'm fine with merging this and updating it according to the implementation.
