
Feat: Add api to get machines with leaks#570

Open
srinivasadmurthy wants to merge 12 commits into NVIDIA:main from srinivasadmurthy:sdmrlav2

Conversation

@srinivasadmurthy (Contributor)

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Tested by setting the debug features cpu2temp_alert and leak_alert in crates/health/Cargo.toml. Setting these generates the relevant overrides; grpcurl was then used to test the GetHardwareLeaksReport API.
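For reference, a grpcurl invocation for this kind of test would look roughly like the following. The listen address, port, and fully qualified service path are assumptions for illustration, not taken from this PR:

```shell
# Hypothetical invocation: the address, port, package, and service name
# are placeholders; only the method name comes from the PR description.
grpcurl -plaintext \
  -d '{}' \
  localhost:50051 \
  carbide.MachineManager/GetHardwareLeaksReport
```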

@srinivasadmurthy srinivasadmurthy requested a review from a team as a code owner March 16, 2026 05:46

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Matthias247 (Contributor) left a comment


I don't know about the exact use-case for this.

But I'd prefer not to add APIs for searching for alerts of specific alert types, and would instead rather extend the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and would require no new API.
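The suggested alternative could look something like the following sketch. This is a hypothetical fragment, not the actual Carbide proto; the message name, comment, and field number are all assumptions:

```protobuf
// Hypothetical sketch: extending the existing FindMachineIds search
// filter rather than adding a new RPC.
message MachineSearchFilter {
  // ...existing filter fields...

  // Match machines that currently have an active alert from any of
  // these health probe IDs (e.g. a leak-detection probe).
  repeated string health_probe_ids = 10; // field number is an assumption
}
```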

@srinivasadmurthy (Contributor, Author)

@Matthias247 @kensimon Thanks for your review feedback. I have implemented the suggested changes and am requesting a re-review.

@kensimon (Contributor)

I'm going to quote this comment from @srinivasadmurthy to get a discussion going:

This API is for use by RLA. Health monitor in carbide is scraping BMC sensors and detecting compute tray leaks. Once a leak is detected, it's placing a healthoverride with Leaks classification. RLA needs to query Carbide for leaking machines periodically, and then act on that. The returned data includes the leaking machine IDs, and their current power state. For each machine with a leak, RLA will issue two calls: UpdatePowerOptions to set the desired machine state to OFF, and then call AdminPowerControl to switch off the machine. Since this is supposed to respond to leaks reported by health monitor, it's not a general purpose search routine. Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting Machine IDs first with a filter, and then call GetPowerOptions.

I really think if the goal here is to respond to leak alerts and shut machines off, having two different layers of polling (having to wait for the health monitor to scrape sensors from a very unreliable BMC API, then having to wait for RLA to pick up the results from the health monitor) is likely not going to be fast enough. You'd have to have an unreasonably fast polling interval to catch the alert in time to do something about it, and the cost of that is likely too much in a larger datacenter with lots of machines.

It seems like it'd be better for health events to stream directly to RLA, so that the instant a health override is added to carbide, it's also forwarded to RLA which can act on it directly, bypassing the polling altogether. Is this something we've thought about?
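A push-based design like the one proposed here might be sketched as a server-streaming RPC. This is a hypothetical fragment, not an actual Carbide API; every name and field below is an assumption:

```protobuf
// Hypothetical sketch: stream health overrides to subscribers (such as
// RLA) as they are added, instead of having subscribers poll.
service HealthEvents {
  rpc StreamHealthOverrides(StreamHealthOverridesRequest)
      returns (stream HealthOverrideEvent);
}

message StreamHealthOverridesRequest {
  // Optional filter, e.g. only Leak-classified overrides.
  repeated string classifications = 1;
}

message HealthOverrideEvent {
  string machine_id = 1;
  string classification = 2; // e.g. "Leak"
  int64 timestamp_ms = 3;
}
```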

@zhaozhongn

> [@kensimon's comment above, quoted in full]

Yes, that's the long-term intention. In the short term, people were not sure what the health streaming/push mechanism should be, so we opted for this query model for now. It will still be very useful for other, non-handling purposes (e.g., we will check whether any tray in a rack has a leak before turning on the host on a tray). But yes, the handling scenario will switch to a faster method if needed.
