
Feat: Add api to get machines with leaks#570

Open
srinivasadmurthy wants to merge 12 commits into NVIDIA:main from srinivasadmurthy:sdmrlav2

Conversation

@srinivasadmurthy (Contributor)

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Tested by setting the debug features cpu2temp_alert and leak_alert in crates/health/Cargo.toml. Setting these generates the relevant overrides; grpcurl was then used to test the GetHardwareLeaksReport API.
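For reference, a grpcurl invocation for this kind of test would look roughly like the following. The listen address, port, and fully qualified service path are assumptions for illustration, not taken from this PR:

```shell
# Hypothetical invocation: the address, port, package, and service name
# are placeholders; only the method name comes from the PR description.
grpcurl -plaintext \
  -d '{}' \
  localhost:50051 \
  carbide.MachineManager/GetHardwareLeaksReport
```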

@srinivasadmurthy srinivasadmurthy requested a review from a team as a code owner March 16, 2026 05:46

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Matthias247 (Contributor) left a comment


I don't know about the exact use-case for this.

But I'd prefer not to add APIs for searching for alerts of specific alert types, and would instead rather extend the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and would require no new API.
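The suggested alternative could look something like the following sketch. This is a hypothetical fragment, not the actual Carbide proto; the message name, comment, and field number are all assumptions:

```protobuf
// Hypothetical sketch: extending the existing FindMachineIds search
// filter rather than adding a new RPC.
message MachineSearchFilter {
  // ...existing filter fields...

  // Match machines that currently have an active alert from any of
  // these health probe IDs (e.g. a leak-detection probe).
  repeated string health_probe_ids = 10; // field number is an assumption
}
```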

@srinivasadmurthy (Contributor, Author)

@Matthias247 @kensimon Thanks for your review feedback. I have implemented the suggested changes and am requesting a re-review.

@kensimon (Contributor)

I'm going to quote this comment from @srinivasadmurthy to get a discussion going:

This API is for use by RLA. Health monitor in carbide is scraping BMC sensors and detecting compute tray leaks. Once a leak is detected, it's placing a healthoverride with Leaks classification. RLA needs to query Carbide for leaking machines periodically, and then act on that. The returned data includes the leaking machine IDs, and their current power state. For each machine with a leak, RLA will issue two calls: UpdatePowerOptions to set the desired machine state to OFF, and then call AdminPowerControl to switch off the machine. Since this is supposed to respond to leaks reported by health monitor, it's not a general purpose search routine. Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting Machine IDs first with a filter, and then call GetPowerOptions.

I really think if the goal here is to respond to leak alerts and shut machines off, having two different layers of polling (having to wait for the health monitor to scrape sensors from a very unreliable BMC API, then having to wait for RLA to pick up the results from the health monitor) is likely not going to be fast enough. You'd have to have an unreasonably fast polling interval to catch the alert in time to do something about it, and the cost of that is likely too much in a larger datacenter with lots of machines.

It seems like it'd be better for health events to stream directly to RLA, so that the instant a health override is added to carbide, it's also forwarded to RLA which can act on it directly, bypassing the polling altogether. Is this something we've thought about?
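A push-based design like the one proposed here might be sketched as a server-streaming RPC. This is a hypothetical fragment, not an actual Carbide API; every name and field below is an assumption:

```protobuf
// Hypothetical sketch: stream health overrides to subscribers (such as
// RLA) as they are added, instead of having subscribers poll.
service HealthEvents {
  rpc StreamHealthOverrides(StreamHealthOverridesRequest)
      returns (stream HealthOverrideEvent);
}

message StreamHealthOverridesRequest {
  // Optional filter, e.g. only Leak-classified overrides.
  repeated string classifications = 1;
}

message HealthOverrideEvent {
  string machine_id = 1;
  string classification = 2; // e.g. "Leak"
  int64 timestamp_ms = 3;
}
```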

@zhaozhongn

> [@kensimon's comment above, quoted in full]

Yes, that's the long-term intention. In the short term, people were not sure what the health streaming/push mechanism should be, so we opted for this query model for now. It will still be very useful for other, non-handling purposes (e.g., we will check whether any tray in a rack has a leak before turning on the host on a tray). But yes, the handling scenario will switch to a faster method if needed.
