Feat: Add api to get machines with leaks#570
Feat: Add api to get machines with leaks#570srinivasadmurthy wants to merge 12 commits intoNVIDIA:mainfrom
Conversation
Matthias247
left a comment
There was a problem hiding this comment.
I don't know about the exact use-case for this.
But I'd prefer not to add APIs for searching for additional alerts for specific alert types, and instead rather extending the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and requires no new API.
|
@Matthias247 @kensimon Thanks for your review feedback. I have implemented the suggest changes and am requesting a re-review. |
crates/api-db/migrations/20260316223033_machines_health_override_gin.sql
Outdated
Show resolved
Hide resolved
|
I'm going to quote this comment from @srinivasadmurthy to get a discussion going:
I really think if the goal here is to respond to leak alerts and shut machines off, having two different layers of polling (having to wait for the health monitor to scrape sensors from a very unreliable BMC API, then having to wait for RLA to pick up the results from the health monitor) is likely not going to be fast enough. You'd have to have an unreasonably fast polling interval to catch the alert in time to do something about it, and the cost of that is likely too much in a larger datacenter with lots of machines. It seems like it'd be better for health events to stream directly to RLA, so that the instant a health override is added to carbide, it's also forwarded to RLA which can act on it directly, bypassing the polling altogether. Is this something we've thought about? |
Yes that's the long-term intention. In short-term, people were not sure about what health streaming/push mechanism should be, hence we opt to do this query model for now. It will still be very useful for other non-handling purpose (e.g., we will check if any tray in a rack has a leak before turning on host on a tray). But yes, handling scenario will switch to a fast method if needed. |
Description
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes
Tested by setting debug features cpu2temp_alert and leak_alert in crates/health/Cargo.toml.
Setting these generate relevant overrides and used grpcurl to test GetHardwareLeaksReport API.