Hi team,
I'd like to propose adding a new Prometheus metric that exposes which node currently holds the reboot lock, for better observability and integration with monitoring systems.
Please consider adding a metric such as:
kured_current_node_lock{node="node-name"} = 1
This metric would:
- Be exposed on all nodes, regardless of whether they currently hold the lock.
- Allow Prometheus/Grafana to surface which node is actively coordinating a reboot.
- Help cluster operators quickly identify reboot activity across nodes.
Why This Is Useful:
- Debugging coordination issues: Knowing which node is holding the lock helps diagnose stuck or long reboots.
- Auditing: Helps confirm reboots are progressing as expected across rolling updates.
- Alerting: We can alert if a lock is held for too long or is stuck on a specific node.
It could be gated behind a feature flag or config option if needed. Since the locking mechanism is already in place via annotations or leases, each node could derive the metric by comparing the recorded lock holder with its own node identity.
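To make the proposed semantics concrete, here is a minimal, dependency-free sketch in Go. It is only an illustration under stated assumptions, not kured's implementation: the helper names `lockGaugeValue` and `lockMetricLine` are hypothetical, and it assumes non-holders report 0 while the holder reports 1. A real change would more likely set a gauge through the Prometheus Go client than format the sample by hand.

```go
package main

import "fmt"

// lockGaugeValue returns 1 when the local node is the current lock holder
// and 0 otherwise. Both arguments are assumed to be plain node names; how
// the holder is read from the annotation or lease is out of scope here.
func lockGaugeValue(localNode, lockHolder string) int {
	if localNode != "" && localNode == lockHolder {
		return 1
	}
	return 0
}

// lockMetricLine renders one sample of the proposed metric in the
// Prometheus text exposition format ("name{labels} value").
// Hypothetical helper, shown only to illustrate the expected output.
func lockMetricLine(localNode, lockHolder string) string {
	return fmt.Sprintf("kured_current_node_lock{node=%q} %d",
		localNode, lockGaugeValue(localNode, lockHolder))
}
```

With this shape every node always exports the series, so a query can pick out the holder by filtering on a value of 1, and an alert can fire when the same node keeps that value for too long.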
Happy to discuss further, and open to providing a PR once a design decision is made.
Thanks!