Feature/prometheus power meter #2390
GitMeder wants to merge 6 commits into sustainable-computing-io:main
Add comprehensive proposal for GPU power monitoring with:
Architecture:
- Vendor-agnostic design with registry pattern
- NVIDIA backend using NVML + dcgm-exporter (for MIG)
- Placeholder for AMD (ROCm SMI) and Intel (Level Zero)
Key features:
- Per-process power attribution via compute utilization
- GPU sharing modes: exclusive, time-slicing, MIG
- Idle power detection and active power calculation
- MIG support via hybrid NVML topology + dcgm-exporter activity
Includes Grafana screenshots demonstrating:
- Node GPU power consumption
- Per-process GPU power attribution
- GPU idle power baseline
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Signed-off-by: GitMeder <moeder@t-online.de>
47dfad0 to 76489b3
Hi @GitMeder, what does PDU power mean? At first glance it appears to be the power of the whole node, including all hardware components (disks, fans, cards, CPU, memory, etc.).
@vimalk78 Thanks for your feedback. I am working with @GitMeder on this extension. 'PDU' stands for 'power distribution unit' (e.g., https://www.rittal.com/de-de/products/PG20231215POW101/PG20240419STR003/PRO115364), so yes, these readings cover the full node including all devices (e.g., memory, disks and network), similar to the existing Redfish platform power reader (https://github.com/sustainable-computing-io/kepler/blob/main/docs/user/configuration.md#-experimental-configuration). We are aware that the current attribution logic is based on CPU usage, but providing Kepler users with a realistic full-node power value might give them a better idea of the overall power consumption. Most of it will probably end up in the 'Idle Power' bucket of the Kepler attribution anyway: https://github.com/sustainable-computing-io/kepler/blob/main/docs/developer/power-attribution-guide.md. This might also be a useful addition for nodes with ARM or RISC-V processors, which have no RAPL or NVIDIA NVML access.
Motivation
Kepler currently relies on platform-based power sources (e.g. RAPL, Redfish).
In our setup, total server power is measured externally via PDUs and exposed
to Prometheus through the Rittal exporter.
This PR adds a generic Prometheus-based power input to Kepler, enabling
process- and node-level power attribution based on externally measured
power metrics.
What is included
Example use case
PDU → Rittal Exporter → Prometheus (recording rules)
→ Kepler fetches *_power_watts_avg
→ Kepler attributes power to processes
→ Results exposed again via Prometheus for visualization (e.g. Grafana)
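The two Kepler steps in this flow can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the metric is read through the standard Prometheus instant-query endpoint (`/api/v1/query`), and the function names `parseNodeWatts` and `splitByCPUShare` are hypothetical. The split shown is a plain CPU-time ratio and is only an approximation of the attribution described in Kepler's power attribution guide.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// promResponse mirrors the subset of a Prometheus instant-query
// response (GET /api/v1/query) that this sketch needs.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  []any             `json:"value"` // [unix timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// parseNodeWatts (hypothetical name) returns the first sample of an
// instant vector, interpreted as node power in watts.
func parseNodeWatts(body []byte) (float64, error) {
	var r promResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, fmt.Errorf("decode response: %w", err)
	}
	if r.Status != "success" || r.Data.ResultType != "vector" {
		return 0, fmt.Errorf("unexpected response: status=%q type=%q", r.Status, r.Data.ResultType)
	}
	if len(r.Data.Result) == 0 || len(r.Data.Result[0].Value) != 2 {
		return 0, fmt.Errorf("no samples in response")
	}
	s, ok := r.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("sample value is not a string")
	}
	return strconv.ParseFloat(s, 64)
}

// splitByCPUShare (hypothetical name) divides activeWatts across
// processes in proportion to their CPU time in the sampling interval.
// Assumption: a simple ratio split, not Kepler's exact logic.
func splitByCPUShare(activeWatts float64, cpuSeconds map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(cpuSeconds))
	var total float64
	for _, s := range cpuSeconds {
		total += s
	}
	if total == 0 {
		return out // nothing ran, so nothing to attribute
	}
	for name, s := range cpuSeconds {
		out[name] = activeWatts * s / total
	}
	return out
}
```

With 100 W of active power and two processes that used 3 s and 1 s of CPU time, the split assigns 75 W and 25 W respectively. Whatever part of the PDU reading is not classified as active would remain in the idle bucket.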
Testing
The feature was validated in a real environment using
Rittal PDUs → Prometheus / VictoriaMetrics → Kepler.
Due to the external dependency on Prometheus metrics,
no automated unit tests are included yet.
Follow-up tests can be added once the interface is stabilized.
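One possible way to remove the external dependency in follow-up tests is to stand in for Prometheus with a local stub server from Go's `net/http/httptest` package. The sketch below is an assumption about how such a test could look, not part of this PR: `newFakePrometheus` and `queryWatts` are hypothetical names, and the reader in this PR may be structured differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strconv"
)

// newFakePrometheus (hypothetical helper) returns a local test server
// that answers /api/v1/query with a single-sample instant vector.
func newFakePrometheus(watts string) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/query", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprintf(w,
			`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1700000000,%q]}]}}`,
			watts)
	})
	return httptest.NewServer(mux)
}

// queryWatts (hypothetical reader) runs an instant query against
// baseURL and returns the first sample as watts.
func queryWatts(client *http.Client, baseURL, query string) (float64, error) {
	resp, err := client.Get(baseURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("prometheus returned %s", resp.Status)
	}
	var r struct {
		Data struct {
			Result []struct {
				Value []any `json:"value"` // [unix timestamp, "value as string"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		return 0, err
	}
	if len(r.Data.Result) == 0 || len(r.Data.Result[0].Value) != 2 {
		return 0, fmt.Errorf("no samples for %q", query)
	}
	s, ok := r.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("sample value is not a string")
	}
	return strconv.ParseFloat(s, 64)
}
```

A test would start the stub with a known value, point the reader at `srv.URL`, and compare the returned watts. The metric name passed as the query (for example `pdu_power_watts_avg`) is a placeholder, since the stub answers every query the same way.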