Feature/prometheus power meter by GitMeder · Pull Request #2390 · sustainable-computing-io/kepler

GitMeder · 2026-01-30T17:23:22Z

Motivation

Kepler currently relies on platform-based power sources (e.g. RAPL, Redfish).
In our setup, total server power is measured externally via PDUs and exposed
to Prometheus through the Rittal exporter.

This PR adds a generic Prometheus-based power input to Kepler, enabling
process- and node-level power attribution based on externally measured
power metrics.

What is included

New experimental Prometheus power input
Query-based power retrieval from Prometheus / VictoriaMetrics
Node and process power attribution based on CPU usage
Linux-compatible implementation (no RAPL / hwmon required)

Example use case

PDU → Rittal Exporter → Prometheus (recording rules)
→ Kepler fetches *_power_watts_avg
→ Kepler attributes power to processes
→ Results exposed again via Prometheus for visualization (e.g. Grafana)
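
The retrieval step in this flow can be sketched roughly as follows. This is a minimal Python illustration, not the code in this PR (Kepler itself is written in Go). It relies on the Prometheus HTTP API instant-query endpoint (`/api/v1/query`), which VictoriaMetrics also serves. The function names, the example query, and the choice to sum multiple series (e.g. several PDU outlets feeding one server) are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request


def parse_instant_vector(payload: dict) -> float:
    """Return the summed sample values (watts) of an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    samples = data["result"]
    if not samples:
        raise ValueError("query returned no samples")
    # Each sample looks like {"metric": {...}, "value": [<unix_ts>, "<value>"]};
    # Prometheus encodes the value as a string.
    # Assumption: several matching series belong to one node, so they are summed.
    return sum(float(sample["value"][1]) for sample in samples)


def fetch_power_watts(base_url: str, query: str, timeout: float = 5.0) -> float:
    """Run one instant query against a Prometheus-compatible server."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

A call might look like `fetch_power_watts("http://prometheus:9090", 'node_power_watts_avg{instance="server-01"}')`, where both the server address and the metric name are hypothetical placeholders for whatever the recording rules expose.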

Testing

The feature was validated in a real environment using
Rittal PDUs → Prometheus / VictoriaMetrics → Kepler.

Due to the external dependency on Prometheus metrics,
no automated unit tests are included yet.
Follow-up tests can be added once the interface is stabilized.

Add comprehensive proposal for GPU power monitoring with:

Architecture:
- Vendor-agnostic design with registry pattern
- NVIDIA backend using NVML + dcgm-exporter (for MIG)
- Placeholder for AMD (ROCm SMI) and Intel (Level Zero)

Key features:
- Per-process power attribution via compute utilization
- GPU sharing modes: exclusive, time-slicing, MIG
- Idle power detection and active power calculation
- MIG support via hybrid NVML topology + dcgm-exporter activity

Includes Grafana screenshots demonstrating:
- Node GPU power consumption
- Per-process GPU power attribution
- GPU idle power baseline

Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Signed-off-by: GitMeder <moeder@t-online.de>

Signed-off-by: GitMeder <moeder@t-online.de>

vimalk78 · 2026-02-02T17:26:10Z

Hi @GitMeder,
Thanks for the PR.

What does PDU power mean? At first glance it appears to be the power of the whole node, including all hardware components (disks, fans, cards, CPU, memory, etc.).
In Kepler, the power attributed to processes (based on each process's CPU usage) is the power consumed by the CPU only, coming from RAPL or hwmon. So we would need more detailed resource usage data to attribute total PDU power to processes.

brunnert · 2026-02-03T07:40:39Z

@vimalk78 Thanks for your feedback. I am working with @GitMeder on this extension. 'PDU' stands for 'power distribution unit' (e.g., https://www.rittal.com/de-de/products/PG20231215POW101/PG20240419STR003/PRO115364), so yes, these readings cover the full node including all devices (e.g., memory, disk, and network), similar to the existing Redfish platform power reader (https://github.com/sustainable-computing-io/kepler/blob/main/docs/user/configuration.md#-experimental-configuration). We are aware that the current attribution logic is based on CPU usage, but providing Kepler users with a realistic full-node power value might give them a better idea of the overall power consumption. Most of it will probably end up in the 'Idle Power' bucket of the Kepler attribution anyway: https://github.com/sustainable-computing-io/kepler/blob/main/docs/developer/power-attribution-guide.md. This might also be a useful addition for nodes with ARM or RISC-V processors that have no RAPL or NVIDIA NVML access.
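
To illustrate why most of a full-node reading would land in the idle bucket, here is a rough Python sketch of a CPU-usage-based split, loosely following the scheme described in the power attribution guide linked above. The function names and the sample numbers are hypothetical, and the PR's actual implementation may differ.

```python
from __future__ import annotations


def split_node_power(node_watts: float, cpu_usage_ratio: float) -> tuple[float, float]:
    """Split a node-level reading into (active, idle) watts by CPU usage ratio."""
    if not 0.0 <= cpu_usage_ratio <= 1.0:
        raise ValueError("cpu_usage_ratio must be within [0, 1]")
    active = node_watts * cpu_usage_ratio
    return active, node_watts - active


def attribute_to_processes(
    active_watts: float,
    process_cpu_deltas: dict[str, float],
    node_cpu_delta: float,
) -> dict[str, float]:
    """Share active power among processes in proportion to their CPU time."""
    if node_cpu_delta <= 0:
        # No CPU time was consumed in this interval, so nothing is attributed.
        return {name: 0.0 for name in process_cpu_deltas}
    return {
        name: active_watts * delta / node_cpu_delta
        for name, delta in process_cpu_deltas.items()
    }


# Hypothetical numbers: a 200 W PDU reading on a node that is 25 % busy.
active, idle = split_node_power(200.0, 0.25)
shares = attribute_to_processes(active, {"db": 3.0, "web": 1.0}, node_cpu_delta=4.0)
```

With these made-up inputs, 150 W of the 200 W reading stays in the idle bucket and only 50 W is shared among the processes, regardless of how much of the reading comes from disks, fans, or memory.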

github-actions bot added the chore (Routine tasks or maintenance) label Jan 30, 2026

vimalk78 and others added 3 commits February 2, 2026 14:21

Add prometheus-power input for node/process power attribution on Linux

27baed3

Signed-off-by: GitMeder <moeder@t-online.de>

chore: add example config and ignore local kepler binary

76489b3

Signed-off-by: GitMeder <moeder@t-online.de>

GitMeder force-pushed the feature/prometheus-power-meter branch from 47dfad0 to 76489b3 Compare February 2, 2026 13:25

github-actions bot added the docs (Documentation changes) label Feb 2, 2026

brunnert added 3 commits February 3, 2026 08:45

Merge branch 'main' into feature/prometheus-power-meter

0ce27b2

Merge branch 'main' into feature/prometheus-power-meter

285c617

Merge branch 'main' into feature/prometheus-power-meter

6fd3a0d

GitMeder wants to merge 6 commits into sustainable-computing-io:main from GitMeder:feature/prometheus-power-meter
