Feature/prometheus power meter #2390

Open
GitMeder wants to merge 6 commits into sustainable-computing-io:main from GitMeder:feature/prometheus-power-meter

@GitMeder

Motivation

Kepler currently relies on platform-based power sources (e.g. RAPL, Redfish).
In our setup, total server power is measured externally via PDUs and exposed
to Prometheus through the Rittal exporter.

This PR adds a generic Prometheus-based power input to Kepler, enabling
process- and node-level power attribution based on externally measured
power metrics.

What is included

  • New experimental Prometheus power input
  • Query-based power retrieval from Prometheus / VictoriaMetrics
  • Node and process power attribution based on CPU usage
  • Linux-compatible implementation (no RAPL / hwmon required)
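
As a rough illustration of the query-based retrieval listed above, the following minimal sketch builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`, which VictoriaMetrics also serves) and parses the first sample of the response as watts. It is not the code in this PR: the helper names `buildQueryURL` and `parseWatts`, the server address, and the metric name `node_power_watts_avg` are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/url"
	"strconv"
)

// promResponse mirrors only the part of an instant-query response
// that is needed to read a single sample.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  []any             `json:"value"` // [unix timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// buildQueryURL returns the instant-query URL for a PromQL expression.
// (Hypothetical helper, not part of Kepler.)
func buildQueryURL(base, promQL string) string {
	return base + "/api/v1/query?query=" + url.QueryEscape(promQL)
}

// parseWatts extracts the first sample of an instant-vector response.
// (Hypothetical helper, not part of Kepler.)
func parseWatts(body []byte) (float64, error) {
	var r promResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, err
	}
	if r.Status != "success" {
		return 0, errors.New("query failed: status " + r.Status)
	}
	if r.Data.ResultType != "vector" || len(r.Data.Result) == 0 {
		return 0, errors.New("no vector sample returned")
	}
	v := r.Data.Result[0].Value
	if len(v) != 2 {
		return 0, errors.New("malformed sample")
	}
	s, ok := v[1].(string)
	if !ok {
		return 0, errors.New("sample value is not a string")
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	// Assumed address and metric name; substitute the recording rule in use.
	fmt.Println(buildQueryURL("http://prometheus:9090", `node_power_watts_avg{instance="node1"}`))

	// Canned response so the sketch runs without a live Prometheus server.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"node1"},"value":[1769990000,"312.5"]}]}}`)
	watts, err := parseWatts(sample)
	fmt.Println(watts, err)
}
```

Keeping the parsing separate from the HTTP call is one way to make this path unit-testable without a running Prometheus instance.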

Example use case

PDU → Rittal Exporter → Prometheus (recording rules)
→ Kepler fetches *_power_watts_avg
→ Kepler attributes power to processes
→ Results exposed again via Prometheus for visualization (e.g. Grafana)
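
The attribution step in this pipeline can be sketched as a proportional split of the node's active power by CPU time. This is a simplified illustration, not Kepler's actual implementation: the function name `attributeActivePower` and the process names are hypothetical, and the node's CPU time is approximated here by the sum over the listed processes.

```go
package main

import "fmt"

// attributeActivePower splits activeWatts across processes in proportion
// to each process's CPU time delta over the sampling interval.
// (Hypothetical helper for illustration only.)
func attributeActivePower(activeWatts float64, cpuDelta map[string]float64) map[string]float64 {
	total := 0.0
	for _, d := range cpuDelta {
		total += d
	}
	out := make(map[string]float64, len(cpuDelta))
	for name, d := range cpuDelta {
		if total > 0 {
			out[name] = activeWatts * d / total
		} else {
			// No CPU activity in the interval: nothing to attribute.
			out[name] = 0
		}
	}
	return out
}

func main() {
	// 40 W of active power; nginx used 3 s of CPU time, postgres 1 s.
	shares := attributeActivePower(40, map[string]float64{"nginx": 3, "postgres": 1})
	fmt.Println(shares["nginx"], shares["postgres"]) // prints "30 10"
}
```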

Testing

The feature was validated in a real environment using
Rittal PDUs → Prometheus / VictoriaMetrics → Kepler.

Due to the external dependency on Prometheus metrics,
no automated unit tests are included yet.
Follow-up tests can be added once the interface is stabilized.

@github-actions github-actions bot added the chore Routine tasks or maintenance label Jan 30, 2026
vimalk78 and others added 3 commits February 2, 2026 14:21
    Add comprehensive proposal for GPU power monitoring with:

    Architecture:
    - Vendor-agnostic design with registry pattern
    - NVIDIA backend using NVML + dcgm-exporter (for MIG)
    - Placeholder for AMD (ROCm SMI) and Intel (Level Zero)

    Key features:
    - Per-process power attribution via compute utilization
    - GPU sharing modes: exclusive, time-slicing, MIG
    - Idle power detection and active power calculation
    - MIG support via hybrid NVML topology + dcgm-exporter activity

    Includes Grafana screenshots demonstrating:
    - Node GPU power consumption
    - Per-process GPU power attribution
    - GPU idle power baseline

Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Signed-off-by: GitMeder <moeder@t-online.de>
Signed-off-by: GitMeder <moeder@t-online.de>
@GitMeder GitMeder force-pushed the feature/prometheus-power-meter branch from 47dfad0 to 76489b3 February 2, 2026 13:25
@github-actions github-actions bot added the docs Documentation changes label Feb 2, 2026
@vimalk78
Collaborator

vimalk78 commented Feb 2, 2026

Hi @GitMeder,
Thanks for the PR.

What does PDU power mean? At first glance it appears to be the power of the whole node, including all hardware components (disks, fans, cards, CPU, memory, etc.).
In Kepler, the power attributed to processes (based on each process's CPU usage) is the power consumed by the CPU only, coming from RAPL or hwmon. So we would need more detailed resource-usage data to attribute total PDU power to processes.

@brunnert
Contributor

brunnert commented Feb 3, 2026

> Hi @GitMeder, Thanks for the PR.
>
> What does PDU power mean? At first glance it appears to be the power of the whole node, including all hardware components (disks, fans, cards, CPU, memory, etc.). In Kepler, the power attributed to processes (based on each process's CPU usage) is the power consumed by the CPU only, coming from RAPL or hwmon. So we would need more detailed resource-usage data to attribute total PDU power to processes.

@vimalk78 Thanks for your feedback. I am working with @GitMeder on this extension. 'PDU' stands for 'power distribution unit' (e.g., https://www.rittal.com/de-de/products/PG20231215POW101/PG20240419STR003/PRO115364), so yes, these readings cover the full node including all devices (e.g., memory, disk and network), similar to the existing Redfish platform power reader (https://github.com/sustainable-computing-io/kepler/blob/main/docs/user/configuration.md#-experimental-configuration). We are aware that the current attribution logic is based on CPU usage, but providing Kepler users with a realistic full-node power value might give them a better idea of the overall power consumption. Most of it will probably end up in the 'Idle Power' bucket of the Kepler attribution anyway: https://github.com/sustainable-computing-io/kepler/blob/main/docs/developer/power-attribution-guide.md. This might also be a useful addition for nodes with ARM or RISC-V processors that lack RAPL or NVIDIA NVML access.
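
To make the 'Idle Power' point concrete, here is a minimal sketch that assumes the node reading is split by CPU usage ratio, with active power equal to total power times the ratio and the remainder counted as idle. The function name `splitNodePower` and the numbers are illustrative assumptions, not measured values or Kepler code.

```go
package main

import "fmt"

// splitNodePower divides a full-node reading into active and idle parts
// using the node's CPU usage ratio, clamped to [0, 1].
// (Hypothetical helper for illustration only.)
func splitNodePower(totalWatts, cpuUsageRatio float64) (active, idle float64) {
	if cpuUsageRatio < 0 {
		cpuUsageRatio = 0
	}
	if cpuUsageRatio > 1 {
		cpuUsageRatio = 1
	}
	active = totalWatts * cpuUsageRatio
	idle = totalWatts - active
	return active, idle
}

func main() {
	// A 400 W PDU reading on a node whose CPUs are 25% busy.
	active, idle := splitNodePower(400, 0.25)
	fmt.Println(active, idle) // prints "100 300"
}
```

Under this assumption, a lightly loaded node reports most of the PDU reading as idle power, and only the active part is distributed across processes.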

Labels

chore: Routine tasks or maintenance
docs: Documentation changes

3 participants