[Issue]: Incorrect Energy Consumption Reported by amdsmi_get_energy_count() Method #38

@parthraut

Problem Description

Issue with amdsmi.amdsmi_get_energy_count() Method

Description

When using the amdsmi.amdsmi_get_energy_count() method, the change in total energy consumption reported in joules is far smaller than the reported power draw implies. The same behavior is visible when querying the total energy consumption with the amd-smi CLI tool.

Observed Behavior

When running amd-smi metric -pE, the output is as follows:

GPU: 0
POWER:
SOCKET_POWER: 35 W
GFX_VOLTAGE: N/A mV
SOC_VOLTAGE: N/A mV
MEM_VOLTAGE: N/A mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
ENERGY:
TOTAL_ENERGY_CONSUMPTION: 16.43 J
...

After waiting for one second and retrying:

GPU: 0
POWER:
SOCKET_POWER: 35 W
GFX_VOLTAGE: N/A mV
SOC_VOLTAGE: N/A mV
MEM_VOLTAGE: N/A mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
ENERGY:
TOTAL_ENERGY_CONSUMPTION: 16.43 J
...

Expected Behavior

The two readings are inconsistent with the reported power draw. Since E = P * t, a constant socket power of 35 W should raise the total energy consumption by about 35 J after one second, yet the reported value does not change at all.
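The expectation can be checked with a quick calculation. The sketch below is only an illustration: the function name `expected_energy_joules` is hypothetical (it is not part of amd-smi), and it assumes the power draw stays constant over the interval.

```shell
#!/bin/bash

# Hypothetical helper (not part of amd-smi): expected energy in joules
# for a constant power draw.
#   $1 = average power in watts
#   $2 = elapsed time in seconds
expected_energy_joules() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.2f\n", p * t }'
}

expected_energy_joules 35 1   # 35 W sustained for 1 s
expected_energy_joules 35 5   # 35 W sustained for 5 s
```

At 35 W this gives 35.00 J for one second and 175.00 J for five seconds, which is the scale of increase the energy counter should show.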

Operating System

NAME="Rocky Linux", VERSION="9.1 (Blue Onyx)"

CPU

AMD EPYC 7V13 64-Core Processor

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.1.0

ROCm Component

amdsmi

Steps to Reproduce

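This shell script reproduces the issue. It samples amd-smi metric -pE, waits 5 seconds, samples again, and compares the measured energy difference with the value expected from the socket power: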

#!/bin/bash

# Function to get the total energy consumption of GPU 0
get_energy_consumption() {
    amd-smi metric -pE | awk '/GPU: 0/,/GPU: 1/ { if ($1 == "TOTAL_ENERGY_CONSUMPTION:") print $2 }'
}

# Function to get the socket power of GPU 0
get_socket_power() {
    amd-smi metric -pE | awk '/GPU: 0/,/GPU: 1/ { if ($1 == "SOCKET_POWER:") print $2 }'
}

# Get the initial energy consumption of GPU 0
initial_energy=$(get_energy_consumption)

# Get the socket power of GPU 0
socket_power=$(get_socket_power)

# Wait for five seconds
sleep 5

# Get the energy consumption of GPU 0 after five seconds
final_energy=$(get_energy_consumption)

# Calculate the difference in energy consumption
energy_difference=$(echo "$final_energy - $initial_energy" | bc)

# Calculate the expected energy consumption over five seconds
expected_energy_consumption=$(echo "$socket_power * 5" | bc)

# Print the initial, final, and difference in energy consumption, and expected energy consumption
echo "Initial energy consumed by GPU 0: $initial_energy J"
echo "Final energy consumed by GPU 0: $final_energy J"
echo "Energy consumed by GPU 0 in the last five seconds: $energy_difference J"
echo "Expected energy consumption in last 5 seconds: $expected_energy_consumption J"

My output:
Initial energy consumed by GPU 0: 19.748 J
Final energy consumed by GPU 0: 19.748 J
Energy consumed by GPU 0 in the last five seconds: 0 J
Expected energy consumption in last 5 seconds: 160 J
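The script above calls amd-smi separately for the energy and power readings, so the two values come from different samples. A possible variant, sketched below, parses both fields from a single captured report. The function name `parse_gpu_metrics` is hypothetical, and the sketch assumes the output layout matches the text shown in this report; the sample string stands in for live output so the example runs without a GPU.

```shell
#!/bin/bash

# Hypothetical helper: read `amd-smi metric -pE` style text on stdin and
# print "<socket power> <total energy>" for the GPU index given as $1.
# Assumes the field layout shown in this report.
parse_gpu_metrics() {
    awk -v gpu="$1" '
        $1 == "GPU:"                                { in_gpu = ($2 == gpu) }
        in_gpu && $1 == "SOCKET_POWER:"             { power = $2 }
        in_gpu && $1 == "TOTAL_ENERGY_CONSUMPTION:" { energy = $2 }
        END { print power, energy }
    '
}

# Sample text copied from the report above, standing in for live output.
sample='GPU: 0
POWER:
SOCKET_POWER: 35 W
ENERGY:
TOTAL_ENERGY_CONSUMPTION: 16.43 J'

read -r power energy <<< "$(printf '%s\n' "$sample" | parse_gpu_metrics 0)"
echo "power=${power} W energy=${energy} J"
```

On a machine with the tool installed, the same function could be fed with `amd-smi metric -pE | parse_gpu_metrics 0`. Matching the GPU index exactly, rather than with a regular expression, keeps "GPU: 0" from also matching an index such as 10.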

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

We are using the AMD HPC cluster.
