Problem Description
Issue with amdsmi.amdsmi_get_energy_count() Method
Description
When using the amdsmi.amdsmi_get_energy_count() method, the change in total energy consumption reported in joules is much smaller than it should be. The same discrepancy is visible when querying the total energy consumption with the amd-smi CLI tool.
Observed Behavior
When running amd-smi metric -pE, the output is as follows:
GPU: 0
POWER:
SOCKET_POWER: 35 W
GFX_VOLTAGE: N/A mV
SOC_VOLTAGE: N/A mV
MEM_VOLTAGE: N/A mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
ENERGY:
TOTAL_ENERGY_CONSUMPTION: 16.43 J
...
After waiting for one second and retrying:
GPU: 0
POWER:
SOCKET_POWER: 35 W
GFX_VOLTAGE: N/A mV
SOC_VOLTAGE: N/A mV
MEM_VOLTAGE: N/A mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
ENERGY:
TOTAL_ENERGY_CONSUMPTION: 16.43 J
...
Expected Behavior
This does not make sense. With a steady socket power of 35 W, the formula E = P * t implies that the total energy consumption should have increased by about 35 J after one second. Instead, the reported value does not change at all.
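To make the arithmetic explicit, here is a minimal Python sketch of the expected-versus-observed comparison. The function names are hypothetical helpers written for this illustration, and the sketch assumes the socket power stays constant over the interval.

```python
def expected_energy_joules(power_watts: float, seconds: float) -> float:
    """Energy implied by E = P * t, assuming constant power."""
    return power_watts * seconds


def energy_shortfall_joules(first_j: float, second_j: float,
                            power_watts: float, seconds: float) -> float:
    """Expected energy increase minus the increase actually reported."""
    observed = second_j - first_j
    return expected_energy_joules(power_watts, seconds) - observed


# Values taken from the two amd-smi readings above, one second apart.
shortfall = energy_shortfall_joules(16.43, 16.43, power_watts=35, seconds=1)
print(f"Missing energy: {shortfall:.2f} J")  # prints "Missing energy: 35.00 J"
```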
Operating System
NAME="Rocky Linux", VERSION="9.1 (Blue Onyx)"
CPU
AMD EPYC 7V13 64-Core Processor
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.1.0
ROCm Component
amdsmi
Steps to Reproduce
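This shell script reproduces the issue. It reads the total energy consumption and socket power of GPU 0 with amd-smi metric -pE, waits five seconds, reads the energy again, and compares the difference with the expected value: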
#!/bin/bash
# Function to get the total energy consumption of GPU 0
get_energy_consumption() {
    amd-smi metric -pE | awk '/GPU: 0/,/GPU: 1/ { if ($1 == "TOTAL_ENERGY_CONSUMPTION:") print $2 }'
}
# Function to get the socket power of GPU 0
get_socket_power() {
    amd-smi metric -pE | awk '/GPU: 0/,/GPU: 1/ { if ($1 == "SOCKET_POWER:") print $2 }'
}
# Get the initial energy consumption of GPU 0
initial_energy=$(get_energy_consumption)
# Get the socket power of GPU 0
socket_power=$(get_socket_power)
# Wait for five seconds
sleep 5
# Get the energy consumption of GPU 0 after five seconds
final_energy=$(get_energy_consumption)
# Calculate the difference in energy consumption
energy_difference=$(echo "$final_energy - $initial_energy" | bc)
# Calculate the expected energy consumption over five seconds
expected_energy_consumption=$(echo "$socket_power * 5" | bc)
# Print the initial, final, and difference in energy consumption, and expected energy consumption
echo "Initial energy consumed by GPU 0: $initial_energy J"
echo "Final energy consumed by GPU 0: $final_energy J"
echo "Energy consumed by GPU 0 in the last five seconds: $energy_difference J"
echo "Expected energy consumption in last 5 seconds: $expected_energy_consumption J"
With my output being:
Initial energy consumed by GPU 0: 19.748 J
Final energy consumed by GPU 0: 19.748 J
Energy consumed by GPU 0 in the last five seconds: 0 J
Expected energy consumption in last 5 seconds: 160 J
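Since the issue is about the Python method rather than the CLI, the same check can be sketched against the Python bindings directly. This is a rough sketch under assumptions, not a verified reproduction: it assumes the dictionary returned by amdsmi.amdsmi_get_energy_count() carries a raw accumulator under 'energy_accumulator' (or 'power' in some releases) and a 'counter_resolution' in microjoules. The helper names counter_to_joules, energy_delta_joules, and measure_gpu0 are hypothetical.

```python
import time


def counter_to_joules(reading: dict) -> float:
    """Convert one energy-count reading to joules.

    Assumption: the raw accumulator is stored under 'energy_accumulator'
    (or 'power' in some releases) and 'counter_resolution' is in microjoules.
    """
    raw = reading.get("energy_accumulator", reading.get("power"))
    if raw is None:
        raise KeyError("no accumulator field found in reading")
    return raw * reading["counter_resolution"] * 1e-6


def energy_delta_joules(before: dict, after: dict) -> float:
    """Energy consumed between two readings, in joules."""
    return counter_to_joules(after) - counter_to_joules(before)


def measure_gpu0(interval_s: float = 5.0) -> float:
    """Sample GPU 0 twice; needs the amdsmi package and an AMD GPU."""
    import amdsmi  # imported lazily so the helpers above work without it

    amdsmi.amdsmi_init()
    try:
        handle = amdsmi.amdsmi_get_processor_handles()[0]
        before = amdsmi.amdsmi_get_energy_count(handle)
        time.sleep(interval_s)
        after = amdsmi.amdsmi_get_energy_count(handle)
    finally:
        amdsmi.amdsmi_shut_down()
    return energy_delta_joules(before, after)
```

Calling measure_gpu0(5.0) on the affected node and comparing the result with the socket power multiplied by five seconds would show whether the Python API is off by the same amount as the CLI.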
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
We are using the AMD HPC cluster.