Replace deprecated rocm-smi with amd-smi for NUMA topology #33

Open
KerwinTsaiii wants to merge 1 commit into ROCm:main from KerwinTsaiii:replace-rocm-smi-with-amd-smi

@KerwinTsaiii

Motivation

rocm-smi is deprecated in favor of amd-smi (ROCm 7.x marks the legacy CLI for removal). The launcher's NUMA-affinity logic in scripts/run_rochpl.in still shells out to rocm-smi --csv --showtoponuma, so it will break once rocm-smi is dropped. This PR migrates that call to amd-smi while keeping behavior identical.

Technical Details

  • Replaced:

    devicelist=$(${rocm_dir}/bin/rocm-smi --csv --showtoponuma | tail -n +2 | tr ',' "\t")

    with:

    devicelist=$(${rocm_dir}/bin/amd-smi static --numa --csv | tail -n +2 | \
      awk -F',' 'NF>1 && !seen[$1]++ {printf "card%s\t%s\t%s\n", $1, $2, $3}')
  • amd-smi static --numa --csv emits multiple rows per GPU (one per cpu_list entry), unlike rocm-smi's single row per device. The awk filter !seen[$1]++ keeps only the first row of each GPU, and the NF>1 guard skips blank or single-field lines.

  • The output is re-emitted in the legacy cardN <numa_node> <numa_affinity> tab-separated layout, so the downstream parsing (device_to_numa+=(${line[1]}) and n_devices=$(... grep -c "card")) is unchanged.
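The collapse-and-re-emit step described above can be tried offline, without a GPU, by feeding the same awk filter a canned CSV. This is a minimal sketch: the sample rows, the header names, and the cpu_list values are hypothetical and only mimic the "several rows per GPU" shape described in this PR; real amd-smi output may carry different headers or extra columns.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `amd-smi static --numa --csv` output.
# Only the column order (gpu, node, affinity) is taken from the awk in this PR;
# the header names and cpu_list values are made up for illustration.
sample_csv='gpu,node,affinity,cpu_list
0,0,0,0-15
0,0,0,64-79
1,1,1,16-31
1,1,1,80-95
'

# Same pipeline as the PR, with the canned CSV in place of the amd-smi call:
#   tail -n +2    drops the header row
#   NF>1          skips blank or single-field lines
#   !seen[$1]++   keeps only the first row seen for each GPU index
devicelist=$(printf '%s' "$sample_csv" | tail -n +2 | \
  awk -F',' 'NF>1 && !seen[$1]++ {printf "card%s\t%s\t%s\n", $1, $2, $3}')

printf '%s\n' "$devicelist"
```

With this sample, the four input rows collapse to two tab-separated lines, one for card0 and one for card1.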

Note: The message "WARNING: AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status" is not related to this change. It is emitted by the underlying AMD SMI library / amdgpu runtime power management whenever a GPU is queried while in a low-power (runtime-suspended) state, and both rocm-smi and amd-smi print it. It is harmless and out of scope for this PR.

Test Plan

  • Compared rocm-smi --showtoponuma --csv against the new amd-smi static --numa --csv transform on a ROCm 7.2.0 host (8-GPU node).
  • Simulated the downstream parsing (device_to_numa, n_devices) with the new devicelist output to confirm identical values.
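The downstream-parsing simulation mentioned above could look roughly like the following. This is a sketch under assumptions: only the `device_to_numa+=(${line[1]})` append and the `grep -c "card"` count are quoted in this PR, so the surrounding `while read` loop and the four-GPU `devicelist` are hypothetical and may differ from what scripts/run_rochpl.in actually does.

```shell
#!/usr/bin/env bash
# Hypothetical devicelist in the legacy "cardN <numa_node> <numa_affinity>" layout.
devicelist=$(printf 'card0\t0\t0\ncard1\t0\t0\ncard2\t1\t1\ncard3\t1\t1')

# Assumed loop shape: split each row on tabs and record column 2 (the NUMA node).
device_to_numa=()
while IFS=$'\t' read -r -a line; do
  # Ignore rows that do not have at least a card name and a NUMA node.
  [ "${#line[@]}" -ge 2 ] || continue
  device_to_numa+=("${line[1]}")
done <<< "$devicelist"

# Device count, using the same grep -c "card" idea quoted in the PR.
n_devices=$(printf '%s\n' "$devicelist" | grep -c "card")

echo "n_devices=${n_devices}"
echo "device_to_numa=${device_to_numa[*]}"
```

Running the same loop once with the legacy rocm-smi pipeline and once with the new amd-smi pipeline as the source of `devicelist` gives two pairs of values that can be compared directly.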

Test Result

  • Both tools report the same NUMA node for each device, and the transform yields one row per GPU in device-index order:

    $ devicelist=$(${rocm_dir}/bin/amd-smi static --numa --csv | tail -n +2 | \
        awk -F',' 'NF>1 && !seen[$1]++ {printf "card%s\t%s\t%s\n", $1, $2, $3}')
    $ echo "$devicelist"
    card0   0   0
    card1   0   0
    card2   0   0
    card3   0   0
    card4   1   1
    card5   1   1
    card6   1   1
    card7   1   1
  • Downstream parsing produced identical device_to_numa and n_devices values, confirming no behavioral change to CPU/GPU binding.
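A reviewer who wants to repeat the comparison could script it along these lines. This is only a sketch: the two variables hold hypothetical canned mappings, and on a real host they would be filled from the legacy rocm-smi pipeline and the new amd-smi pipeline shown in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical captured outputs; on a real node, populate these from the two
# pipelines instead of hard-coding them.
legacy_list=$(printf 'card0\t0\t0\ncard1\t1\t1')
new_list=$(printf 'card0\t0\t0\ncard1\t1\t1')

# Compare only the fields the launcher consumes: device name and NUMA node.
legacy_map=$(printf '%s\n' "$legacy_list" | cut -f1,2)
new_map=$(printf '%s\n' "$new_list" | cut -f1,2)

if [ "$legacy_map" = "$new_map" ]; then
  result="match"
else
  result="mismatch"
fi
echo "NUMA mapping: ${result}"
```

Restricting the comparison to the first two columns keeps the check focused on the values that feed device_to_numa and n_devices.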

Submission Checklist

rocm-smi is deprecated in favor of amd-smi. Use `amd-smi static --numa
--csv` to build the device-to-NUMA mapping. Since amd-smi emits one row
per cpu_list entry per GPU, collapse to the first row of each GPU and
re-emit the legacy "cardN <numa_node> <numa_affinity>" layout so the
downstream parsing is unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
KerwinTsaiii requested a review from pbauman as a code owner on June 16, 2026 08:29