Replace deprecated rocm-smi with amd-smi for NUMA topology #33
Open
KerwinTsaiii wants to merge 1 commit into
rocm-smi is deprecated in favor of amd-smi. Use `amd-smi static --numa --csv` to build the device-to-NUMA mapping. Since amd-smi emits one row per cpu_list entry per GPU, collapse to the first row of each GPU and re-emit the legacy "cardN <numa_node> <numa_affinity>" layout so the downstream parsing is unchanged.
Co-authored-by: Cursor <cursoragent@cursor.com>
Motivation
`rocm-smi` is deprecated in favor of `amd-smi` (ROCm 7.x marks the legacy CLI for removal). The launcher's NUMA-affinity logic in `scripts/run_rochpl.in` still shells out to `rocm-smi --csv --showtoponuma`, which will break once `rocm-smi` is dropped. This PR migrates that call to `amd-smi` while keeping behavior identical.
Technical Details
Replaced:
`devicelist=$(${rocm_dir}/bin/rocm-smi --csv --showtoponuma | tail -n +2 | tr ',' "\t")`
with:
`amd-smi static --numa --csv` emits multiple rows per GPU (one per `cpu_list` entry), unlike `rocm-smi`'s single row per device. The `awk` filter `!seen[$1]++` keeps only the first row of each GPU.
The output is re-emitted in the legacy `cardN <numa_node> <numa_affinity>` tab-separated layout, so the downstream parsing (`device_to_numa+=(${line[1]})` and `n_devices=$(... grep -c "card")`) is unchanged.
Test Plan
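For reviewers without a GPU host, the collapse-and-re-emit step described under Technical Details can be checked offline with a rough sketch like the one below. It is not the code from this PR. The sample CSV, its column names and order, and the presence of a header row are assumptions made for illustration, not verified `amd-smi` output. The NUMA nodes are collected as a space-separated string rather than the launcher's bash array to keep the sketch portable.

```shell
# Hypothetical sample input: a header plus two rows per GPU, mimicking the
# "one row per cpu_list entry" shape. Column names and order are assumed.
sample_csv='gpu,numa_node,numa_affinity,cpu_list
0,0,0,0-15
0,0,0,128-143
1,1,1,16-31
1,1,1,144-159'

# Skip the header (NR > 1), keep only the first row seen for each GPU index
# (!seen[$1]++), and print the legacy tab-separated layout:
#   cardN <numa_node> <numa_affinity>
devicelist=$(printf '%s\n' "$sample_csv" \
  | awk -F',' 'NR > 1 && !seen[$1]++ { printf "card%s\t%s\t%s\n", $1, $2, $3 }')

# Count devices the same way the downstream parsing does.
n_devices=$(printf '%s\n' "$devicelist" | grep -c "card")

# Collect the second field (NUMA node) of each row, space-separated.
device_to_numa=$(printf '%s\n' "$devicelist" \
  | awk -F'\t' '{ printf "%s%s", sep, $2; sep = " " }')

echo "n_devices=${n_devices}"
echo "device_to_numa=${device_to_numa}"
```

With this sample the script prints `n_devices=2` and `device_to_numa=0 1`, i.e. one entry per GPU even though each GPU appears twice in the input.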
Compared `rocm-smi --showtoponuma --csv` against the new `amd-smi static --numa --csv` transform on a ROCm 7.2.0 host (8-GPU node).
Ran the downstream parsing (`device_to_numa`, `n_devices`) with the new `devicelist` output to confirm identical values.
Test Result
Both tools report the same NUMA node for each device, and the transform yields one row per GPU in device-index order:
Downstream parsing produced identical `device_to_numa` and `n_devices` values, confirming no behavioral change to CPU/GPU binding.
Submission Checklist