
_get_amdgpu_kmd_version only working for DKMS amdgpu module? #6

@traversaro

Hello everyone, and thanks for the work on the WheelNext initiative and this provider in particular. I know that this is work in progress, so feel free to ignore the issue if this is too soon to provide feedback.

I played a bit with the variant provider (with https://github.com/traversaro/variantlib-exps) on two systems. The first is a system with an MI300X on which the amdgpu kernel driver was installed via DKMS (the image is https://marketplace.digitalocean.com/apps/pytorch-rocm7). On that system, the KMD version is correctly detected by the function, and the value is consistent with the one reported by rocm-smi --showdriverversion:

root@2-6-0---ROCm-7-0-gpu-mi300x1-192gb-devcloud-atl1:~/variantlib-exps# pixi run print-amd
✨ Pixi task (print-amd): python -c 'from amd_variant_provider.detect_rocm import get_system_info as amd_get_system_info; print(amd_get_system_info())'
{'kmd_version': KMDVersion(major=6, minor=14, patch=14), 'rocm_version': ROCmVersion(major=7, minor=0, patch=0), 'gfx_arch': ['gfx9', 'gfx942']}
root@2-6-0---ROCm-7-0-gpu-mi300x1-192gb-devcloud-atl1:~/variantlib-exps# rocm-smi --showdriverversion


============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.14.14
==========================================================================================
================================== End of ROCm SMI Log ===================================

I also tried to run _get_amdgpu_kmd_version on a different system, an AMD Ryzen™ AI Max+ 395. In that case, the amdgpu driver was built as part of the kernel itself, so the /sys/module/amdgpu/version file does not exist, even though the /sys/module/amdgpu/ folder does. As a result, the _get_amdgpu_kmd_version function returns None, which is not consistent with the value returned by rocm-smi --showdriverversion:

(rocm) root@c78c8dbcd428:/notebooks/variantlib-exps# pixi run print-amd
✨ Pixi task (print-amd): python -c 'from amd_variant_provider.detect_rocm import get_system_info as amd_get_system_info; print(amd_get_system_info())'
{'rocm_version': ROCmVersion(major=7, minor=0, patch=0), 'gfx_arch': ['gfx11', 'gfx1151'], 'kmd_version': None}
(rocm) root@c78c8dbcd428:/notebooks/variantlib-exps# rocm-smi --showdriverversion


============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.14.0-32-generic
==========================================================================================
================================== End of ROCm SMI Log ===================================

I am not an expert in the AMD world, so this may be expected, but when in doubt I preferred to report the inconsistency between _get_amdgpu_kmd_version() and rocm-smi --showdriverversion.
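
To make the difference concrete, here is a minimal sketch of one possible fallback. It is not the provider's actual implementation: the function name read_amdgpu_driver_version and its parameters are hypothetical. It assumes that the string rocm-smi printed on the second system (6.14.0-32-generic) is the running kernel release, which is what it looks like but which I have not verified. The sketch reads /sys/module/amdgpu/version when it exists and otherwise falls back to the kernel release:

```python
import platform
from pathlib import Path
from typing import Callable, Optional


def read_amdgpu_driver_version(
    module_dir: Path = Path("/sys/module/amdgpu"),
    kernel_release: Callable[[], str] = platform.release,
) -> Optional[str]:
    """Hypothetical helper: return a driver version string, or None.

    - amdgpu not loaded (module_dir missing)  -> None
    - DKMS build (version file present)       -> contents of the file
    - in-tree build (no version file)         -> kernel release (assumption)
    """
    if not module_dir.is_dir():
        return None
    try:
        # DKMS-installed amdgpu exposes its own version here.
        return (module_dir / "version").read_text().strip()
    except OSError:
        # Assumed fallback for drivers built as part of the kernel:
        # report the kernel release, e.g. "6.14.0-32-generic".
        return kernel_release()


if __name__ == "__main__":
    print(read_amdgpu_driver_version())
```

Note that the fallback string has a different shape from the DKMS one (it carries a suffix such as -32-generic), so parsing it into a KMDVersion(major, minor, patch) would need extra care, and whether the two values are even comparable is a question for the maintainers.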
