[test/don't merge] CI: add GPU hang diagnostics and restrict tests to gdb.rocm#186

Closed
lumachad wants to merge 2 commits into amd-staging from users/lumachad/amd-staging/debug-gpu-hang-diagnostics

@lumachad

Collaborator

Summary

  • Adds a GPU hang diagnostics step that runs before the rocgdb tests, dumping:
    • dmesg (last 200 lines, via sudo dmesg --notime with fallbacks)
    • amd-smi static, process, metric, topology
    • rocm-smi, --showpids, --showmeminfo vram, --showuse, --showclkfrq, --showerrors
  • Restricts the test run to gdb.rocm only (--tests gdb.rocm) to reduce CI time while investigating hangs.
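The diagnostics step described above could look roughly like the following shell sketch. It is not the workflow code from this PR: the helper names `run_diag` and `collect_gpu_diagnostics`, the `RUN_DIAGNOSTICS` guard variable, and the specific dmesg fallbacks (plain `dmesg`, then `journalctl -k`) are assumptions made for illustration. Only the tool names, subcommands, and flags are taken from the summary.

```shell
#!/bin/sh
# Hypothetical sketch of a pre-test GPU diagnostics step.
# Helper names and the fallback order are assumptions, not the PR's code.

# Run one diagnostic command under a header; never fail the step.
run_diag() {
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command exited with status $?)"
    else
        echo "(skipped: $1 not found)"
    fi
    return 0
}

collect_gpu_diagnostics() {
    # Kernel log, last 200 lines. Try passwordless sudo first, then
    # unprivileged dmesg, then the journal (assumed fallbacks).
    echo "=== dmesg (last 200 lines) ==="
    {
        sudo -n dmesg --notime 2>/dev/null ||
            dmesg --notime 2>/dev/null ||
            journalctl -k --no-pager 2>/dev/null ||
            echo "(kernel log unavailable)"
    } | tail -n 200

    # amd-smi subcommands listed in the summary.
    for sub in static process metric topology; do
        run_diag amd-smi "$sub"
    done

    # rocm-smi with no arguments, then each flag listed in the summary.
    run_diag rocm-smi
    run_diag rocm-smi --showpids
    run_diag rocm-smi --showmeminfo vram
    run_diag rocm-smi --showuse
    run_diag rocm-smi --showclkfrq
    run_diag rocm-smi --showerrors
}

# Hypothetical guard: only collect when explicitly requested.
if [ "${RUN_DIAGNOSTICS:-0}" = "1" ]; then
    collect_gpu_diagnostics
fi
```

Because `run_diag` always returns 0 and reports a missing tool instead of aborting, a runner without `amd-smi` or `rocm-smi` would still reach the test step, which seems preferable for a purely investigative dump.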

Do not merge — diagnostic/investigative branch only.

lumachad changed the title from "CI: add GPU hang diagnostics and restrict tests to gdb.rocm" to "[test/don't merge] CI: add GPU hang diagnostics and restrict tests to gdb.rocm" on Jun 25, 2026
lumachad self-assigned this on Jun 25, 2026
lumachad force-pushed the users/lumachad/amd-staging/debug-gpu-hang-diagnostics branch 4 times, most recently from a4cfa62 to f6a2252, on June 25, 2026 12:58
Add a pre-test step that dumps dmesg, amd-smi, and rocm-smi output to
help diagnose GPU hangs. Restrict the rocgdb test run to gdb.rocm only
with --tests gdb.rocm.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
lumachad force-pushed the users/lumachad/amd-staging/debug-gpu-hang-diagnostics branch 2 times, most recently from c0324bc to 97c108e, on June 26, 2026 20:44
lumachad closed this on Jun 30, 2026