[release/cvs-0.2.0] cvs/rccl: fix dmesg false positives and per-user RCCL hostfile#170

Merged
speriaswamy-amd merged 1 commit into release/cvs-0.2.0 from cherry-pick/cvs-0.2.0/dmesg-pattern-fix
May 18, 2026

@speriaswamy-amd speriaswamy-amd commented May 14, 2026

Contributor

Summary

Cherry-pick of #169 onto release/cvs-0.2.0. Three small fixes from debugging a 28-node rccl_perf run on a shared cluster:

  • cvs/lib/verify_lib.py: surface amdgpu runlist oversubscription as a non-fatal WARN. Per AMD docs, this is a real perf-degrading state, but the collective itself completes correctly. A new module-level warn_patterns_dict is scanned alongside err_patterns_dict inside verify_dmesg_for_errors. The signature and return shape are backward-compatible; existing callers are untouched.
  • cvs/tests/rccl/rccl_perf.py: call verify_dmesg_for_errors(..., till_end_flag=False) so each parametrized test's dmesg scan is bounded by its own start/end window (prevents one test's kernel event from cascading into N reported failures).
  • cvs/lib/rccl_lib.py: per-$USER mpirun hostfile path (/tmp/rccl_hosts_file_<USER>.txt); replaces the sudo rm -f workaround that was on this branch.
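
For reviewers who want the gist of the first two fixes without opening the diff, here is a minimal sketch of a windowed scan that separates fatal patterns from warn-only ones. It is not the code in cvs/lib/verify_lib.py: the function name scan_dmesg, the (timestamp, message) input shape, the "gpu_reset" pattern, and the returned tuple are all assumptions made for illustration. Only the names err_patterns_dict and warn_patterns_dict and the oversubscription message text come from this PR.

```python
import re
from datetime import datetime

# Hypothetical pattern tables. The real err_patterns_dict / warn_patterns_dict
# in cvs/lib/verify_lib.py may use different keys and regexes.
err_patterns_dict = {"gpu_reset": r"amdgpu.*GPU reset"}
warn_patterns_dict = {"runlist_oversubscribed": r"Runlist is getting oversubscribed"}


def scan_dmesg(entries, start, end):
    """Classify (timestamp, message) pairs that fall inside [start, end].

    Returns (errors, warnings). Only errors are meant to fail a test;
    warnings are surfaced but non-fatal.
    """
    errors, warnings = [], []
    for ts, msg in entries:
        if not (start <= ts <= end):
            continue  # outside this test's window, so not attributed to it
        if any(re.search(p, msg) for p in err_patterns_dict.values()):
            errors.append(msg)
        elif any(re.search(p, msg) for p in warn_patterns_dict.values()):
            warnings.append(msg)
    return errors, warnings


entries = [
    (datetime(2026, 5, 14, 10, 0), "amdgpu: GPU reset begin"),
    (datetime(2026, 5, 14, 10, 5), "amdgpu: Runlist is getting oversubscribed"),
]
# A test that started at 10:03 does not inherit the 10:00 kernel event.
errors, warnings = scan_dmesg(
    entries, datetime(2026, 5, 14, 10, 3), datetime(2026, 5, 14, 10, 10)
)
```

Bounding the scan on both ends is what keeps one kernel event from being counted against every later parametrized test in the same session.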

Conflict resolution notes

  • cvs/lib/rccl_lib.py: this branch had a sudo rm -f /tmp/rccl_hosts_file.txt workaround for the same multi-user collision. The per-user path supersedes it; workaround removed.
  • cvs/tests/rccl/rccl_perf.py: main gates verify_dmesg_for_errors on if can_use_sudo: (passwordless-sudo guard not yet on this branch). For this cherry-pick the call stays unconditional — only the till_end_flag value is changed — to avoid introducing an undefined can_use_sudo reference.
  • cvs/lib/verify_lib.py: auto-merged cleanly.
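
The per-user hostfile idea can be sketched as follows. This is an illustration under assumptions, not the code in cvs/lib/rccl_lib.py: the helper names rccl_hostfile_path and write_hostfile are hypothetical, and resolving the user with getpass.getuser() is one possible way to obtain $USER. Only the /tmp/rccl_hosts_file_<USER>.txt path shape comes from this PR.

```python
import getpass
import os


def rccl_hostfile_path(tmp_dir="/tmp"):
    """Return a hostfile path that is unique to the invoking user."""
    # getpass.getuser() consults LOGNAME, USER, LNAME and USERNAME in turn
    # before falling back to the password database.
    return os.path.join(tmp_dir, f"rccl_hosts_file_{getpass.getuser()}.txt")


def write_hostfile(hosts, tmp_dir="/tmp"):
    """Write one host per line to the per-user hostfile and return its path."""
    path = rccl_hostfile_path(tmp_dir)
    with open(path, "w") as fh:
        fh.write("\n".join(hosts) + "\n")
    return path
```

Because each user writes to their own file, no user ever needs to delete a file owned by someone else, which is what the removed sudo rm -f workaround was compensating for.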

Test plan

Refs: AIMVT-175

Comment thread cvs/lib/verify_lib.py
# Note: 'Runlist is getting oversubscribed' and 'Expect reduced ROCm performance'
# are amdgpu kernel info-level messages (not errors). They fire routinely on
# large multi-rank RCCL runs whenever HSA queue count exceeds the runlist
# size, even when the run itself is healthy. Excluded from failure matching.

Collaborator

Please add links to the reference doc in the comment for future reference.

@speriaswamy-amd speriaswamy-amd merged commit 0d8f4ff into release/cvs-0.2.0 May 18, 2026
1 check passed