Add IFoE TransferBench scale-up preflight check [AIMVT-181] by speriaswamy-amd · Pull Request #192 · ROCm/cvs

speriaswamy-amd · 2026-05-29T18:18:54Z

Summary

Adds an opt-in preflight check that validates IFoE (Infinity Fabric over
Ethernet, a.k.a. XGMI-over-Ethernet) scale-up data-path connectivity
one layer above the AIMVT-180 L2 ping, by running the TransferBench
candidate-branch smoketest preset on every reachable cluster node and
reconciling the binary's exit code with the per-cell
[PASS] / [FAIL] / [SKIP] markers in its output. Disabled by default
(connectivity_check.transferbench.connectivity_mode = "skip"), so it has
no effect on clusters that don't run the candidate-branch TransferBench
build.

Stacked on

This PR is stacked on #188 (AIMVT-180, IFoE L2 ping). It inserts
test_ifoe_transferbench_smoke between test_ifoe_l2_connectivity and
test_rdma_connectivity in the same runner and extends the shared report
generator, so it builds directly on the AIMVT-180 pieces. GitHub will
auto-rebase / fast-forward once #188 lands; the AIMVT-181 commit itself
adds only the new files and the new wiring.

Motivation

AIMVT-180 covers L2 reachability (one afmctl test ping per BDF / dst
accelerator pair), but does not exercise the IFoE data path the workloads
actually use. AIMVT-181 fills that gap: a fast pre-workload gate that
asks every reachable node to push real GPU-to-GPU traffic across the IFoE
fabric and validates the result before the heavier downstream tests
(RDMA full-mesh, RCCL, training) burn cycles on a broken fabric.

Technical Details

New module --- `cvs/lib/preflight/transferbench_smoke.py`

TransferBenchSmokeCheck: orchestrates the smoketest dispatch in one of
two modes:
- per_node (default) --- each reachable node runs an independent
  single-rank TransferBench against its local GPUs. Exercises intra-node
  AID↔MID IFoE hops but does not traverse the rack IFoE switch.
- multi_rank --- every reachable node is wired into one
  socket-comm cluster (TB_NUM_RANKS=N, TB_RANK=0..N-1,
  TB_MASTER_ADDR=<rank0>, TB_MASTER_PORT=<configured>) and the whole
  fleet is launched via a single phdl.exec_cmd_list call so the
  preset's socket-comm bootstrap can complete. Closest thing to a full
  fabric scale-up test the candidate branch ships today. Auto-degrades
  to per_node when fewer than two reachable hosts remain.
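The rank-mode selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in `transferbench_smoke.py`: the function name `plan_rank_env`, the default port value, and the choice of the first host as rank 0 are all assumptions made for the example.

```python
from typing import Dict, List


def plan_rank_env(hosts: List[str], rank_mode: str = "per_node",
                  master_port: int = 29500) -> Dict[str, Dict[str, str]]:
    """Return the TB_* assignments per host (empty dict means single-rank run).

    Hypothetical helper: the name and the 29500 default are illustrative only.
    """
    # multi_rank needs at least two reachable hosts; otherwise degrade to per_node.
    if rank_mode != "multi_rank" or len(hosts) < 2:
        return {host: {} for host in hosts}

    master = hosts[0]  # assumption: the first reachable host acts as rank 0
    return {
        host: {
            "TB_NUM_RANKS": str(len(hosts)),
            "TB_RANK": str(rank),
            "TB_MASTER_ADDR": master,
            "TB_MASTER_PORT": str(master_port),
        }
        for rank, host in enumerate(hosts)
    }
```

With a single reachable host, `plan_rank_env(["n0"], "multi_rank")` returns an empty assignment for that host, which corresponds to the per_node fallback.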
extract_node_pod_membership() + reconcile_cluster_vpod(): tolerant
parsers for amd-smi fabric --topology --json payloads. Handle a
flat list, the gpu_data wrapper, per-key dicts, and a plaintext
fallback for amd-smi builds without --json support. Used as a
pre-dispatch precondition: every node must report exactly one local
vpod_id and all nodes must share the same vpod_id (the smoketest
preset itself aborts with ERR_FATAL when ranks span multiple vPods, and
we want to surface that as a clear cluster-level error rather than as
an opaque exit-2 from TransferBench).
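A rough sketch of the single-vPod precondition, under stated assumptions: the helper names `extract_vpod_ids` and `reconcile_vpods` are hypothetical, each GPU entry is assumed to be a dict carrying a `vpod_id` field, and the plaintext fallback is omitted. The real amd-smi payload layout may differ.

```python
from typing import Any, Dict, List, Optional, Set, Tuple


def extract_vpod_ids(payload: Any) -> Set[str]:
    """Collect vpod_id values from a flat list, a gpu_data wrapper, or a per-key dict."""
    if isinstance(payload, dict) and "gpu_data" in payload:
        payload = payload["gpu_data"]          # unwrap {"gpu_data": [...]}
    if isinstance(payload, dict):
        entries = list(payload.values())       # per-key dict: {"gpu0": {...}, ...}
    elif isinstance(payload, list):
        entries = payload                      # flat list of GPU entries
    else:
        entries = []                           # garbage payload: nothing to extract
    return {str(e["vpod_id"]) for e in entries
            if isinstance(e, dict) and e.get("vpod_id") is not None}


def reconcile_vpods(per_node: Dict[str, Any]) -> Tuple[Optional[str], List[str]]:
    """Return (shared_vpod_id, errors); errors is empty only if the precondition holds."""
    errors: List[str] = []
    seen: Set[str] = set()
    for node, payload in sorted(per_node.items()):
        ids = extract_vpod_ids(payload)
        if len(ids) != 1:
            errors.append(f"{node}: expected exactly one local vpod_id, found {sorted(ids)}")
        seen |= ids
    if not errors and len(seen) > 1:
        errors.append(f"nodes span multiple vPods: {sorted(seen)}")
    shared = next(iter(seen)) if not errors and len(seen) == 1 else None
    return shared, errors
```

Returning a list of human-readable errors, rather than raising on the first problem, lets the report name every offending node in one pass.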
SmoketestParser: tolerant parser that accepts bracketed-verdict
([PASS]/[FAIL]/[SKIP]), marker-block (*/F/.), and aggregate
N/M PASS, x FAIL, y SKIP summary shapes. Counts the markers, captures
warnings / fatal-keyword lines, and recovers the binary's exit code
from a __TB_SMOKE_EXIT__=$? sentinel appended to stdout by the
orchestrator (so we are not at the mercy of the parallel SSH layer's
exit-code handling).
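As an illustration of the sentinel recovery and marker counting, here is a simplified sketch that handles only the bracketed-verdict shape. The function name and the returned dictionary keys are assumptions for the example; the marker-block and aggregate-summary shapes are not covered.

```python
import re
from typing import Dict, Optional

# The orchestrator appends `echo "__TB_SMOKE_EXIT__=$?"` after the binary runs.
SENTINEL_RE = re.compile(r"__TB_SMOKE_EXIT__=(-?\d+)")
VERDICT_RE = re.compile(r"\[(PASS|FAIL|SKIP)\]")


def parse_smoke_output(stdout: str) -> Dict[str, Optional[int]]:
    """Count bracketed verdict markers and recover the exit code from the sentinel."""
    counts = {"PASS": 0, "FAIL": 0, "SKIP": 0}
    for verdict in VERDICT_RE.findall(stdout):
        counts[verdict] += 1

    # Use the last sentinel in case the binary itself echoed something similar.
    matches = SENTINEL_RE.findall(stdout)
    exit_code = int(matches[-1]) if matches else None  # None means sentinel missing

    return {
        "num_pass": counts["PASS"],
        "num_fail": counts["FAIL"],
        "num_skip": counts["SKIP"],
        "exit_code": exit_code,
    }
```

Keeping `exit_code` as `None` when the sentinel is absent lets the verdict step distinguish "command never finished" from "command exited zero".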
evaluate_smoketest(): derives the per-node PASS / FAIL / WARNING
verdict from the parsed result. Verdict precedence: sentinel missing
→ FAIL; exit 2 (ERR_FATAL precondition) → FAIL with a precondition
explanation; any non-zero exit → FAIL; FAIL markers / fatal-keyword
lines despite exit zero → FAIL (defence in depth); num_skip / num_tests
over the configured max_skip_pct → WARNING; else PASS.
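The verdict precedence above could be expressed along these lines. This is a sketch, not the actual `evaluate_smoketest()`: the signature, the 50.0 default for `max_skip_pct`, the 0-100 percentage scale, and computing `num_tests` as the sum of the three marker counts are all assumptions.

```python
from typing import Optional, Tuple


def evaluate(exit_code: Optional[int], num_pass: int, num_fail: int, num_skip: int,
             fatal_lines: int = 0, max_skip_pct: float = 50.0) -> Tuple[str, str]:
    """Return (verdict, reason), checking conditions in the documented order."""
    if exit_code is None:
        return "FAIL", "exit-code sentinel missing from stdout"
    if exit_code == 2:
        return "FAIL", "ERR_FATAL precondition failure (exit 2)"
    if exit_code != 0:
        return "FAIL", f"non-zero exit code {exit_code}"
    # Defence in depth: do not trust exit 0 if the output itself reports failures.
    if num_fail > 0 or fatal_lines > 0:
        return "FAIL", "FAIL markers or fatal-keyword lines despite exit 0"
    num_tests = num_pass + num_fail + num_skip
    if num_tests and 100.0 * num_skip / num_tests > max_skip_pct:
        return "WARNING", "skip ratio exceeds max_skip_pct"
    return "PASS", "ok"
```

Because each check returns early, a node with both a missing sentinel and FAIL markers is reported for the missing sentinel, which is the earlier and more fundamental problem.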

New helper --- `cvs/lib/rocm_plib.py`

get_gpu_fabric_info_dict(phdl, use_sudo=True, amd_smi_path='amd-smi')
joins the existing amd-smi / rocm-smi helpers in this file. Returns
the parsed amd-smi fabric JSON per node for other future consumers (the
preflight orchestrator uses its own copy that also tolerates plaintext
output, so this helper sits unused for now).

New pytest entry --- `cvs/tests/preflight/preflight_checks.py`

test_ifoe_transferbench_smoke is wired in between the existing
test_ifoe_l2_connectivity and the existing test_rdma_connectivity.
Opt-in via connectivity_check.transferbench.connectivity_mode
("run" or "skip"; default "skip"). Failed nodes are reported but
not pruned from phdl --- operators decide whether to gate downstream
testing on the result. Registered with the report generator's required
checks list so it always renders in the executive summary.

New config block --- `cvs/input/config_file/preflight/preflight_config.json`

connectivity_check.transferbench: connectivity_mode, tb_binary,
rocm_path, amd_smi_path, use_sudo, preset, size_list,
num_iterations, num_warmups, always_validate, run_parallel,
use_bdma, force_single_pod, rank_mode, socket_master_port,
master_node, max_skip_pct, ssh_timeout, skip_pod_check. Every
key has an inline _comment_* doc.
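For illustration, a consumer of this block might load it roughly as follows. The helper name `transferbench_settings` is hypothetical, and rejecting unknown modes with `ValueError` is an assumption rather than documented behaviour; only the key path, the `_comment_*` convention, and the "run" / "skip" values come from the description above.

```python
from typing import Any, Dict


def transferbench_settings(config: Dict[str, Any]) -> Dict[str, Any]:
    """Extract connectivity_check.transferbench, defaulting to the opt-out mode."""
    block = config.get("connectivity_check", {}).get("transferbench", {})
    # Drop inline documentation keys such as "_comment_tb_binary".
    settings = {k: v for k, v in block.items() if not k.startswith("_comment")}
    settings.setdefault("connectivity_mode", "skip")  # disabled unless opted in
    if settings["connectivity_mode"] not in ("run", "skip"):
        raise ValueError(
            f"unsupported connectivity_mode: {settings['connectivity_mode']!r}"
        )
    return settings
```

A config with no `transferbench` block at all therefore resolves to `connectivity_mode = "skip"`, which matches the backwards-compatibility claim in the test plan.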

Reporting --- `cvs/lib/preflight/report.py`

Adds the TransferBench smoketest row to the executive summary table.
Adds a dedicated HTML section: shared vpod/ppod state from the
precondition, rank-mode, totals (nodes pass/warn/fail, tests
pass/fail/skip), and a per-node failure detail table with verdict
errors, parsed marker counts, exit code, expandable stdout, and the
rendered command.
Adds recommendations for FAIL and WARNING terminal states.

Documentation

cvs/tests/preflight/README.md: new "IFoE TransferBench Smoketest
(AIMVT-181)" section with the precondition / orchestration / verdict
details and an example config block.
cvs/input/config_file/preflight/README_preflight_config.md: full
parameter reference, structure-overview update, and a callout for the
new opt-in block.

Command shape

Each rank's command is rendered as:

[sudo] bash -c '[PATH=<rocm>/bin:$PATH LD_LIBRARY_PATH=<rocm>/lib:${LD_LIBRARY_PATH:-}] \
  NUM_ITERATIONS=<n> NUM_WARMUPS=<n> ALWAYS_VALIDATE=1 RUN_PARALLEL=1 \
  USE_REMOTE_READ=1 BLOCK_BYTES=256 [USE_BDMA=1] [FORCE_SINGLE_POD=1] \
  [TB_NUM_RANKS=<n> TB_RANK=<r> TB_MASTER_ADDR=<rank0> TB_MASTER_PORT=<port>] \
  <tb_binary> smoketest <size_list...>; echo "__TB_SMOKE_EXIT__=$?"'

The env-var prefix lives inside the bash -c so that, even with
use_sudo=True, the privileged child sees the assignments (sudo
otherwise sanitises its calling shell's environment).
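A minimal sketch of how such a command string could be assembled so that the assignments sit inside the `bash -c` payload. The function name `render_command` and its parameters are hypothetical, and the size argument used in the usage note is only a placeholder.

```python
import shlex
from typing import Dict, List

# Appended after the binary so the exit code survives in stdout.
SENTINEL = 'echo "__TB_SMOKE_EXIT__=$?"'


def render_command(tb_binary: str, size_list: List[str], env: Dict[str, str],
                   use_sudo: bool = False) -> str:
    """Build `[sudo] bash -c '<ENV...> <binary> smoketest <sizes>; echo sentinel'`."""
    assignments = [f"{name}={shlex.quote(value)}" for name, value in env.items()]
    argv = [shlex.quote(tb_binary), "smoketest"] + [shlex.quote(s) for s in size_list]
    # The assignments are part of the inner script, so the shell started by sudo
    # applies them itself instead of relying on the caller's environment.
    inner = " ".join(assignments + argv) + "; " + SENTINEL
    command = "bash -c " + shlex.quote(inner)
    return "sudo " + command if use_sudo else command
```

For example, `render_command("TransferBench", ["1M"], {"NUM_ITERATIONS": "10"}, use_sudo=True)` yields a string that starts with `sudo bash -c '` and carries `NUM_ITERATIONS=10` inside the quoted script.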

Test Plan

Unit tests: python3 -m unittest cvs.lib.preflight.unittests.test_transferbench_smoke
→ 40/40 pass. Coverage:
- Topology parser: list / gpu_data wrapper / keyed-dict / mixed-vpod /
  plaintext / garbage payloads.
- Cluster reconcile: uniform / split / missing-on-some-nodes /
  mixed-local-vPod.
- Smoketest parser: passing / failing / skip-heavy /
  fatal-precondition / marker-table fallback / empty / garbage.
- Verdict logic: every branch including sentinel-missing FAIL,
  exit-2 ERR_FATAL FAIL, skip-budget WARNING, and FAIL-markers-despite-exit-0
  defence path.
- Orchestrator: command rendering (defaults, sudo + ROCm path, multi-rank
  socket env), per_node PASS, multi-rank dispatch via
  exec_cmd_list, multi-rank degradation with 1 reachable host,
  pod-check bypass, plaintext fabric fallback, vPod-divergence FAIL,
  one-failing-node FAIL, skip-budget WARNING, no-reachable-hosts FAIL,
  exit-2 precondition FAIL.
AIMVT-180 + RDMA regression: python3 -m unittest cvs.lib.preflight.unittests.test_ifoe_l2_connectivity cvs.lib.preflight.unittests.test_rdma_connectivity cvs.lib.preflight.unittests.test_transferbench_smoke → 86/86 pass.
Lint: ruff check on the touched files passes cleanly.
Config: python3 -m json.tool on the updated preflight_config.json
parses cleanly.
Backwards compatibility: default connectivity_mode = "skip" means no
behavioral change for existing clusters; the new pytest entry records a
SKIPPED result and returns immediately without contacting nodes.

Out of Scope

Performance gating (this PR only enforces functional PASS/FAIL/SKIP;
a follow-up will layer a bandwidth-floor gate on top of the smoketest's
per-test bandwidth numbers once internal acceptance thresholds are
finalised).
Cross-pod (RNIC scale-out) data-path validation --- the smoketest
preset is intentionally a single-vPod check.
Installing TransferBench / the candidate-branch build --- AIMVT-171
(already merged) and the orchestrator fixture handle the install path.

Refs: AIMVT-181

Made with Cursor

Adds an opt-in preflight check that validates IFoE (Infinity Fabric over Ethernet, a.k.a. XGMI-over-Ethernet) scale-up data-path connectivity one layer above the AIMVT-180 L2 ping, by running the TransferBench candidate-branch `smoketest` preset on every reachable cluster node and reconciling the binary's exit code with the per-cell `[PASS] / [FAIL] / [SKIP]` markers in its output. Disabled by default (`connectivity_check.transferbench.connectivity_mode = "skip"`), so it has no effect on clusters that don't run the candidate-branch TransferBench build.

Before the smoketest dispatches, the check enforces a single-vPod precondition by parsing `amd-smi fabric --topology --json` on every reachable node -- the TransferBench smoketest preset itself exits with ERR_FATAL (exit 2) when ranks span multiple virtual pods, so we surface the underlying environment issue with a clear cluster-level error rather than blaming the binary.

Changes:
- New module `cvs/lib/preflight/transferbench_smoke.py` with:
  - `TransferBenchSmokeCheck` orchestrator (`per_node` independent runs and `multi_rank` socket-comm runs that thread `TB_NUM_RANKS=N` / `TB_RANK=0..N-1` / `TB_MASTER_ADDR=<rank0>` through one parallel SSH dispatch so the smoketest's bootstrap can complete).
  - `extract_node_pod_membership()` / `reconcile_cluster_vpod()` that handle list, `gpu_data` wrapper, and per-key dict shapes of the amd-smi fabric JSON, plus a plaintext fallback parser for builds without --json.
  - `SmoketestParser` that handles bracketed-verdict, marker-block, and aggregate-summary output shapes, and `evaluate_smoketest()` that derives the per-node PASS / FAIL / WARNING verdict from the binary's exit code (recovered via a `__TB_SMOKE_EXIT__=$?` sentinel appended to stdout), reported markers, and a configurable skip-budget.
- New pytest entry `test_ifoe_transferbench_smoke` wired into `cvs/tests/preflight/preflight_checks.py` between `test_ifoe_l2_connectivity` and `test_rdma_connectivity`.
- New `connectivity_check.transferbench` config block in `preflight_config.json` (tb_binary, rocm_path, amd_smi_path, use_sudo, preset, size_list, num_iterations / num_warmups, always_validate, run_parallel, use_bdma, force_single_pod, rank_mode, socket_master_port, master_node, max_skip_pct, ssh_timeout, skip_pod_check). Every key has an inline `_comment_*` doc.
- Executive-summary entry + dedicated HTML section in `cvs/lib/preflight/report.py` with shared vpod/ppod state, rank-mode, totals, and per-node failure detail (verdict errors, parsed marker counts, exit code, stdout snippets, and the rendered command).
- New `get_gpu_fabric_info_dict()` helper in `cvs/lib/rocm_plib.py` alongside the existing `amd-smi`/`rocm-smi` helpers, returning parsed amd-smi fabric JSON per node for other consumers.
- 40 unit tests covering the topology parser (list / gpu_data / keyed / mixed-vpod / plaintext / garbage payloads), the cluster reconcile (uniform / split / missing / mixed-local vPod), the output parser (passing / failing / skip-heavy / fatal-precondition / marker table / empty / garbage), the verdict logic (every branch including sentinel-missing and skip-budget WARNING), and the orchestrator (command rendering with sudo + ROCm path, per_node + multi_rank dispatch, multi_rank degradation, pod-check bypass, plaintext fabric fallback, no-reachable-hosts).

Documentation:
- `cvs/tests/preflight/README.md` and `cvs/input/config_file/preflight/README_preflight_config.md` updated with the new check, its precondition, the rank-mode trade-off, the verdict logic, and an example config block.

Refs: AIMVT-181

Made with [Cursor](https://cursor.com)

Co-authored-by: Cursor <cursoragent@cursor.com>

…er PATH to cluster env_vars [AIMVT-181]

PR #192 review feedback: the cluster file already exposes a top-level `env_vars` dict that the parallel SSH layer exports on every host before each command (see `cvs/input/cluster_file/README.md` and the `env_vars=env_vars` wiring at `cvs/tests/preflight/preflight_checks.py` around L206). Re-exposing `rocm_path` and `amd_smi_path` in the transferbench preflight block duplicated that mechanism for a single check and gave operators two non-orthogonal ways to point at a non-default ROCm install. This change removes the duplication and standardises on the cluster file as the single cluster-wide source of truth for tool location.

Removed:
- `connectivity_check.transferbench.rocm_path` (config key)
- `connectivity_check.transferbench.amd_smi_path` (config key)
- `TransferBenchSmokeCheck(..., rocm_path=..., amd_smi_path=...)` constructor kwargs
- `TransferBenchSmokeCheck._rocm_env_prefix()` (private helper that emitted inline `PATH=<rocm>/bin:$PATH LD_LIBRARY_PATH=<rocm>/lib:...`)
- `DEFAULT_AMD_SMI_PATH` module-level constant

Kept (intentionally):
- `connectivity_check.transferbench.tb_binary` (default `"TransferBench"`). TransferBench is a test-specific binary -- not shared infrastructure -- so it stays in the per-check config rather than polluting the cluster file with a per-test name. Defaults to PATH-resolution so a site that installs it on PATH via cluster `env_vars` gets zero-config behaviour; override here only when this single check needs to point at a different binary than the rest of the cluster's tooling. Same shape as the AIMVT-180 `afmctl_path` knob.

After this change, `TransferBenchSmokeCheck.build_command()` emits only TransferBench-semantic env vars (NUM_ITERATIONS, ALWAYS_VALIDATE, RUN_PARALLEL, FORCE_SINGLE_POD, optional TB_NUM_RANKS / TB_RANK / TB_MASTER_ADDR / TB_MASTER_PORT) inside the inner `bash -c` shell. PATH / LD_LIBRARY_PATH come exclusively from the cluster file `env_vars` block. `_amd_smi_fabric_command()` uses bare `amd-smi`, also PATH-resolved on each node.

Test updates:
- Replaced the `test_build_command_respects_sudo_and_rocm_path` test with two regression guards:
  - `test_build_command_does_not_inject_path_or_ld_library_path` -- asserts neither `PATH=` nor `LD_LIBRARY_PATH=` appears in the rendered command in either sudo or non-sudo paths.
  - `test_amd_smi_fabric_command_uses_bare_binary` -- asserts the pod membership query is exactly `[sudo ]amd-smi fabric --topology --json`.
- Added `test_constructor_rejects_removed_path_kwargs` so stale callers passing the removed kwargs fail loudly with `TypeError` instead of being silently accepted.

Doc updates:
- `cvs/input/config_file/preflight/preflight_config.json`: dropped the two keys + their `_comment_*` fields; expanded the top-level transferbench `_comment` to point operators at cluster file `env_vars`.
- `cvs/input/config_file/preflight/README_preflight_config.md`: dropped the two bullet rows; added a PATH / LD_LIBRARY_PATH note that links to `cvs/input/cluster_file/README.md`.
- `cvs/tests/preflight/README.md`: dropped the inline `PATH=<rocm>/bin... LD_LIBRARY_PATH=<rocm>/lib...` fragment from the command template; trimmed the example JSON block.

Verification:
- New unit tests: 43 / 43 pass (40 originals + 3 new regression guards).
- AIMVT-180 IFoE L2 regression: 20 / 20 pass.
- Full preflight unittest discovery sweep: 89 / 89 pass.
- `preflight_config.json` JSON validity: OK.
- Behaviour for clusters that previously set neither `rocm_path` nor `amd_smi_path`: unchanged.
- Behaviour for clusters that previously set them: same effect achieved by lifting the same PATH override into cluster file `env_vars` (the README change documents this migration).

Co-authored-by: Cursor <cursoragent@cursor.com>

speriaswamy-amd · 2026-06-03T11:26:09Z

Addressed review feedback: drop `rocm_path` / `amd_smi_path`; defer PATH to cluster `env_vars`

Pushed 57c7b7b. Six files, +98 / -57.

What changed

Removed the two redundant per-check knobs from the transferbench preflight config and from TransferBenchSmokeCheck itself:

| Removed | Replaced by |
| --- | --- |
| `connectivity_check.transferbench.rocm_path` | `env_vars` in the cluster file (single cluster-wide source of truth) |
| `connectivity_check.transferbench.amd_smi_path` | Bare `amd-smi` resolved from `PATH` (set via cluster `env_vars`) |
| `TransferBenchSmokeCheck.__init__(rocm_path=..., amd_smi_path=...)` | (gone; `TypeError` if passed) |
| `TransferBenchSmokeCheck._rocm_env_prefix()` | (gone; `build_command()` no longer emits `PATH=` / `LD_LIBRARY_PATH=`) |
| `DEFAULT_AMD_SMI_PATH` constant | (gone) |

build_command() now emits only TransferBench-semantic env vars (NUM_ITERATIONS, ALWAYS_VALIDATE, RUN_PARALLEL, FORCE_SINGLE_POD, plus the four optional TB_* socket-comm variables in multi_rank mode). _amd_smi_fabric_command() is now exactly [sudo ]amd-smi fabric --topology --json.

What was kept (intentionally) and why

tb_binary stayed in the per-check config (default "TransferBench", PATH-resolved). TransferBench is a test-specific binary, not shared infrastructure: putting it in the cluster file would require every cluster file (across health, RCCL, RVS, etc.) to know about a per-preflight binary name. This follows the same pattern as AIMVT-180's afmctl_path knob. A site that installs TransferBench on PATH via cluster env_vars gets zero-config behaviour; override tb_binary here only when this single preflight check needs to point at a different binary than the rest of the cluster's tooling.

If you'd rather we drop tb_binary too and force PATH-resolution unconditionally, happy to do that in a follow-up — let me know.

Migration for operators who previously set the removed knobs

Same effect, lifted up one layer:

 // cluster.json
 {
   "env_vars": {
-    // (was empty)
+    "PATH": "/opt/rocm/bin:$PATH",
+    "LD_LIBRARY_PATH": "/opt/rocm/lib:$LD_LIBRARY_PATH"
   },
   ...
 }

 // preflight_config.json
 {
   "connectivity_check": {
     "transferbench": {
       "connectivity_mode": "run",
-      "rocm_path": "/opt/rocm",
-      "amd_smi_path": "amd-smi",
       "tb_binary": "TransferBench",
       ...
     }
   }
 }

This is documented in cvs/input/config_file/preflight/README_preflight_config.md and cvs/tests/preflight/README.md with a link back to cvs/input/cluster_file/README.md.

Regression guards added

To prevent the duplication from creeping back, three new unit tests were added:

- test_build_command_does_not_inject_path_or_ld_library_path: asserts that neither PATH= nor LD_LIBRARY_PATH= appears in the rendered command in either sudo or non-sudo paths.
- test_amd_smi_fabric_command_uses_bare_binary: asserts the pod-membership query is exactly [sudo ]amd-smi fabric --topology --json.
- test_constructor_rejects_removed_path_kwargs: asserts passing either removed kwarg raises TypeError so stale callers fail loudly.
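The constructor guard can be illustrated with a small self-contained sketch. `SmokeCheckStandIn` is a hypothetical stand-in class, not `TransferBenchSmokeCheck`; it shows only that a Python constructor without a catch-all `**kwargs` already raises `TypeError` for removed keyword arguments, which is the behaviour the real test asserts.

```python
import unittest


class SmokeCheckStandIn:
    """Hypothetical stand-in that no longer accepts rocm_path / amd_smi_path."""

    def __init__(self, tb_binary: str = "TransferBench", use_sudo: bool = True):
        self.tb_binary = tb_binary
        self.use_sudo = use_sudo


class RemovedKwargTests(unittest.TestCase):
    def test_constructor_rejects_removed_path_kwargs(self):
        # Each removed kwarg must fail loudly instead of being silently ignored.
        for kwarg in ("rocm_path", "amd_smi_path"):
            with self.subTest(kwarg=kwarg):
                with self.assertRaises(TypeError):
                    SmokeCheckStandIn(**{kwarg: "/opt/rocm"})
```

The guard matters mostly if a later refactor adds `**kwargs` to the constructor, since that would silently swallow stale arguments again.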

Verification

| Check | Result |
| --- | --- |
| `test_transferbench_smoke` | 43 / 43 pass (40 originals + 3 new regression guards) |
| `test_ifoe_l2_connectivity` (AIMVT-180 regression) | 20 / 20 pass |
| Full preflight unittest discovery sweep | 89 / 89 pass |
| `preflight_config.json` JSON validity | OK |
| Behaviour for clusters that previously set neither | Unchanged |
| Behaviour for clusters that previously set them | Same effect via cluster `env_vars` (documented) |

Ready for another look.

speriaswamy-amd and others added 2 commits May 29, 2026 14:17

speriaswamy-amd wants to merge 2 commits into surya/aimvt-180-ifoe-l2-preflight from surya/aimvt-181-ifoe-transferbench-preflight
