Add IFoE TransferBench scale-up preflight check [AIMVT-181]#192
speriaswamy-amd wants to merge 2 commits
Adds an opt-in preflight check that validates IFoE (Infinity Fabric over
Ethernet a.k.a. XGMI-over-Ethernet) scale-up data-path connectivity one
layer above the AIMVT-180 L2 ping, by running the TransferBench candidate
branch `smoketest` preset on every reachable cluster node and reconciling
the binary's exit code with the per-cell `[PASS] / [FAIL] / [SKIP]`
markers in its output. Disabled by default
(`connectivity_check.transferbench.connectivity_mode = "skip"`), so it has
no effect on clusters that don't run the candidate-branch TransferBench
build.
Before the smoketest dispatches, the check enforces a single-vPod
precondition by parsing `amd-smi fabric --topology --json` on every
reachable node -- the TransferBench smoketest preset itself exits with
ERR_FATAL (exit 2) when ranks span multiple virtual pods, so we surface
the underlying environment issue with a clear cluster-level error rather
than blaming the binary.
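The single-vPod precondition can be illustrated with a minimal sketch. It assumes the per-node vPod ids have already been extracted from the `amd-smi fabric` output; the function name `reconcile_vpods` and the input shape are hypothetical and are not the PR's `reconcile_cluster_vpod()`.

```python
from typing import Dict, Iterable, List, Tuple


def reconcile_vpods(membership: Dict[str, Iterable[str]]) -> Tuple[bool, List[str]]:
    """Check a hypothetical node -> vPod-id mapping; return (ok, errors).

    The precondition holds only when every node reports exactly one local
    vPod id and all nodes report the same id.
    """
    errors: List[str] = []
    cluster_ids = set()
    for node, ids in sorted(membership.items()):
        local = set(ids)
        if not local:
            errors.append(f"{node}: no vPod id reported")
        elif len(local) > 1:
            errors.append(f"{node}: GPUs span vPods {sorted(local)}")
        cluster_ids |= local
    if len(cluster_ids) > 1:
        errors.append(f"cluster spans vPods {sorted(cluster_ids)}")
    return (not errors, errors)
```

Reporting the errors per node before dispatch gives operators a cluster-level message instead of an exit 2 from each rank.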
Changes:
- New module `cvs/lib/preflight/transferbench_smoke.py` with:
  - `TransferBenchSmokeCheck` orchestrator (`per_node` independent runs
    and `multi_rank` socket-comm runs that thread `TB_NUM_RANKS=N` /
    `TB_RANK=0..N-1` / `TB_MASTER_ADDR=<rank0>` through one parallel SSH
    dispatch so the smoketest's bootstrap can complete).
  - `extract_node_pod_membership()` / `reconcile_cluster_vpod()` that
    handle list, `gpu_data` wrapper, and per-key dict shapes of the
    amd-smi fabric JSON, plus a plaintext fallback parser for builds
    without `--json`.
  - `SmoketestParser` that handles bracketed-verdict, marker-block, and
    aggregate-summary output shapes, and `evaluate_smoketest()` that
    derives the per-node PASS / FAIL / WARNING verdict from the binary's
    exit code (recovered via a `__TB_SMOKE_EXIT__=$?` sentinel appended
    to stdout), reported markers, and a configurable skip-budget.
- New pytest entry `test_ifoe_transferbench_smoke` wired into
`cvs/tests/preflight/preflight_checks.py` between
`test_ifoe_l2_connectivity` and `test_rdma_connectivity`.
- New `connectivity_check.transferbench` config block in
`preflight_config.json` (tb_binary, rocm_path, amd_smi_path, use_sudo,
preset, size_list, num_iterations / num_warmups, always_validate,
run_parallel, use_bdma, force_single_pod, rank_mode,
socket_master_port, master_node, max_skip_pct, ssh_timeout,
skip_pod_check). Every key has an inline `_comment_*` doc.
- Executive-summary entry + dedicated HTML section in
`cvs/lib/preflight/report.py` with shared vpod/ppod state, rank-mode,
totals, and per-node failure detail (verdict errors, parsed marker
counts, exit code, stdout snippets, and the rendered command).
- New `get_gpu_fabric_info_dict()` helper in `cvs/lib/rocm_plib.py`
alongside the existing `amd-smi`/`rocm-smi` helpers, returning parsed
amd-smi fabric JSON per node for other consumers.
- 40 unit tests covering the topology parser (list / gpu_data /
keyed / mixed-vpod / plaintext / garbage payloads), the cluster
reconcile (uniform / split / missing / mixed-local vPod), the output
parser (passing / failing / skip-heavy / fatal-precondition / marker
table / empty / garbage), the verdict logic (every branch including
sentinel-missing and skip-budget WARNING), and the orchestrator
(command rendering with sudo + ROCm path, per_node + multi_rank
dispatch, multi_rank degradation, pod-check bypass, plaintext fabric
fallback, no-reachable-hosts).
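As a rough illustration of the bracketed-verdict shape and the exit-code sentinel mentioned in the list above, the sketch below counts markers and recovers the exit code from stdout. The function name `parse_smoketest` is hypothetical, only the bracketed shape is handled, and any sample text fed to it is invented rather than real TransferBench output.

```python
import re
from collections import Counter
from typing import Optional, Tuple

# Bracketed per-cell verdicts, e.g. "[PASS]".
MARKER_RE = re.compile(r"\[(PASS|FAIL|SKIP)\]")
# Sentinel assumed to be echoed by the orchestrator after the binary exits.
SENTINEL_RE = re.compile(r"__TB_SMOKE_EXIT__=(\d+)")


def parse_smoketest(stdout: str) -> Tuple[Counter, Optional[int]]:
    """Return (marker counts, exit code or None if the sentinel is absent)."""
    counts = Counter(MARKER_RE.findall(stdout))
    matches = SENTINEL_RE.findall(stdout)
    # Use the last sentinel in case the binary's own output echoes one.
    exit_code = int(matches[-1]) if matches else None
    return counts, exit_code
```

Returning `None` for a missing sentinel keeps "the command never finished" distinct from "the binary exited 0", which the verdict logic needs.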
Documentation:
- `cvs/tests/preflight/README.md` and
`cvs/input/config_file/preflight/README_preflight_config.md` updated
with the new check, its precondition, the rank-mode trade-off, the
verdict logic, and an example config block.
Refs: AIMVT-181
Made with [Cursor](https://cursor.com)
Co-authored-by: Cursor <cursoragent@cursor.com>
…er PATH to cluster env_vars [AIMVT-181]
PR #192 review feedback: the cluster file already exposes a top-level
`env_vars` dict that the parallel SSH layer exports on every host before
each command (see `cvs/input/cluster_file/README.md` and the
`env_vars=env_vars` wiring at `cvs/tests/preflight/preflight_checks.py`
around L206). Re-exposing `rocm_path` and `amd_smi_path` in the
transferbench preflight block duplicated that mechanism for a single check
and gave operators two non-orthogonal ways to point at a non-default ROCm
install. This change removes the duplication and standardises on the
cluster file as the single cluster-wide source of truth for tool location.
Removed:
- `connectivity_check.transferbench.rocm_path` (config key)
- `connectivity_check.transferbench.amd_smi_path` (config key)
- `TransferBenchSmokeCheck(..., rocm_path=..., amd_smi_path=...)`
  constructor kwargs
- `TransferBenchSmokeCheck._rocm_env_prefix()` (private helper that
  emitted inline `PATH=<rocm>/bin:$PATH LD_LIBRARY_PATH=<rocm>/lib:...`)
- `DEFAULT_AMD_SMI_PATH` module-level constant
Kept (intentionally):
- `connectivity_check.transferbench.tb_binary` (default
  `"TransferBench"`). TransferBench is a test-specific binary -- not
  shared infrastructure -- so it stays in the per-check config rather than
  polluting the cluster file with a per-test name. Defaults to
  PATH-resolution so a site that installs it on PATH via cluster
  `env_vars` gets zero-config behaviour; override here only when this
  single check needs to point at a different binary than the rest of the
  cluster's tooling. Same shape as the AIMVT-180 `afmctl_path` knob.
After this change, `TransferBenchSmokeCheck.build_command()` emits only
TransferBench-semantic env vars (NUM_ITERATIONS, ALWAYS_VALIDATE,
RUN_PARALLEL, FORCE_SINGLE_POD, optional TB_NUM_RANKS / TB_RANK /
TB_MASTER_ADDR / TB_MASTER_PORT) inside the inner `bash -c` shell. PATH /
LD_LIBRARY_PATH come exclusively from the cluster file `env_vars` block.
`_amd_smi_fabric_command()` uses bare `amd-smi`, also PATH-resolved on
each node.
Test updates:
- Replaced the `test_build_command_respects_sudo_and_rocm_path` test with
  two regression guards:
  - `test_build_command_does_not_inject_path_or_ld_library_path` --
    asserts neither `PATH=` nor `LD_LIBRARY_PATH=` appears in the rendered
    command in either sudo or non-sudo paths.
  - `test_amd_smi_fabric_command_uses_bare_binary` -- asserts the pod
    membership query is exactly `[sudo ]amd-smi fabric --topology --json`.
- Added `test_constructor_rejects_removed_path_kwargs` so stale callers
  passing the removed kwargs fail loudly with `TypeError` instead of being
  silently accepted.
Doc updates:
- `cvs/input/config_file/preflight/preflight_config.json`: dropped the two
  keys + their `_comment_*` fields; expanded the top-level transferbench
  `_comment` to point operators at cluster file `env_vars`.
- `cvs/input/config_file/preflight/README_preflight_config.md`: dropped
  the two bullet rows; added a PATH / LD_LIBRARY_PATH note that links to
  `cvs/input/cluster_file/README.md`.
- `cvs/tests/preflight/README.md`: dropped the inline
  `PATH=<rocm>/bin... LD_LIBRARY_PATH=<rocm>/lib...` fragment from the
  command template; trimmed the example JSON block.
Verification:
- New unit tests: 43 / 43 pass (40 originals + 3 new regression guards).
- AIMVT-180 IFoE L2 regression: 20 / 20 pass.
- Full preflight unittest discovery sweep: 89 / 89 pass.
- `preflight_config.json` JSON validity: OK.
- Behaviour for clusters that previously set neither `rocm_path` nor
  `amd_smi_path`: unchanged.
- Behaviour for clusters that previously set them: same effect achieved by
  lifting the same PATH override into cluster file `env_vars` (the README
  change documents this migration).
Co-authored-by: Cursor <cursoragent@cursor.com>
Addressed review feedback: drop
| Removed | Replaced by |
|---|---|
| `connectivity_check.transferbench.rocm_path` | `env_vars` in the cluster file (single cluster-wide source of truth) |
| `connectivity_check.transferbench.amd_smi_path` | Bare `amd-smi` resolved from `PATH` (set via cluster `env_vars`) |
| `TransferBenchSmokeCheck.__init__(rocm_path=..., amd_smi_path=...)` | (gone; `TypeError` if passed) |
| `TransferBenchSmokeCheck._rocm_env_prefix()` | (gone; `build_command()` no longer emits `PATH=` / `LD_LIBRARY_PATH=`) |
| `DEFAULT_AMD_SMI_PATH` constant | (gone) |
`build_command()` now only emits TransferBench-semantic env vars (`NUM_ITERATIONS`, `ALWAYS_VALIDATE`, `RUN_PARALLEL`, `FORCE_SINGLE_POD`, plus the optional `TB_*` socket-comm trio in `multi_rank` mode). `_amd_smi_fabric_command()` is now exactly `[sudo ]amd-smi fabric --topology --json`.
What was kept (intentionally) and why
`tb_binary` stayed in the per-check config (default `"TransferBench"`, PATH-resolved). `TransferBench` is a test-specific binary, not shared infrastructure: putting it in the cluster file would require every cluster file (across health, RCCL, RVS, etc.) to know about a per-preflight binary name. The pattern matches AIMVT-180's `afmctl_path` knob, which is the same shape. A site that installs `TransferBench` on `PATH` via cluster `env_vars` gets zero-config behaviour; override `tb_binary` here only when this single preflight check needs to point at a different binary than the rest of the cluster's tooling.
If you'd rather we drop `tb_binary` too and force PATH-resolution unconditionally, happy to do that in a follow-up. Let me know.
Migration for operators who previously set the removed knobs
Same effect, lifted up one layer:
// cluster.json
 {
   "env_vars": {
-    // (was empty)
+    "PATH": "/opt/rocm/bin:$PATH",
+    "LD_LIBRARY_PATH": "/opt/rocm/lib:$LD_LIBRARY_PATH"
   },
   ...
 }
// preflight_config.json
 {
   "connectivity_check": {
     "transferbench": {
       "connectivity_mode": "run",
-      "rocm_path": "/opt/rocm",
-      "amd_smi_path": "amd-smi",
       "tb_binary": "TransferBench",
       ...
     }
   }
 }
This is documented in `cvs/input/config_file/preflight/README_preflight_config.md` and `cvs/tests/preflight/README.md` with a link back to `cvs/input/cluster_file/README.md`.
Regression guards added
To prevent the duplication from creeping back, three new unit tests:
- `test_build_command_does_not_inject_path_or_ld_library_path`: asserts that neither `PATH=` nor `LD_LIBRARY_PATH=` appears in the rendered command in either sudo or non-sudo paths.
- `test_amd_smi_fabric_command_uses_bare_binary`: asserts the pod-membership query is exactly `[sudo ]amd-smi fabric --topology --json`.
- `test_constructor_rejects_removed_path_kwargs`: asserts passing either removed kwarg raises `TypeError` so stale callers fail loudly.
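For readers who want to see the shape of such guards, here is a minimal sketch written against a stand-in class. `StandInSmokeCheck`, its constructor signature, and the `TransferBench smoketest` invocation are assumptions made for illustration; they are not the real `TransferBenchSmokeCheck` or the tests in this PR.

```python
import unittest


class StandInSmokeCheck:
    """Hypothetical stand-in, not the real TransferBenchSmokeCheck."""

    def __init__(self, tb_binary="TransferBench", use_sudo=False):
        self.tb_binary = tb_binary
        self.use_sudo = use_sudo

    def build_command(self):
        # Only a TransferBench-semantic variable is emitted; no tool paths.
        inner = f"NUM_ITERATIONS=1 {self.tb_binary} smoketest"
        prefix = "sudo " if self.use_sudo else ""
        return f"{prefix}bash -c '{inner}'"


class RegressionGuards(unittest.TestCase):
    def test_no_path_injection(self):
        for use_sudo in (False, True):
            cmd = StandInSmokeCheck(use_sudo=use_sudo).build_command()
            self.assertNotIn("PATH=", cmd)
            self.assertNotIn("LD_LIBRARY_PATH=", cmd)

    def test_rejects_removed_kwargs(self):
        # An unknown keyword argument raises TypeError at call time.
        for kwarg in ("rocm_path", "amd_smi_path"):
            with self.assertRaises(TypeError):
                StandInSmokeCheck(**{kwarg: "/opt/rocm"})
```

Because the stand-in declares explicit keyword parameters instead of accepting `**kwargs`, Python itself rejects the removed names, so a stale caller cannot be silently accepted.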
Verification
| Check | Result |
|---|---|
| `test_transferbench_smoke` | 43 / 43 pass (40 originals + 3 new regression guards) |
| `test_ifoe_l2_connectivity` (AIMVT-180 regression) | 20 / 20 pass |
| Full preflight unittest discovery sweep | 89 / 89 pass |
| `preflight_config.json` JSON validity | OK |
| Behaviour for clusters that previously set neither | Unchanged |
| Behaviour for clusters that previously set them | Same effect via cluster `env_vars` (documented) |
Ready for another look.
Summary
Adds an opt-in preflight check that validates IFoE (Infinity Fabric over
Ethernet, a.k.a. XGMI-over-Ethernet) scale-up data-path connectivity
one layer above the AIMVT-180 L2 ping, by running the TransferBench
candidate-branch `smoketest` preset on every reachable cluster node and
reconciling the binary's exit code with the per-cell
`[PASS] / [FAIL] / [SKIP]` markers in its output. Disabled by default
(`connectivity_check.transferbench.connectivity_mode = "skip"`), so it has
no effect on clusters that don't run the candidate-branch TransferBench
build.
Stacked on
This PR is stacked on #188 (AIMVT-180, IFoE L2 ping). It inserts
`test_ifoe_transferbench_smoke` between `test_ifoe_l2_connectivity` and `test_rdma_connectivity` in the same runner and extends the shared report generator, so it builds directly on the AIMVT-180 pieces. GitHub will
auto-rebase / fast-forward once #188 lands; the AIMVT-181 commit itself
adds only the new files and the new wiring.
Motivation
AIMVT-180 covers L2 reachability (one
`afmctl test ping` per BDF / dst accelerator pair), but does not exercise the IFoE data path the workloads
actually use. AIMVT-181 fills that gap: a fast pre-workload gate that
asks every reachable node to push real GPU-to-GPU traffic across the IFoE
fabric and validates the result before the heavier downstream tests
(RDMA full-mesh, RCCL, training) burn cycles on a broken fabric.
Technical Details
New module --- `cvs/lib/preflight/transferbench_smoke.py`
- `TransferBenchSmokeCheck`: orchestrates the smoketest dispatch in one of two modes:
  - `per_node` (default) --- each reachable node runs an independent single-rank TransferBench against its local GPUs. Exercises intra-node AID↔MID IFoE hops but does not traverse the rack IFoE switch.
  - `multi_rank` --- every reachable node is wired into one socket-comm cluster (`TB_NUM_RANKS=N`, `TB_RANK=0..N-1`, `TB_MASTER_ADDR=<rank0>`, `TB_MASTER_PORT=<configured>`) and the whole fleet is launched via a single `phdl.exec_cmd_list` call so the preset's socket-comm bootstrap can complete. This is the closest thing to a full fabric scale-up test the candidate branch ships today. Auto-degrades to `per_node` when fewer than two reachable hosts remain.
- `extract_node_pod_membership()` + `reconcile_cluster_vpod()`: tolerant parsers for `amd-smi fabric --topology --json` payloads. Handle a flat list, the `gpu_data` wrapper, per-key dicts, and a plaintext fallback for amd-smi builds without `--json` support. Used as a pre-dispatch precondition: every node must report exactly one local `vpod_id` and all nodes must share the same `vpod_id` (the smoketest preset itself aborts with ERR_FATAL when ranks span multiple vPods, and we want to surface that as a clear cluster-level error rather than as an opaque exit-2 from TransferBench).
- `SmoketestParser`: tolerant parser that accepts bracketed-verdict (`[PASS]` / `[FAIL]` / `[SKIP]`), marker-block (`*` / `F` / `.`), and aggregate `N/M PASS, x FAIL, y SKIP` summary shapes. Counts the markers, captures warnings / fatal-keyword lines, and recovers the binary's exit code from a `__TB_SMOKE_EXIT__=$?` sentinel appended to stdout by the orchestrator (so we are not at the mercy of the parallel SSH layer's exit-code handling).
- `evaluate_smoketest()`: derives the per-node `PASS` / `FAIL` / `WARNING` verdict from the parsed result. Verdict precedence: sentinel missing → FAIL; exit 2 (ERR_FATAL precondition) → FAIL with a precondition explanation; any non-zero exit → FAIL; FAIL markers / fatal-keyword lines despite exit zero → FAIL (defence in depth); `num_skip / num_tests` over the configured `max_skip_pct` → WARNING; else PASS.
New helper --- `cvs/lib/rocm_plib.py`
`get_gpu_fabric_info_dict(phdl, use_sudo=True, amd_smi_path='amd-smi')` joins the existing `amd-smi` / `rocm-smi` helpers in this file. Returns the parsed amd-smi fabric JSON per node for other future consumers (the preflight orchestrator uses its own copy that also tolerates plaintext output, so this helper sits unused for now).
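The verdict precedence listed for `evaluate_smoketest()` can be sketched as a chain of early returns. This is an illustration under assumptions, not the PR's code: the function name `evaluate`, its flat argument list, the reading of `max_skip_pct` as a 0-100 percentage, and the default of 50.0 are all hypothetical.

```python
from typing import Optional, Tuple


def evaluate(exit_code: Optional[int], num_pass: int, num_fail: int,
             num_skip: int, fatal_lines: int = 0,
             max_skip_pct: float = 50.0) -> Tuple[str, str]:
    """Return (verdict, reason) following the precedence described above."""
    if exit_code is None:
        return "FAIL", "exit-code sentinel missing from stdout"
    if exit_code == 2:
        return "FAIL", "ERR_FATAL precondition (e.g. ranks span multiple vPods)"
    if exit_code != 0:
        return "FAIL", f"non-zero exit code {exit_code}"
    if num_fail or fatal_lines:
        # Defence in depth: do not trust exit 0 if the output disagrees.
        return "FAIL", "failure markers present despite exit 0"
    num_tests = num_pass + num_fail + num_skip
    # Assumption: max_skip_pct is a percentage in the range 0-100.
    if num_tests and 100.0 * num_skip / num_tests > max_skip_pct:
        return "WARNING", "skip ratio exceeds max_skip_pct"
    # Note: zero parsed tests falls through to PASS in this sketch.
    return "PASS", ""
```

Ordering the checks from least to most trustworthy signal means a missing sentinel or a non-zero exit is never masked by a clean-looking marker count.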
New pytest entry --- `cvs/tests/preflight/preflight_checks.py`
`test_ifoe_transferbench_smoke` is wired in between the existing `test_ifoe_l2_connectivity` and the existing `test_rdma_connectivity`. Opt-in via `connectivity_check.transferbench.connectivity_mode` (`"run"` or `"skip"`; default `"skip"`). Failed nodes are reported but not pruned from `phdl` --- operators decide whether to gate downstream testing on the result. Registered with the report generator's required checks list so it always renders in the executive summary.
New config block --- `cvs/input/config_file/preflight/preflight_config.json`
`connectivity_check.transferbench`: `connectivity_mode`, `tb_binary`, `rocm_path`, `amd_smi_path`, `use_sudo`, `preset`, `size_list`, `num_iterations`, `num_warmups`, `always_validate`, `run_parallel`, `use_bdma`, `force_single_pod`, `rank_mode`, `socket_master_port`, `master_node`, `max_skip_pct`, `ssh_timeout`, `skip_pod_check`. Every key has an inline `_comment_*` doc.
Reporting --- `cvs/lib/preflight/report.py`
precondition, rank-mode, totals (nodes pass/warn/fail, tests
pass/fail/skip), and a per-node failure detail table with verdict
errors, parsed marker counts, exit code, expandable stdout, and the
rendered command.
Documentation
cvs/tests/preflight/README.md: new "IFoE TransferBench Smoketest(AIMVT-181)" section with the precondition / orchestration / verdict
details and an example config block.
cvs/input/config_file/preflight/README_preflight_config.md: fullparameter reference, structure-overview update, and a callout for the
new opt-in block.
Command shape
Each rank's command is rendered as:
The env-var prefix lives inside the `bash -c` so that, even with `use_sudo=True`, the privileged child sees the assignments (`sudo` otherwise sanitises its calling shell's environment).
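A minimal sketch of this command shape follows, assuming the layout described above: env assignments inside the inner `bash -c`, an optional `sudo` prefix, and the exit-code sentinel echoed afterwards. The helper names `rank_envs` and `render_command`, the `TransferBench smoketest` invocation, and the choice of the first host as rank 0 are assumptions for illustration, not the PR's `build_command()`.

```python
import shlex
from typing import Dict, List

SENTINEL = "__TB_SMOKE_EXIT__"


def rank_envs(hosts: List[str], master_port: int) -> List[Dict[str, str]]:
    """Per-host TB_* socket-comm variables; rank 0 is assumed to be hosts[0]."""
    if len(hosts) < 2:
        # Degrade to independent single-rank runs: no socket-comm variables.
        return [{} for _ in hosts]
    return [
        {
            "TB_NUM_RANKS": str(len(hosts)),
            "TB_RANK": str(rank),
            "TB_MASTER_ADDR": hosts[0],
            "TB_MASTER_PORT": str(master_port),
        }
        for rank in range(len(hosts))
    ]


def render_command(env: Dict[str, str], tb_binary: str = "TransferBench",
                   preset: str = "smoketest", use_sudo: bool = False) -> str:
    """Render one rank's command with the env prefix inside `bash -c`."""
    tokens = [f"{key}={shlex.quote(value)}" for key, value in env.items()]
    tokens += [shlex.quote(tb_binary), shlex.quote(preset)]
    # `$?` is expanded by the inner bash, so it reports the binary's exit code.
    inner = " ".join(tokens) + f"; echo {SENTINEL}=$?"
    prefix = "sudo " if use_sudo else ""
    return f"{prefix}bash -c {shlex.quote(inner)}"
```

Quoting the whole inner script with `shlex.quote` keeps `$?` from being expanded by the outer shell, and the assignments travel with the script rather than with the caller's environment.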
Test Plan
`python3 -m unittest cvs.lib.preflight.unittests.test_transferbench_smoke` → 40/40 pass. Coverage:
`gpu_data` wrapper / keyed-dict / mixed-vpod / plaintext / garbage payloads.
mixed-local-vPod.
fatal-precondition / marker-table fallback / empty / garbage.
exit-2 ERR_FATAL FAIL, skip-budget WARNING, and FAIL-markers-despite-exit-0
defence path.
socket env),
`per_node` PASS, multi-rank dispatch via `exec_cmd_list`, multi-rank degradation with 1 reachable host, pod-check bypass, plaintext fabric fallback, vPod-divergence FAIL,
one-failing-node FAIL, skip-budget WARNING, no-reachable-hosts FAIL,
exit-2 precondition FAIL.
`python3 -m unittest cvs.lib.preflight.unittests.test_ifoe_l2_connectivity cvs.lib.preflight.unittests.test_rdma_connectivity cvs.lib.preflight.unittests.test_transferbench_smoke` → 86/86 pass.
`ruff check` on the touched files passes cleanly.
`python3 -m json.tool` on the updated `preflight_config.json` parses cleanly.
`connectivity_mode = "skip"` means no behavioral change for existing clusters; the new pytest entry records a
SKIPPED result and returns immediately without contacting nodes.
Out of Scope
a follow-up will layer a bandwidth-floor gate on top of the smoketest's
per-test bandwidth numbers once internal acceptance thresholds are
finalised).
preset is intentionally a single-vPod check.
(already merged) and the orchestrator fixture handle the install path.
Refs: AIMVT-181
Made with Cursor