Add IFoE TransferBench scale-up preflight check [AIMVT-181] #192

Open
speriaswamy-amd wants to merge 2 commits into surya/aimvt-180-ifoe-l2-preflight from surya/aimvt-181-ifoe-transferbench-preflight

@speriaswamy-amd

Summary

Adds an opt-in preflight check that validates IFoE (Infinity Fabric over
Ethernet, a.k.a. XGMI-over-Ethernet) scale-up data-path connectivity
one layer above the AIMVT-180 L2 ping, by running the TransferBench
candidate-branch smoketest preset on every reachable cluster node and
reconciling the binary's exit code with the per-cell
[PASS] / [FAIL] / [SKIP] markers in its output. Disabled by default
(connectivity_check.transferbench.connectivity_mode = "skip"), so it has
no effect on clusters that don't run the candidate-branch TransferBench
build.

Stacked on

This PR is stacked on #188 (AIMVT-180, IFoE L2 ping). It inserts
test_ifoe_transferbench_smoke between test_ifoe_l2_connectivity and
test_rdma_connectivity in the same runner and extends the shared report
generator, so it builds directly on the AIMVT-180 pieces. GitHub will
auto-rebase / fast-forward once #188 lands; the AIMVT-181 commit itself
adds only the new files and the new wiring.

Motivation

AIMVT-180 covers L2 reachability (one afmctl test ping per BDF / dst
accelerator pair), but does not exercise the IFoE data path the workloads
actually use. AIMVT-181 fills that gap: a fast pre-workload gate that
asks every reachable node to push real GPU-to-GPU traffic across the IFoE
fabric and validates the result before the heavier downstream tests
(RDMA full-mesh, RCCL, training) burn cycles on a broken fabric.

Technical Details

New module --- cvs/lib/preflight/transferbench_smoke.py

  • TransferBenchSmokeCheck: orchestrates the smoketest dispatch in one of
    two modes:
    • per_node (default) --- each reachable node runs an independent
      single-rank TransferBench against its local GPUs. Exercises intra-node
      AID↔MID IFoE hops but does not traverse the rack IFoE switch.
    • multi_rank --- every reachable node is wired into one
      socket-comm cluster (TB_NUM_RANKS=N, TB_RANK=0..N-1,
      TB_MASTER_ADDR=<rank0>, TB_MASTER_PORT=<configured>) and the whole
      fleet is launched via a single phdl.exec_cmd_list call so the
      preset's socket-comm bootstrap can complete. Closest thing to a full
      fabric scale-up test the candidate branch ships today. Auto-degrades
      to per_node when fewer than two reachable hosts remain.
  • extract_node_pod_membership() + reconcile_cluster_vpod(): tolerant
    parsers for amd-smi fabric --topology --json payloads. Handle a
    flat list, the gpu_data wrapper, per-key dicts, and a plaintext
    fallback for amd-smi builds without --json support. Used as a
    pre-dispatch precondition: every node must report exactly one local
    vpod_id and all nodes must share the same vpod_id (the smoketest
    preset itself aborts with ERR_FATAL when ranks span multiple vPods, and
    we want to surface that as a clear cluster-level error rather than as
    an opaque exit-2 from TransferBench).
  • SmoketestParser: tolerant parser that accepts bracketed-verdict
    ([PASS]/[FAIL]/[SKIP]), marker-block (*/F/.), and aggregate
    N/M PASS, x FAIL, y SKIP summary shapes. Counts the markers, captures
    warnings / fatal-keyword lines, and recovers the binary's exit code
    from a __TB_SMOKE_EXIT__=$? sentinel appended to stdout by the
    orchestrator (so we are not at the mercy of the parallel SSH layer's
    exit-code handling).
  • evaluate_smoketest(): derives the per-node PASS / FAIL / WARNING
    verdict from the parsed result. Verdict precedence: sentinel missing
    → FAIL; exit 2 (ERR_FATAL precondition) → FAIL with a precondition
    explanation; any non-zero exit → FAIL; FAIL markers / fatal-keyword
    lines despite exit zero → FAIL (defence in depth); num_skip / num_tests
    over the configured max_skip_pct → WARNING; else PASS.
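
The verdict precedence above can be sketched roughly as follows. This is a minimal illustration, not the module's code: `ParsedRun`, `parse_output`, and `evaluate` are hypothetical names, only the bracketed-verdict output shape is handled, fatal-keyword detection is omitted, and `max_skip_pct` is assumed to be a percentage in the range 0 to 100.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The sentinel format comes from the PR text; every name below is illustrative.
SENTINEL_RE = re.compile(r"__TB_SMOKE_EXIT__=(\d+)")
MARKER_RE = re.compile(r"\[(PASS|FAIL|SKIP)\]")


@dataclass
class ParsedRun:
    exit_code: Optional[int]  # None when the sentinel never appeared in stdout
    num_pass: int
    num_fail: int
    num_skip: int


def parse_output(stdout: str) -> ParsedRun:
    """Count bracketed verdict markers and recover the exit code sentinel."""
    match = SENTINEL_RE.search(stdout)
    counts = {"PASS": 0, "FAIL": 0, "SKIP": 0}
    for verdict in MARKER_RE.findall(stdout):
        counts[verdict] += 1
    return ParsedRun(
        exit_code=int(match.group(1)) if match else None,
        num_pass=counts["PASS"],
        num_fail=counts["FAIL"],
        num_skip=counts["SKIP"],
    )


def evaluate(run: ParsedRun, max_skip_pct: float) -> str:
    """Apply the precedence order from the description; first match wins."""
    if run.exit_code is None:
        return "FAIL"  # sentinel missing
    if run.exit_code == 2:
        return "FAIL"  # ERR_FATAL precondition
    if run.exit_code != 0:
        return "FAIL"  # any other non-zero exit
    if run.num_fail > 0:
        return "FAIL"  # FAIL markers despite exit zero (defence in depth)
    total = run.num_pass + run.num_fail + run.num_skip
    if total and 100.0 * run.num_skip / total > max_skip_pct:
        return "WARNING"  # skip budget exceeded
    return "PASS"
```

Checking the sentinel first means a truncated or dropped SSH session can never be mistaken for a clean run, which is the reason the description gives for not trusting the transport layer's exit code.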

New helper --- cvs/lib/rocm_plib.py

get_gpu_fabric_info_dict(phdl, use_sudo=True, amd_smi_path='amd-smi')
joins the existing amd-smi / rocm-smi helpers in this file. It returns
the parsed amd-smi fabric JSON per node for future consumers. The
preflight orchestrator uses its own copy that also tolerates plaintext
output, so this helper is currently unused.

New pytest entry --- cvs/tests/preflight/preflight_checks.py

test_ifoe_transferbench_smoke is wired in between the existing
test_ifoe_l2_connectivity and the existing test_rdma_connectivity.
Opt-in via connectivity_check.transferbench.connectivity_mode
("run" or "skip"; default "skip"). Failed nodes are reported but
not pruned from phdl --- operators decide whether to gate downstream
testing on the result. Registered with the report generator's required
checks list so it always renders in the executive summary.

New config block --- cvs/input/config_file/preflight/preflight_config.json

connectivity_check.transferbench: connectivity_mode, tb_binary,
rocm_path, amd_smi_path, use_sudo, preset, size_list,
num_iterations, num_warmups, always_validate, run_parallel,
use_bdma, force_single_pod, rank_mode, socket_master_port,
master_node, max_skip_pct, ssh_timeout, skip_pod_check. Every
key has an inline _comment_* doc.

Reporting --- cvs/lib/preflight/report.py

  • Adds the TransferBench smoketest row to the executive summary table.
  • Adds a dedicated HTML section: shared vpod/ppod state from the
    precondition, rank-mode, totals (nodes pass/warn/fail, tests
    pass/fail/skip), and a per-node failure detail table with verdict
    errors, parsed marker counts, exit code, expandable stdout, and the
    rendered command.
  • Adds recommendations for FAIL and WARNING terminal states.

Documentation

  • cvs/tests/preflight/README.md: new "IFoE TransferBench Smoketest
    (AIMVT-181)" section with the precondition / orchestration / verdict
    details and an example config block.
  • cvs/input/config_file/preflight/README_preflight_config.md: full
    parameter reference, structure-overview update, and a callout for the
    new opt-in block.

Command shape

Each rank's command is rendered as:

[sudo] bash -c '[PATH=<rocm>/bin:$PATH LD_LIBRARY_PATH=<rocm>/lib:${LD_LIBRARY_PATH:-}] \
  NUM_ITERATIONS=<n> NUM_WARMUPS=<n> ALWAYS_VALIDATE=1 RUN_PARALLEL=1 \
  USE_REMOTE_READ=1 BLOCK_BYTES=256 [USE_BDMA=1] [FORCE_SINGLE_POD=1] \
  [TB_NUM_RANKS=<n> TB_RANK=<r> TB_MASTER_ADDR=<rank0> TB_MASTER_PORT=<port>] \
  <tb_binary> smoketest <size_list...>; echo "__TB_SMOKE_EXIT__=$?"'

The env-var prefix lives inside the bash -c so that, even with
use_sudo=True, the privileged child sees the assignments (sudo
otherwise sanitises its calling shell's environment).
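
As a rough sketch of that quoting strategy, the function below assembles the env-var prefix, the binary, and the exit-code sentinel into one string and passes it as the single argument to bash -c. The name `render_command`, its parameters, and the `1M` size used in the usage note are assumptions for illustration; the real `build_command()` may differ.

```python
import shlex
from typing import Dict, List, Optional


def render_command(
    tb_binary: str,
    size_list: List[str],
    env: Dict[str, str],
    use_sudo: bool = False,
    rank_env: Optional[Dict[str, str]] = None,
) -> str:
    """Render one rank's command with the env assignments inside bash -c."""
    assignments = dict(env)
    if rank_env:
        # Socket-comm variables are only present in multi_rank mode.
        assignments.update(rank_env)
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in assignments.items())
    sizes = " ".join(shlex.quote(s) for s in size_list)
    # The sentinel echo runs in the same inner shell, so $? is the binary's exit code.
    inner = (
        f"{prefix} {shlex.quote(tb_binary)} smoketest {sizes}; "
        'echo "__TB_SMOKE_EXIT__=$?"'
    )
    outer = f"bash -c {shlex.quote(inner)}"
    # sudo wraps the whole bash -c, so the assignments survive env sanitising.
    return f"sudo {outer}" if use_sudo else outer
```

For example, `render_command("TransferBench", ["1M"], {"NUM_ITERATIONS": "1"}, use_sudo=True)` yields a string that starts with `sudo bash -c '` and keeps `NUM_ITERATIONS=1` inside the quoted inner script.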

Test Plan

  • Unit tests: python3 -m unittest cvs.lib.preflight.unittests.test_transferbench_smoke
    40/40 pass. Coverage:
    • Topology parser: list / gpu_data wrapper / keyed-dict / mixed-vpod /
      plaintext / garbage payloads.
    • Cluster reconcile: uniform / split / missing-on-some-nodes /
      mixed-local-vPod.
    • Smoketest parser: passing / failing / skip-heavy /
      fatal-precondition / marker-table fallback / empty / garbage.
    • Verdict logic: every branch including sentinel-missing FAIL,
      exit-2 ERR_FATAL FAIL, skip-budget WARNING, and FAIL-markers-despite-exit-0
      defence path.
    • Orchestrator: command rendering (defaults, sudo + ROCm path, multi-rank
      socket env), per_node PASS, multi-rank dispatch via
      exec_cmd_list, multi-rank degradation with 1 reachable host,
      pod-check bypass, plaintext fabric fallback, vPod-divergence FAIL,
      one-failing-node FAIL, skip-budget WARNING, no-reachable-hosts FAIL,
      exit-2 precondition FAIL.
  • AIMVT-180 + RDMA regression: python3 -m unittest cvs.lib.preflight.unittests.test_ifoe_l2_connectivity cvs.lib.preflight.unittests.test_rdma_connectivity cvs.lib.preflight.unittests.test_transferbench_smoke
    86/86 pass.
  • Lint: ruff check on the touched files passes cleanly.
  • Config: python3 -m json.tool on the updated preflight_config.json
    parses cleanly.
  • Backwards compatibility: default connectivity_mode = "skip" means no
    behavioral change for existing clusters; the new pytest entry records a
    SKIPPED result and returns immediately without contacting nodes.

Out of Scope

  • Performance gating (this PR only enforces functional PASS/FAIL/SKIP;
    a follow-up will layer a bandwidth-floor gate on top of the smoketest's
    per-test bandwidth numbers once internal acceptance thresholds are
    finalised).
  • Cross-pod (RNIC scale-out) data-path validation --- the smoketest
    preset is intentionally a single-vPod check.
  • Installing TransferBench / the candidate-branch build --- AIMVT-171
    (already merged) and the orchestrator fixture handle the install path.

Refs: AIMVT-181

Made with Cursor

speriaswamy-amd and others added 2 commits May 29, 2026 14:17
Adds an opt-in preflight check that validates IFoE (Infinity Fabric over
Ethernet a.k.a. XGMI-over-Ethernet) scale-up data-path connectivity one
layer above the AIMVT-180 L2 ping, by running the TransferBench candidate
branch `smoketest` preset on every reachable cluster node and reconciling
the binary's exit code with the per-cell `[PASS] / [FAIL] / [SKIP]`
markers in its output. Disabled by default
(`connectivity_check.transferbench.connectivity_mode = "skip"`), so it has
no effect on clusters that don't run the candidate-branch TransferBench
build.

Before the smoketest dispatches, the check enforces a single-vPod
precondition by parsing `amd-smi fabric --topology --json` on every
reachable node -- the TransferBench smoketest preset itself exits with
ERR_FATAL (exit 2) when ranks span multiple virtual pods, so we surface
the underlying environment issue with a clear cluster-level error rather
than blaming the binary.
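
A minimal sketch of that single-vPod precondition, assuming the per-node vpod_id values have already been extracted from the amd-smi payload into sets of strings. The function name `reconcile_vpods` and its return shape are hypothetical and do not mirror `reconcile_cluster_vpod()`.

```python
from typing import Dict, Optional, Set, Tuple


def reconcile_vpods(
    per_node: Dict[str, Set[str]],
) -> Tuple[bool, Optional[str], Dict[str, str]]:
    """Return (ok, shared_vpod_id, errors) for already-extracted vpod_id sets."""
    errors: Dict[str, str] = {}
    seen: Set[str] = set()
    for node, vpods in sorted(per_node.items()):
        if not vpods:
            errors[node] = "no vpod_id reported"
        elif len(vpods) > 1:
            errors[node] = f"multiple local vpod_ids: {sorted(vpods)}"
        else:
            seen |= vpods
    if len(seen) > 1:
        # Each node is internally consistent, but the nodes disagree.
        errors["<cluster>"] = f"nodes span vpod_ids: {sorted(seen)}"
    ok = not errors and len(seen) == 1
    return ok, (next(iter(seen)) if ok else None), errors
```

Collecting every per-node problem before returning, rather than stopping at the first, lets the report name all offending nodes in one pass.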

Changes:
- New module `cvs/lib/preflight/transferbench_smoke.py` with:
  - `TransferBenchSmokeCheck` orchestrator (`per_node` independent runs
    and `multi_rank` socket-comm runs that thread `TB_NUM_RANKS=N` /
    `TB_RANK=0..N-1` / `TB_MASTER_ADDR=<rank0>` through one parallel SSH
    dispatch so the smoketest's bootstrap can complete).
  - `extract_node_pod_membership()` / `reconcile_cluster_vpod()` that
    handle list, `gpu_data` wrapper, and per-key dict shapes of the
    amd-smi fabric JSON, plus a plaintext fallback parser for builds
    without --json.
  - `SmoketestParser` that handles bracketed-verdict, marker-block, and
    aggregate-summary output shapes, and `evaluate_smoketest()` that
    derives the per-node PASS / FAIL / WARNING verdict from the binary's
    exit code (recovered via a `__TB_SMOKE_EXIT__=$?` sentinel appended
    to stdout), reported markers, and a configurable skip-budget.
- New pytest entry `test_ifoe_transferbench_smoke` wired into
  `cvs/tests/preflight/preflight_checks.py` between
  `test_ifoe_l2_connectivity` and `test_rdma_connectivity`.
- New `connectivity_check.transferbench` config block in
  `preflight_config.json` (tb_binary, rocm_path, amd_smi_path, use_sudo,
  preset, size_list, num_iterations / num_warmups, always_validate,
  run_parallel, use_bdma, force_single_pod, rank_mode,
  socket_master_port, master_node, max_skip_pct, ssh_timeout,
  skip_pod_check). Every key has an inline `_comment_*` doc.
- Executive-summary entry + dedicated HTML section in
  `cvs/lib/preflight/report.py` with shared vpod/ppod state, rank-mode,
  totals, and per-node failure detail (verdict errors, parsed marker
  counts, exit code, stdout snippets, and the rendered command).
- New `get_gpu_fabric_info_dict()` helper in `cvs/lib/rocm_plib.py`
  alongside the existing `amd-smi`/`rocm-smi` helpers, returning parsed
  amd-smi fabric JSON per node for other consumers.
- 40 unit tests covering the topology parser (list / gpu_data /
  keyed / mixed-vpod / plaintext / garbage payloads), the cluster
  reconcile (uniform / split / missing / mixed-local vPod), the output
  parser (passing / failing / skip-heavy / fatal-precondition / marker
  table / empty / garbage), the verdict logic (every branch including
  sentinel-missing and skip-budget WARNING), and the orchestrator
  (command rendering with sudo + ROCm path, per_node + multi_rank
  dispatch, multi_rank degradation, pod-check bypass, plaintext fabric
  fallback, no-reachable-hosts).

Documentation:
- `cvs/tests/preflight/README.md` and
  `cvs/input/config_file/preflight/README_preflight_config.md` updated
  with the new check, its precondition, the rank-mode trade-off, the
  verdict logic, and an example config block.

Refs: AIMVT-181

Made with [Cursor](https://cursor.com)

Co-authored-by: Cursor <cursoragent@cursor.com>
…er PATH to cluster env_vars [AIMVT-181]

PR #192 review feedback: the cluster file already exposes a top-level
`env_vars` dict that the parallel SSH layer exports on every host before
each command (see `cvs/input/cluster_file/README.md` and the
`env_vars=env_vars` wiring at `cvs/tests/preflight/preflight_checks.py`
around L206). Re-exposing `rocm_path` and `amd_smi_path` in the
transferbench preflight block duplicated that mechanism for a single
check and gave operators two non-orthogonal ways to point at a
non-default ROCm install.

This change removes the duplication and standardises on the cluster
file as the single cluster-wide source of truth for tool location.

Removed:
- `connectivity_check.transferbench.rocm_path` (config key)
- `connectivity_check.transferbench.amd_smi_path` (config key)
- `TransferBenchSmokeCheck(..., rocm_path=..., amd_smi_path=...)`
  constructor kwargs
- `TransferBenchSmokeCheck._rocm_env_prefix()` (private helper that
  emitted inline `PATH=<rocm>/bin:$PATH LD_LIBRARY_PATH=<rocm>/lib:...`)
- `DEFAULT_AMD_SMI_PATH` module-level constant

Kept (intentionally):
- `connectivity_check.transferbench.tb_binary` (default `"TransferBench"`).
  TransferBench is a test-specific binary -- not shared infrastructure --
  so it stays in the per-check config rather than polluting the cluster
  file with a per-test name. Defaults to PATH-resolution so a site that
  installs it on PATH via cluster `env_vars` gets zero-config behaviour;
  override here only when this single check needs to point at a different
  binary than the rest of the cluster's tooling. Same shape as the
  AIMVT-180 `afmctl_path` knob.

After this change, `TransferBenchSmokeCheck.build_command()` emits only
TransferBench-semantic env vars (NUM_ITERATIONS, ALWAYS_VALIDATE,
RUN_PARALLEL, FORCE_SINGLE_POD, optional TB_NUM_RANKS / TB_RANK /
TB_MASTER_ADDR / TB_MASTER_PORT) inside the inner `bash -c` shell. PATH
/ LD_LIBRARY_PATH come exclusively from the cluster file `env_vars`
block. `_amd_smi_fabric_command()` uses bare `amd-smi`, also
PATH-resolved on each node.

Test updates:
- Replaced the `test_build_command_respects_sudo_and_rocm_path` test
  with two regression guards:
  - `test_build_command_does_not_inject_path_or_ld_library_path` --
    asserts neither `PATH=` nor `LD_LIBRARY_PATH=` appears in the
    rendered command in either sudo or non-sudo paths.
  - `test_amd_smi_fabric_command_uses_bare_binary` -- asserts the pod
    membership query is exactly `[sudo ]amd-smi fabric --topology --json`.
- Added `test_constructor_rejects_removed_path_kwargs` so stale callers
  passing the removed kwargs fail loudly with `TypeError` instead of
  being silently accepted.

Doc updates:
- `cvs/input/config_file/preflight/preflight_config.json`: dropped the
  two keys + their `_comment_*` fields; expanded the top-level
  transferbench `_comment` to point operators at cluster file `env_vars`.
- `cvs/input/config_file/preflight/README_preflight_config.md`: dropped
  the two bullet rows; added a PATH / LD_LIBRARY_PATH note that links to
  `cvs/input/cluster_file/README.md`.
- `cvs/tests/preflight/README.md`: dropped the inline
  `PATH=<rocm>/bin... LD_LIBRARY_PATH=<rocm>/lib...` fragment from the
  command template; trimmed the example JSON block.

Verification:
- New unit tests: 43 / 43 pass (40 originals + 3 new regression guards).
- AIMVT-180 IFoE L2 regression: 20 / 20 pass.
- Full preflight unittest discovery sweep: 89 / 89 pass.
- `preflight_config.json` JSON validity: OK.
- Behaviour for clusters that previously set neither `rocm_path` nor
  `amd_smi_path`: unchanged.
- Behaviour for clusters that previously set them: same effect achieved
  by lifting the same PATH override into cluster file `env_vars` (the
  README change documents this migration).

Co-authored-by: Cursor <cursoragent@cursor.com>
@speriaswamy-amd

Addressed review feedback: drop rocm_path / amd_smi_path; defer PATH to cluster env_vars

Pushed 57c7b7b. Six files, +98 / -57.

What changed

Removed the two redundant per-check knobs from the transferbench preflight config and from TransferBenchSmokeCheck itself:

| Removed | Replaced by |
| --- | --- |
| connectivity_check.transferbench.rocm_path | env_vars in the cluster file (single cluster-wide source of truth) |
| connectivity_check.transferbench.amd_smi_path | Bare amd-smi resolved from PATH (set via cluster env_vars) |
| TransferBenchSmokeCheck.__init__(rocm_path=..., amd_smi_path=...) | (gone; TypeError if passed) |
| TransferBenchSmokeCheck._rocm_env_prefix() | (gone; build_command() no longer emits PATH= / LD_LIBRARY_PATH=) |
| DEFAULT_AMD_SMI_PATH constant | (gone) |

build_command() now emits only TransferBench-semantic env vars (NUM_ITERATIONS, ALWAYS_VALIDATE, RUN_PARALLEL, FORCE_SINGLE_POD, plus the four optional TB_* socket-comm variables in multi_rank mode: TB_NUM_RANKS, TB_RANK, TB_MASTER_ADDR, TB_MASTER_PORT). _amd_smi_fabric_command() is now exactly [sudo ]amd-smi fabric --topology --json.

What was kept (intentionally) and why

  • tb_binary stayed in the per-check config (default "TransferBench", PATH-resolved). TransferBench is a test-specific binary, not shared infrastructure: putting it in the cluster file would require every cluster file (across health, RCCL, RVS, etc.) to know about a per-preflight binary name. This has the same shape as AIMVT-180's afmctl_path knob. A site that installs TransferBench on PATH via cluster env_vars gets zero-config behaviour; override tb_binary here only when this single preflight check needs to point at a different binary than the rest of the cluster's tooling.

If you'd rather we drop tb_binary too and force PATH-resolution unconditionally, happy to do that in a follow-up — let me know.

Migration for operators who previously set the removed knobs

Same effect, lifted up one layer:

 // cluster.json
 {
   "env_vars": {
-    // (was empty)
+    "PATH": "/opt/rocm/bin:$PATH",
+    "LD_LIBRARY_PATH": "/opt/rocm/lib:$LD_LIBRARY_PATH"
   },
   ...
 }

 // preflight_config.json
 {
   "connectivity_check": {
     "transferbench": {
       "connectivity_mode": "run",
-      "rocm_path": "/opt/rocm",
-      "amd_smi_path": "amd-smi",
       "tb_binary": "TransferBench",
       ...
     }
   }
 }

This is documented in cvs/input/config_file/preflight/README_preflight_config.md and cvs/tests/preflight/README.md with a link back to cvs/input/cluster_file/README.md.

Regression guards added

To prevent the duplication from creeping back, three new unit tests:

  1. test_build_command_does_not_inject_path_or_ld_library_path — asserts that neither PATH= nor LD_LIBRARY_PATH= appears in the rendered command in either sudo or non-sudo paths.
  2. test_amd_smi_fabric_command_uses_bare_binary — asserts the pod-membership query is exactly [sudo ]amd-smi fabric --topology --json.
  3. test_constructor_rejects_removed_path_kwargs — asserts passing either removed kwarg raises TypeError so stale callers fail loudly.
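
The first and third guards could look roughly like the sketch below. `FakeSmokeCheck` is a hypothetical stand-in for `TransferBenchSmokeCheck` so that the example is self-contained; the real tests exercise the real class, and the rendered command here is simplified.

```python
import unittest


class FakeSmokeCheck:
    """Hypothetical stand-in; the real class lives in transferbench_smoke.py."""

    def __init__(self, *, tb_binary: str = "TransferBench", use_sudo: bool = False):
        # No rocm_path / amd_smi_path parameters, so passing them raises TypeError.
        self.tb_binary = tb_binary
        self.use_sudo = use_sudo

    def build_command(self) -> str:
        inner = f'NUM_ITERATIONS=1 {self.tb_binary} smoketest; echo "__TB_SMOKE_EXIT__=$?"'
        cmd = f"bash -c '{inner}'"
        return f"sudo {cmd}" if self.use_sudo else cmd


class RegressionGuards(unittest.TestCase):
    def test_no_path_injection(self):
        for use_sudo in (False, True):
            cmd = FakeSmokeCheck(use_sudo=use_sudo).build_command()
            # "PATH=" is a substring of "LD_LIBRARY_PATH=", so one check covers both.
            self.assertNotIn("PATH=", cmd)

    def test_rejects_removed_kwargs(self):
        for kwarg in ("rocm_path", "amd_smi_path"):
            with self.assertRaises(TypeError):
                FakeSmokeCheck(**{kwarg: "/opt/rocm"})
```

Keyword-only parameters with no `**kwargs` catch-all are what make the stale-caller case fail loudly here; a constructor that swallowed unknown keywords would let the removed knobs pass silently.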

Verification

| Check | Result |
| --- | --- |
| test_transferbench_smoke | 43 / 43 pass (40 originals + 3 new regression guards) |
| test_ifoe_l2_connectivity (AIMVT-180 regression) | 20 / 20 pass |
| Full preflight unittest discovery sweep | 89 / 89 pass |
| preflight_config.json JSON validity | OK |
| Behaviour for clusters that previously set neither | Unchanged |
| Behaviour for clusters that previously set them | Same effect via cluster env_vars (documented) |

Ready for another look.
