feat(dtni): add vllm_distributed CVS suite for 2-node MI300X multinode inference#242

Open
atnair-amd wants to merge 3 commits into dev/dtni from atnair/vllm-distributed

@atnair-amd

Collaborator

Summary

Adds vllm_distributed, a new CVS validation suite for exercising vLLM multinode tensor-parallel + pipeline-parallel inference on 2-node MI300X clusters. Validated on Llama-3.1-70B-Instruct-FP8-KV with TP=8 per node × PP=2 across nodes (16 GPUs total, --distributed-executor-backend mp).

JIRA: AIMVT-247


Surfaces Changed

New Files

| File | Purpose |
| --- | --- |
| `cvs/lib/inference/vllm_distributed.py` | VllmDistributedJob class: orchestrates head + worker node containers, applies 5 in-container vLLM patches per run, implements wait_ready() with FATAL_LOG_RE/EARLY_FAILURE_RE fast-fail detection |
| `cvs/lib/inference/utils/vllm_distributed_config_loader.py` | Config loader / validator for vllm_distributed suite YAML/JSON configs; validates topology fields (nnodes, node_rank, master_addr, master_port), distributed executor args, and benchmark params |
| `cvs/lib/inference/unittests/test_vllm_distributed.py` | 52 unit tests + 27 subtests for VllmDistributedJob and config loader; all passing |
| `cvs/tests/inference/vllm_distributed/vllm_distributed.py` | pytest-based CVS test suite (316 lines) |
| `cvs/tests/inference/vllm_distributed/__init__.py` | Package init |
| `cvs/tests/inference/vllm_distributed/conftest.py` | pytest fixtures |
| `cvs/input/config_file/inference/vllm_distributed/mi300x_vllm-distributed_llama31-70b_fp8_config.json` | Run config: Llama-3.1-70B-Instruct-FP8-KV, TP=8×PP=2, ISL=1000/OSL=1000/conc=16 |
| `cvs/input/config_file/inference/vllm_distributed/mi300x_vllm-distributed_llama31-70b_fp8_threshold.json` | Threshold file for pass/fail verdict |
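
To illustrate the kind of topology validation the config loader is described as doing, here is a minimal sketch. It is not the code in vllm_distributed_config_loader.py: the function name `validate_topology`, the flat dict layout, and the requirement that master_addr be a literal IP address are all assumptions made for the example.

```python
import ipaddress


def validate_topology(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid.

    Hypothetical helper: the field names come from the PR description,
    the checks themselves are assumptions.
    """
    errors = []

    nnodes = cfg.get("nnodes")
    if not isinstance(nnodes, int) or nnodes < 1:
        errors.append("nnodes must be a positive integer")

    node_rank = cfg.get("node_rank")
    if not isinstance(node_rank, int) or node_rank < 0:
        errors.append("node_rank must be a non-negative integer")
    elif isinstance(nnodes, int) and nnodes >= 1 and node_rank >= nnodes:
        errors.append("node_rank must be smaller than nnodes")

    # Assumption: master_addr is an IP literal, so an unfilled
    # placeholder string is rejected here.
    try:
        ipaddress.ip_address(str(cfg.get("master_addr")))
    except ValueError:
        errors.append("master_addr must be a valid IP address")

    port = cfg.get("master_port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append("master_port must be an integer in 1..65535")

    return errors
```

Collecting every problem instead of raising on the first one lets a single run report all misconfigured fields at once.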

Modified Files

| File | Change |
| --- | --- |
| `cvs/lib/inference_lib.py` | Registered vllm_distributed framework in _FRAMEWORK_CLASSES |
| `cvs/core/orchestrators/container.py` | Added openssh-server fallback install for Docker images that ship without sshd; per-command timeout dict for SSH setup |
| `cvs/lib/inference/unittests/test_vllm_orch_parse.py` | Fixed threshold JSON path reference |

Key Engineering Details

In-Container Patching Strategy

Container lifetime is per_run (a fresh container on every cvs run), so the patches must be reapplied each time. All 5 patches are applied via build_server_cmd using a script-file approach: each Python patch is written to /tmp/vllm_patchN.py and then run with python3 /tmp/vllm_patchN.py. This avoids the shell-quoting issues that plagued earlier one-liner approaches:

  • Patch 0: Delete stale .pyc files (Docker image ships pre-compiled .pyc from original source; patched .py files would be shadowed without this step)
  • Patch 0b: multiproc_executor.py — replace assert self.rpc_broadcast_mq is not None with safe return for follower nodes where rpc_broadcast_mq is None
  • Patch 1: engine/core.py — guard _initialize_kv_caches() so it only runs on node_rank_within_dp == 0; followers get a stub KVCacheConfig
  • Patch 2: engine/core.py — stub Scheduler() for follower nodes (PP rank > 0 nodes don't schedule requests)
  • Patch 3: engine/core.py — fix get_supported_tasks() to return ("generate",) string literal for followers (SupportedTask is a Literal, not an Enum)
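
The script-file approach can be sketched as follows. This is an illustration under stated assumptions, not the code in build_server_cmd: the helper names `patch_script_cmd` and `replace_once`, the quoted-heredoc mechanism, and the idempotency check are all hypothetical.

```python
import shlex


def patch_script_cmd(index: str, script_body: str) -> str:
    """Build a shell fragment that writes a patch script to /tmp and runs it."""
    path = f"/tmp/vllm_patch{index}.py"
    # A quoted heredoc delimiter stops the shell from expanding $, backticks,
    # or quotes inside the Python source, so the patch needs no escaping.
    return (
        f"cat > {shlex.quote(path)} <<'PATCH_EOF'\n"
        f"{script_body.rstrip()}\n"
        "PATCH_EOF\n"
        f"python3 {shlex.quote(path)}"
    )


def replace_once(source: str, old: str, new: str) -> str:
    """Replace one occurrence of old with new; a re-run is a no-op."""
    if new in source:
        return source  # already patched
    if old not in source:
        # Fail loudly: the upstream file no longer matches the patch target.
        raise ValueError("patch target not found; upstream source may have changed")
    return source.replace(old, new, 1)
```

Failing when the target string is missing is a deliberate choice in this sketch: a nightly image whose source has drifted then surfaces as a patch error instead of a confusing runtime failure later.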

Fast-Fail Detection

wait_ready() runs two pre-poll checks:

  1. Pre-check (after initial sleep): tail -30 of server log scanned against EARLY_FAILURE_RE — catches immediate boot failures
  2. Post-warmup (after warmup sleep): grep -m1 -iE FATAL_LOG_RE — catches OOM ("Free memory on device less than desired"), engine init failures, and RuntimeErrors before entering the polling loop
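
A rough sketch of this kind of log scan is shown below. The pattern is hypothetical: only the OOM message fragment and the mention of RuntimeErrors come from this description, and the real FATAL_LOG_RE and EARLY_FAILURE_RE in vllm_distributed.py may differ. The helper name `first_fatal_line` is also an assumption.

```python
import re
from typing import Optional

# Hypothetical pattern, case-insensitive like `grep -i`.
FATAL_LOG_RE = re.compile(
    r"Free memory on device.*less than desired|RuntimeError",
    re.IGNORECASE,
)


def first_fatal_line(
    log_text: str,
    pattern: "re.Pattern[str]" = FATAL_LOG_RE,
    tail: Optional[int] = None,
) -> Optional[str]:
    """Return the first matching log line, or None if the log looks healthy.

    With tail=N only the last N lines are scanned (like `tail -N`);
    without it the whole log is scanned and the scan stops at the
    first hit (like `grep -m1`).
    """
    lines = log_text.splitlines()
    if tail:
        lines = lines[-tail:]
    for line in lines:
        if pattern.search(line):
            return line
    return None
```

Returning the matching line, not just a boolean, lets the caller put the exact failure text into the test report before aborting the poll.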

Network / GPU Topology

  • GLOO_SOCKET_IFNAME=enp159s0np0 for inter-node gloo communication
  • --master-addr 10.245.135.15 --master-port 29501
  • enforce-eager: true (disables CUDA graph capture; required for this vLLM version on multi-node)
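
As a sketch of how the topology settings map onto server arguments, the helper below assembles the distributed flags quoted in this description and checks that TP × PP matches the GPU count. The function name `build_distributed_args`, the `gpus_per_node` default, and the `--node-rank` flag spelling are assumptions; the description names node_rank only as a config field.

```python
def build_distributed_args(
    tp: int,
    pp: int,
    nnodes: int,
    node_rank: int,
    master_addr: str,
    master_port: int,
    gpus_per_node: int = 8,  # assumption: 8 GPUs per MI300X node
) -> list[str]:
    """Assemble the distributed CLI arguments for one node (hypothetical helper)."""
    world_size = tp * pp
    if world_size != nnodes * gpus_per_node:
        raise ValueError(
            f"TP*PP={world_size} does not match "
            f"{nnodes} nodes x {gpus_per_node} GPUs"
        )
    return [
        "--distributed-executor-backend", "mp",
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),  # assumed flag spelling
        "--tensor-parallel-size", str(tp),
        "--pipeline-parallel-size", str(pp),
        "--master-addr", master_addr,
        "--master-port", str(master_port),
        "--enforce-eager",
    ]
```

For the validated configuration (TP=8, PP=2, 2 nodes) the check passes with a world size of 16. A mismatched combination such as TP=8, PP=1 on 2 nodes is rejected before any container is started.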

Validation

  • Node validation: v7a7 CVS run on 10.245.135.15 (head, node_rank=0) + 10.245.135.115 (worker, node_rank=1)
    • Both nodes: Application startup complete
    • Distributed args confirmed: --distributed-executor-backend mp, --nnodes 2, --tensor-parallel-size 8, --pipeline-parallel-size 2
    • Image: rocm/ufb-private:vllm-0.23.1rc0-ubuntu24.04-py3.12-nightlies-device-all-cdna-rocm7.14.0a20260624-92221485a
    • vLLM version: 0.23.1rc1.dev436+g92221485a.d20260625
  • Unit tests: 524 tests pass (make test), including 52 new test_vllm_distributed.py tests
  • Lint/format: make fmt-check && make lint clean (10.00/10 pylint, ruff pass)

…e inference

Introduces vllm_distributed, a new CVS inference validation framework for
2-node MI300X clusters running vLLM with tensor parallelism (TP=8) and
pipeline parallelism (PP=2) across 16 GPUs total via the multiprocessing
distributed executor backend.

New files:
  cvs/lib/inference/vllm_distributed.py          VllmDistributedJob class:
    - build_server_cmd applies 5 in-container patches per run to fix upstream
      vLLM bugs in the rocm/ufb-private nightlies image:
        Patch 0:  delete stale multiproc_executor.pyc and core.pyc
        Patch 0b: replace assert in multiproc_executor.py:collective_rpc
                  (rpc_broadcast_mq is None on PP follower nodes); return
                  safe default instead of crashing
        Patch 1:  guard _initialize_kv_caches() for follower nodes; use
                  dummy KVCacheConfig(num_blocks=1) to skip collective_rpc
        Patch 2:  stub Scheduler() with _F on follower nodes to skip
                  KVCacheManager/HybridKVCacheCoordinator assert
        Patch 3:  fix get_supported_tasks() to return ("generate",) for
                  follower nodes (SupportedTask is Literal, not Enum)
    - is_ready() / wait_ready(): per-poll readiness with fatal-log detection
    - run_client(): bench serve head-only via exec_on_head
    - postcheck(): validates server log, client log, result file
    - collect_logs(): zips node logs and HTML artifacts
  cvs/lib/inference/utils/vllm_distributed_config_loader.py  config schema
  cvs/lib/inference/unittests/test_vllm_distributed.py        52 unit tests
  cvs/tests/inference/vllm_distributed/                       pytest suite
  cvs/input/config_file/inference/vllm_distributed/           config + thresholds

Modified files:
  cvs/core/orchestrators/container.py    openssh-server fallback install for
                                         images without sshd; per-cmd timeout
  cvs/lib/inference_lib.py               register vllm_distributed framework
  cvs/lib/inference/unittests/test_vllm_orch_parse.py  fix threshold JSON path

Validated on 10.245.135.15 (g21u43, head) + 10.245.135.115 (h16u07, worker)
with amd/Llama-3.1-70B-Instruct-FP8-KV, ISL=1000 OSL=1000 concurrency=16.

Signed-off-by: Atul Nair <Atul.Nair@amd.com>
atnair-amd self-assigned this on Jun 26, 2026
- Revert cvs/core/orchestrators/container.py: the openssh-server
  fallback install should not be in core; the ufb-private image already
  ships sshd (confirmed by v7a7 validation pass)
- Replace VllmDistributedJob alias with direct use: test suite imported
  VllmDistributedJob as VllmJob; now uses the class name directly
- Scrub personal references from config: threshold_json absolute path,
  master_addr IP, and GLOO/TP/NCCL_SOCKET_IFNAME NIC name replaced
  with <changeme> placeholders
- Remove VllmDistributedJob from InferenceJobFactory registry:
  VllmDistributedJob's constructor (orch, variant, ...) is incompatible
  with create_job's calling convention (c_phdl, s_phdl, ...) so the
  entry was unreachable dead code