feat(dtni): add vllm_distributed CVS suite for 2-node MI300X multinode inference by atnair-amd · Pull Request #242 · ROCm/cvs

atnair-amd · 2026-06-26T21:21:03Z

Summary

Adds vllm_distributed, a new CVS validation suite for exercising vLLM multinode tensor-parallel + pipeline-parallel inference on 2-node MI300X clusters. Validated on Llama-3.1-70B-Instruct-FP8-KV with TP=8 per node × PP=2 across nodes (16 GPUs total, --distributed-executor-backend mp).

JIRA: AIMVT-247

Surfaces Changed

New Files

| File | Purpose |
| --- | --- |
| `cvs/lib/inference/vllm_distributed.py` | `VllmDistributedJob` class: orchestrates head + worker node containers, applies 5 in-container vLLM patches per run, implements `wait_ready()` with `FATAL_LOG_RE`/`EARLY_FAILURE_RE` fast-fail detection |
| `cvs/lib/inference/utils/vllm_distributed_config_loader.py` | Config loader / validator for `vllm_distributed` suite YAML/JSON configs; validates topology fields (`nnodes`, `node_rank`, `master_addr`, `master_port`), distributed executor args, and benchmark params |
| `cvs/lib/inference/unittests/test_vllm_distributed.py` | 52 unit tests + 27 subtests for `VllmDistributedJob` and config loader; all passing |
| `cvs/tests/inference/vllm_distributed/vllm_distributed.py` | pytest-based CVS test suite (316 lines) |
| `cvs/tests/inference/vllm_distributed/__init__.py` | Package init |
| `cvs/tests/inference/vllm_distributed/conftest.py` | pytest fixtures |
| `cvs/input/config_file/inference/vllm_distributed/mi300x_vllm-distributed_llama31-70b_fp8_config.json` | Run config: Llama-3.1-70B-Instruct-FP8-KV, TP=8×PP=2, ISL=1000/OSL=1000/conc=16 |
| `cvs/input/config_file/inference/vllm_distributed/mi300x_vllm-distributed_llama31-70b_fp8_threshold.json` | Threshold file for pass/fail verdict |
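To illustrate the kind of topology checks the config loader row describes, here is a minimal sketch. It is not the code in `vllm_distributed_config_loader.py`: the `Topology` dataclass, the `validate_topology` name, the specific rules, and the requirement that `master_addr` be an IP literal are all assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass

REQUIRED_FIELDS = ("nnodes", "node_rank", "master_addr", "master_port")


@dataclass(frozen=True)
class Topology:
    """Hypothetical container for the validated topology fields."""
    nnodes: int
    node_rank: int
    master_addr: str
    master_port: int


def validate_topology(raw: dict) -> Topology:
    """Validate the four topology fields; raise ValueError on any problem."""
    missing = [key for key in REQUIRED_FIELDS if key not in raw]
    if missing:
        raise ValueError(f"missing topology fields: {', '.join(missing)}")

    nnodes = int(raw["nnodes"])
    node_rank = int(raw["node_rank"])
    master_port = int(raw["master_port"])
    master_addr = str(raw["master_addr"])

    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank {node_rank} out of range for nnodes={nnodes}")
    if not 1 <= master_port <= 65535:
        raise ValueError(f"master_port {master_port} is not a valid TCP port")
    if master_addr == "<changeme>":
        raise ValueError("master_addr still holds the <changeme> placeholder")
    # Assumption for this sketch: only IP literals are accepted, not hostnames.
    ipaddress.ip_address(master_addr)  # raises ValueError if not an IP address

    return Topology(nnodes, node_rank, master_addr, master_port)
```

For the validated 2-node setup, `validate_topology({"nnodes": 2, "node_rank": 1, "master_addr": "10.245.135.15", "master_port": 29501})` would return a `Topology`, while a `node_rank` of 2 would be rejected.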

Modified Files

| File | Change |
| --- | --- |
| `cvs/lib/inference_lib.py` | Registered `vllm_distributed` framework in `_FRAMEWORK_CLASSES` |
| `cvs/core/orchestrators/container.py` | Added `openssh-server` fallback install for Docker images that ship without `sshd`; per-command timeout dict for SSH setup |
| `cvs/lib/inference/unittests/test_vllm_orch_parse.py` | Fixed threshold JSON path reference |

Key Engineering Details

In-Container Patching Strategy

Container lifetime is per_run (a fresh container on every cvs run), so all 5 patches are reapplied on each run. They are applied via build_server_cmd using a script-file approach (write each Python patch to /tmp/vllm_patchN.py, then run python3 /tmp/vllm_patchN.py), which avoids the shell quoting issues that plagued earlier one-liner approaches:

- Patch 0: Delete stale .pyc files (the Docker image ships pre-compiled .pyc from the original source; patched .py files would be shadowed without this step)
- Patch 0b: multiproc_executor.py: replace assert self.rpc_broadcast_mq is not None with a safe return for follower nodes, where rpc_broadcast_mq is None
- Patch 1: engine/core.py: guard _initialize_kv_caches() so it only runs on node_rank_within_dp == 0; followers get a stub KVCacheConfig
- Patch 2: engine/core.py: stub Scheduler() for follower nodes (PP rank > 0 nodes don't schedule requests)
- Patch 3: engine/core.py: fix get_supported_tasks() to return the ("generate",) string literal for followers (SupportedTask is a Literal, not an Enum)
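The script-file approach described above can be sketched as follows. This is only an illustration, not the body of `build_server_cmd`: the `build_patch_step` helper, the heredoc delimiter, and the use of a quoted heredoc to write the file are assumptions; the PR only states that each patch is written to /tmp/vllm_patchN.py and run with python3.

```python
import shlex

HEREDOC_DELIMITER = "VLLM_PATCH_EOF"  # hypothetical marker, chosen for this sketch


def build_patch_step(index: str, patch_source: str, tmp_dir: str = "/tmp") -> str:
    """Return shell text that writes one patch script to disk, then runs it."""
    if HEREDOC_DELIMITER in patch_source:
        raise ValueError("patch source must not contain the heredoc delimiter")
    path = f"{tmp_dir}/vllm_patch{index}.py"
    # A quoted delimiter makes the shell copy the body verbatim: no variable
    # expansion and no quote processing, which is what sidesteps the quoting
    # problems of `python3 -c '...'` one-liners.
    return (
        f"cat > {shlex.quote(path)} <<'{HEREDOC_DELIMITER}'\n"
        f"{patch_source.rstrip()}\n"
        f"{HEREDOC_DELIMITER}\n"
        f"python3 {shlex.quote(path)}"
    )


def build_patch_prelude(patches: dict) -> str:
    """Join all patch steps in insertion order (e.g. keys "0", "0b", "1", ...)."""
    return "\n".join(build_patch_step(idx, src) for idx, src in patches.items())
```

A patch body containing both quote styles and a `$` sign, such as `print("it's $HOME")`, passes through unchanged because the shell never interprets it.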

Fast-Fail Detection

wait_ready() runs two pre-poll checks:

- Pre-check (after initial sleep): tail -30 of the server log is scanned against EARLY_FAILURE_RE, which catches immediate boot failures
- Post-warmup (after warmup sleep): grep -m1 -iE FATAL_LOG_RE, which catches OOM ("Free memory on device less than desired"), engine init failures, and RuntimeErrors before entering the polling loop
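The two checks above can be sketched in pure Python. The regex bodies below are illustrative guesses, not the actual `EARLY_FAILURE_RE` and `FATAL_LOG_RE` defined in the PR, and the helper names are hypothetical; the real implementation runs tail and grep against the log inside the container.

```python
import re
from typing import Optional

# Illustrative patterns only; the real expressions in the PR may differ.
EARLY_FAILURE_RE = re.compile(
    r"Traceback \(most recent call last\)|command not found", re.IGNORECASE
)
FATAL_LOG_RE = re.compile(
    r"free memory on device.*less than desired|engine.*init.*failed|RuntimeError",
    re.IGNORECASE,
)


def early_failure_in_tail(log_text: str, tail_lines: int = 30) -> Optional[str]:
    """Mimic `tail -30 | grep`: scan only the last `tail_lines` lines."""
    for line in log_text.splitlines()[-tail_lines:]:
        if EARLY_FAILURE_RE.search(line):
            return line
    return None


def first_fatal_line(log_text: str) -> Optional[str]:
    """Mimic `grep -m1 -iE`: return the first matching line in the whole log."""
    for line in log_text.splitlines():
        if FATAL_LOG_RE.search(line):
            return line
    return None
```

Returning the matching line rather than a boolean lets the caller put the offending log line directly into the failure message.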

Network / GPU Topology

- GLOO_SOCKET_IFNAME=enp159s0np0 for inter-node gloo communication
- --master-addr 10.245.135.15 --master-port 29501
- enforce-eager: true (disables CUDA graph capture; required for this vLLM version on multi-node)
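As a rough sketch of how these settings combine into a per-node launch, the helper below assembles the distributed arguments and environment. The function names are hypothetical, and the `--node-rank` and `--enforce-eager` flag spellings are assumptions inferred from the `node_rank` and `enforce-eager` config fields; the other flags are the ones listed in this PR.

```python
def build_distributed_args(
    nnodes: int,
    node_rank: int,
    master_addr: str,
    master_port: int,
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    enforce_eager: bool = True,
) -> list:
    """Assemble the multinode vLLM server arguments for one node."""
    args = [
        "--distributed-executor-backend", "mp",
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),  # assumed flag spelling
        "--master-addr", master_addr,
        "--master-port", str(master_port),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--pipeline-parallel-size", str(pipeline_parallel_size),
    ]
    if enforce_eager:
        args.append("--enforce-eager")  # assumed flag spelling
    return args


def build_node_env(socket_ifname: str) -> dict:
    """Environment for inter-node gloo traffic on the given NIC."""
    return {"GLOO_SOCKET_IFNAME": socket_ifname}


def total_gpus(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """TP ranks per pipeline stage times the number of stages."""
    return tensor_parallel_size * pipeline_parallel_size
```

With TP=8 and PP=2, `total_gpus(8, 2)` gives the 16 GPUs quoted in the summary; the head node would use `node_rank=0` and the worker `node_rank=1`, with the same `master_addr` and `master_port` on both.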

Validation

Node validation: v7a7 CVS run on 10.245.135.15 (head, node_rank=0) + 10.245.135.115 (worker, node_rank=1)
- Both nodes: Application startup complete
- Distributed args confirmed: --distributed-executor-backend mp, --nnodes 2, --tensor-parallel-size 8, --pipeline-parallel-size 2
- Image: rocm/ufb-private:vllm-0.23.1rc0-ubuntu24.04-py3.12-nightlies-device-all-cdna-rocm7.14.0a20260624-92221485a
- vLLM version: 0.23.1rc1.dev436+g92221485a.d20260625
Unit tests: 524 tests pass (make test), including 52 new test_vllm_distributed.py tests
Lint/format: make fmt-check && make lint clean (10.00/10 pylint, ruff pass)

…e inference

Introduces vllm_distributed, a new CVS inference validation framework for 2-node MI300X clusters running vLLM with tensor parallelism (TP=8) and pipeline parallelism (PP=2) across 16 GPUs total via the multiprocessing distributed executor backend.

New files:

cvs/lib/inference/vllm_distributed.py
  VllmDistributedJob class:
  - build_server_cmd applies 5 in-container patches per run to fix upstream vLLM bugs in the rocm/ufb-private nightlies image:
    - Patch 0: delete stale multiproc_executor.pyc and core.pyc
    - Patch 0b: replace assert in multiproc_executor.py:collective_rpc (rpc_broadcast_mq is None on PP follower nodes); return safe default instead of crashing
    - Patch 1: guard _initialize_kv_caches() for follower nodes; use dummy KVCacheConfig(num_blocks=1) to skip collective_rpc
    - Patch 2: stub Scheduler() with _F on follower nodes to skip KVCacheManager/HybridKVCacheCoordinator assert
    - Patch 3: fix get_supported_tasks() to return ("generate",) for follower nodes (SupportedTask is Literal, not Enum)
  - is_ready() / wait_ready(): per-poll readiness with fatal-log detection
  - run_client(): bench serve head-only via exec_on_head
  - postcheck(): validates server log, client log, result file
  - collect_logs(): zips node logs and HTML artifacts

cvs/lib/inference/utils/vllm_distributed_config_loader.py
  config schema
cvs/lib/inference/unittests/test_vllm_distributed.py
  52 unit tests
cvs/tests/inference/vllm_distributed/
  pytest suite
cvs/input/config_file/inference/vllm_distributed/
  config + thresholds

Modified files:

cvs/core/orchestrators/container.py
  openssh-server fallback install for images without sshd; per-cmd timeout
cvs/lib/inference_lib.py
  register vllm_distributed framework
cvs/lib/inference/unittests/test_vllm_orch_parse.py
  fix threshold JSON path

Validated on 10.245.135.15 (g21u43, head) + 10.245.135.115 (h16u07, worker) with amd/Llama-3.1-70B-Instruct-FP8-KV, ISL=1000 OSL=1000 concurrency=16.

Signed-off-by: Atul Nair <Atul.Nair@amd.com>

Signed-off-by: Atul Nair <Atul.Nair@amd.com>

- Revert cvs/core/orchestrators/container.py: the openssh-server fallback install should not be in core; the ufb-private image already ships sshd (confirmed by v7a7 validation pass)
- Replace VllmDistributedJob alias with direct use: test suite imported VllmDistributedJob as VllmJob; now uses the class name directly
- Scrub personal references from config: threshold_json absolute path, master_addr IP, and GLOO/TP/NCCL_SOCKET_IFNAME NIC name replaced with <changeme> placeholders
- Remove VllmDistributedJob from InferenceJobFactory registry: VllmDistributedJob's constructor (orch, variant, ...) is incompatible with create_job's calling convention (c_phdl, s_phdl, ...) so the entry was unreachable dead code

atnair-amd added 2 commits June 26, 2026 17:11

style: apply ruff formatting to vllm_distributed suite files

5013ee5

Signed-off-by: Atul Nair <Atul.Nair@amd.com>

atnair-amd self-assigned this Jun 26, 2026

atnair-amd wants to merge 3 commits into dev/dtni from atnair/vllm-distributed
