feat(dtni): add vllm_distributed CVS suite for 2-node MI300X multinode inference #242
Open
atnair-amd wants to merge 3 commits into
Introduces vllm_distributed, a new CVS inference validation framework for
2-node MI300X clusters running vLLM with tensor parallelism (TP=8) and
pipeline parallelism (PP=2) across 16 GPUs total via the multiprocessing
distributed executor backend.
New files:
  cvs/lib/inference/vllm_distributed.py  VllmDistributedJob class:
    - build_server_cmd applies 5 in-container patches per run to fix upstream
      vLLM bugs in the rocm/ufb-private nightlies image:
        Patch 0:  delete stale multiproc_executor.pyc and core.pyc
        Patch 0b: replace assert in multiproc_executor.py:collective_rpc
                  (rpc_broadcast_mq is None on PP follower nodes); return
                  safe default instead of crashing
        Patch 1:  guard _initialize_kv_caches() for follower nodes; use
                  dummy KVCacheConfig(num_blocks=1) to skip collective_rpc
        Patch 2:  stub Scheduler() with _F on follower nodes to skip
                  KVCacheManager/HybridKVCacheCoordinator assert
        Patch 3:  fix get_supported_tasks() to return ("generate",) for
                  follower nodes (SupportedTask is Literal, not Enum)
    - is_ready() / wait_ready(): per-poll readiness with fatal-log detection
    - run_client(): bench serve head-only via exec_on_head
    - postcheck(): validates server log, client log, result file
    - collect_logs(): zips node logs and HTML artifacts
  cvs/lib/inference/utils/vllm_distributed_config_loader.py  config schema
  cvs/lib/inference/unittests/test_vllm_distributed.py  52 unit tests
  cvs/tests/inference/vllm_distributed/  pytest suite
  cvs/input/config_file/inference/vllm_distributed/  config + thresholds
Modified files:
  cvs/core/orchestrators/container.py  openssh-server fallback install for
    images without sshd; per-cmd timeout
  cvs/lib/inference_lib.py  register vllm_distributed framework
  cvs/lib/inference/unittests/test_vllm_orch_parse.py  fix threshold JSON path
Validated on 10.245.135.15 (g21u43, head) + 10.245.135.115 (h16u07, worker)
with amd/Llama-3.1-70B-Instruct-FP8-KV, ISL=1000 OSL=1000 concurrency=16.
Signed-off-by: Atul Nair <Atul.Nair@amd.com>
Signed-off-by: Atul Nair <Atul.Nair@amd.com>
- Revert cvs/core/orchestrators/container.py: the openssh-server fallback install should not be in core; the ufb-private image already ships sshd (confirmed by v7a7 validation pass)
- Replace VllmDistributedJob alias with direct use: test suite imported VllmDistributedJob as VllmJob; now uses the class name directly
- Scrub personal references from config: threshold_json absolute path, master_addr IP, and GLOO/TP/NCCL_SOCKET_IFNAME NIC name replaced with <changeme> placeholders
- Remove VllmDistributedJob from InferenceJobFactory registry: VllmDistributedJob's constructor (orch, variant, ...) is incompatible with create_job's calling convention (c_phdl, s_phdl, ...), so the entry was unreachable dead code
Summary
Adds vllm_distributed, a new CVS validation suite for exercising vLLM multinode tensor-parallel + pipeline-parallel inference on 2-node MI300X clusters. Validated on Llama-3.1-70B-Instruct-FP8-KV with TP=8 per node × PP=2 across nodes (16 GPUs total, --distributed-executor-backend mp).

JIRA: AIMVT-247
Surfaces Changed
New Files
- cvs/lib/inference/vllm_distributed.py: VllmDistributedJob class; orchestrates head + worker node containers, applies 5 in-container vLLM patches per run, implements wait_ready() with FATAL_LOG_RE/EARLY_FAILURE_RE fast-fail detection
- cvs/lib/inference/utils/vllm_distributed_config_loader.py: config loader for vllm_distributed suite YAML/JSON configs; validates topology fields (nnodes, node_rank, master_addr, master_port), distributed executor args, and benchmark params
- cvs/lib/inference/unittests/test_vllm_distributed.py: 52 unit tests for VllmDistributedJob and config loader; all passing
- cvs/tests/inference/vllm_distributed/vllm_distributed.py
- cvs/tests/inference/vllm_distributed/__init__.py
- cvs/tests/inference/vllm_distributed/conftest.py
- cvs/input/config_file/inference/vllm_distributed/mi300x_vllm-distributed_llama31-70b_fp8_config.json
- cvs/input/config_file/inference/vllm_distributed/mi300x_vllm-distributed_llama31-70b_fp8_threshold.json

Modified Files
- cvs/lib/inference_lib.py: registers the vllm_distributed framework in _FRAMEWORK_CLASSES
- cvs/core/orchestrators/container.py: openssh-server fallback install for Docker images that ship without sshd; per-command timeout dict for SSH setup
- cvs/lib/inference/unittests/test_vllm_orch_parse.py: fixes threshold JSON path

Key Engineering Details
In-Container Patching Strategy
Container lifetime is per_run (fresh container on every cvs run). All 5 patches are applied via build_server_cmd using a script-file approach (write the Python patch to /tmp/vllm_patchN.py, then run python3 /tmp/vllm_patchN.py), which avoids the shell quoting issues that plagued earlier one-liner approaches:
- Patch 0: delete stale .pyc files (the Docker image ships pre-compiled .pyc from the original source; patched .py files would be shadowed without this step)
- Patch 0b: multiproc_executor.py: replace assert self.rpc_broadcast_mq is not None with a safe return for follower nodes, where rpc_broadcast_mq is None
- Patch 1: engine/core.py: guard _initialize_kv_caches() so it only runs on node_rank_within_dp == 0; followers get a stub KVCacheConfig
- Patch 2: engine/core.py: stub Scheduler() for follower nodes (PP rank > 0 nodes don't schedule requests)
- Patch 3: engine/core.py: fix get_supported_tasks() to return the ("generate",) string literal for followers (SupportedTask is a Literal, not an Enum)

Fast-Fail Detection
wait_ready() runs two pre-poll checks before entering the polling loop:
- tail -30 of the server log scanned against EARLY_FAILURE_RE: catches immediate boot failures
- grep -m1 -iE FATAL_LOG_RE: catches OOM ("Free memory on device less than desired"), engine init failures, and RuntimeErrors

Network / GPU Topology
- GLOO_SOCKET_IFNAME=enp159s0np0 for inter-node gloo communication
- --master-addr 10.245.135.15 --master-port 29501
- enforce-eager: true (disables CUDA graph capture; required for this vLLM version on multi-node)

Validation
- Nodes: 10.245.135.15 (head, node_rank=0) + 10.245.135.115 (worker, node_rank=1)
- Application startup complete
- Flags: --distributed-executor-backend mp, --nnodes 2, --tensor-parallel-size 8, --pipeline-parallel-size 2
- Image: rocm/ufb-private:vllm-0.23.1rc0-ubuntu24.04-py3.12-nightlies-device-all-cdna-rocm7.14.0a20260624-92221485a
- vLLM version: 0.23.1rc1.dev436+g92221485a.d20260625
- Unit tests (make test), including 52 new test_vllm_distributed.py tests
- make fmt-check && make lint clean (10.00/10 pylint, ruff pass)
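
For readers who want a concrete picture of the two pre-poll checks described under Fast-Fail Detection, here is a minimal sketch. It is not the code from this PR: the function name find_fast_fail and both regex bodies are hypothetical stand-ins (the real EARLY_FAILURE_RE and FATAL_LOG_RE patterns are not shown on this page), and it works on an in-memory log string rather than running tail and grep inside the container.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; the PR's real
# EARLY_FAILURE_RE / FATAL_LOG_RE definitions are not shown here.
EARLY_FAILURE_RE = re.compile(
    r"Traceback \(most recent call last\)|No such file or directory",
    re.IGNORECASE,
)
FATAL_LOG_RE = re.compile(
    r"Free memory on device.*less than desired|RuntimeError",
    re.IGNORECASE,
)


def find_fast_fail(log_text: str, tail_lines: int = 30) -> Optional[str]:
    """Return the first log line that signals a fatal startup error, else None."""
    lines = log_text.splitlines()

    # Check 1: scan only the last `tail_lines` lines, mirroring `tail -30`.
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    for line in tail:
        if EARLY_FAILURE_RE.search(line):
            return line

    # Check 2: first case-insensitive match anywhere in the log,
    # mirroring `grep -m1 -iE`.
    for line in lines:
        if FATAL_LOG_RE.search(line):
            return line

    return None
```

A caller could run this once before the readiness polling loop and abort immediately when it returns a line, instead of waiting for the full readiness timeout.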