[CVS] GPU metrics polling integration for inference validation suites#241
Open
atnair-amd wants to merge 7 commits into
- gpu.py: parse_gpu_metrics, capture_gpu_metrics, _mean, agg_readings, poll_gpu_metrics
- VllmJob.is_client_done(): non-raising completion predicate
- vllm_single test: poll GPU while client runs, write gpu_poll.log, derive 5 metrics
- _shared.py: Peak VRAM / Compute % / BW % columns in results table
- test_gpu.py: TestMean, TestAggReadings, TestPollGpuMetrics unit test classes
- threshold JSON: gpu.* placeholder SLO entries for all 5 cells
- test_vllm_orch_parse: update threshold path + exclude gpu.* from client key guard
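The commit above mentions a `_mean` helper and an `agg_readings` step but the page does not show their bodies. The following is only a minimal sketch of what a None-tolerant mean and a peak/mean aggregation could look like. The names `mean_or_none` and `aggregate`, and the reading keys `mem_used_mb` and `compute_pct`, are hypothetical and not taken from gpu.py.

```python
from typing import Dict, List, Optional, Sequence


def mean_or_none(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average the non-None values; return None when nothing usable remains."""
    usable = [v for v in values if v is not None]
    return sum(usable) / len(usable) if usable else None


def aggregate(readings: List[Dict[str, Optional[float]]]) -> Dict[str, Optional[float]]:
    """Reduce per-poll readings to a peak memory value and a mean utilisation.

    The keys used here are illustrative assumptions, not the real schema.
    """
    mem = [r["mem_used_mb"] for r in readings if r.get("mem_used_mb") is not None]
    return {
        "peak_mem_mb": max(mem) if mem else None,
        "mean_compute_pct": mean_or_none([r.get("compute_pct") for r in readings]),
    }
```

Returning None rather than 0 for an empty series keeps "no data" distinguishable from "idle GPU" when the value is later shown in a report or compared against a threshold.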
The fixture was referenced in test_vllm_inference's parameter list but never defined, causing a setup Error before any inference ran.
amd-smi is a host-side tool — running it via orch.exec() sends it into the container where it doesn't exist. Switch capture_gpu_metrics to orch.exec_on_head() so the command runs on the bare-metal node. Also ensure the out_dir exists before poll_gpu_metrics attempts to write gpu_poll.log, since the directory is created lazily by the job setup. Update unit test mocks from exec to exec_on_head to match.
…c_on_head
out_dir is an NFS path on the node, not mounted on the devbox. Write the log to a local tempdir, then base64-encode it and push it to the node via exec_on_head so it lands in the bundle.
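The base64 push described above can be sketched as building a single shell command that decodes the payload on the remote side. This is an illustration only: `build_push_command` is a hypothetical helper, and how the resulting string is handed to `exec_on_head` is not shown on this page.

```python
import base64
import shlex


def build_push_command(payload: bytes, remote_path: str) -> str:
    """Return a shell command that recreates `payload` at `remote_path`.

    Base64 keeps the payload free of characters the shell would interpret,
    and shlex.quote protects a destination path that contains spaces.
    """
    encoded = base64.b64encode(payload).decode("ascii")
    return f"echo {shlex.quote(encoded)} | base64 -d > {shlex.quote(remote_path)}"
```

Because the whole payload travels on the command line, this approach suits small logs; a very large file could exceed the operating system's argument-length limit.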
…rank
Move import time/logging/pathlib from inside poll_gpu_metrics body to module top-level. Add test_gpu_metric at rank 4 in conftest sort table so it runs before test_teardown, not after.
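The ordering problem fixed above can be illustrated with a small rank-table sort. This sketch assumes that names missing from the table sort last, which matches the described symptom. Only the rank 4 for test_gpu_metric comes from the commit message; the other rank values and the `sort_by_rank` helper are hypothetical.

```python
from typing import Dict, List


def sort_by_rank(names: List[str], rank: Dict[str, int]) -> List[str]:
    """Order test names by a rank table; names without an entry sort last."""
    fallback = max(rank.values()) + 1
    # Strip any parametrize suffix such as "[cell0]" before the lookup.
    return sorted(names, key=lambda n: rank.get(n.split("[", 1)[0], fallback))


collected = ["test_teardown", "test_gpu_metric[c0]", "test_vllm_inference[c0]"]

# Without an entry, test_gpu_metric falls behind test_teardown.
before_fix = sort_by_rank(collected, {"test_vllm_inference": 3, "test_teardown": 5})

# With the rank 4 entry it runs between inference and teardown.
after_fix = sort_by_rank(
    collected,
    {"test_vllm_inference": 3, "test_gpu_metric": 4, "test_teardown": 5},
)
```

In a real conftest the same key function would typically be applied to the collected items inside `pytest_collection_modifyitems`, sorting the list in place.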
Add gpu.py API reference to cvs/lib/utils/AGENTS.md: public symbols, poll_gpu_metrics parameter table, 5-metric derivation table, required conftest fixtures (gpu_metrics_snap), two wiring patterns (sync poll / threaded poll), pytest_generate_tests parametrize branch, collection sort rank table, and gotchas (threshold key prefix, capture can raise, or-None semantics, full actuals for evaluate_all, GATED_METRICS). Add cvs/lib/utils/docs/gpu-metrics.md: user-facing integration guide covering the 5 derived metrics, polling lifecycle, 5-step integration walkthrough, gpu_poll.log format, failure/None handling table, and cross-references to ADDING_A_SUITE.md and threshold-kinds.md.
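The synchronous wiring pattern mentioned above (poll while a backgrounded client runs, stop when a non-raising predicate reports completion) might look roughly like the sketch below. It is not the real `poll_gpu_metrics`: the function name, its parameters, and the choice to record None for a failed capture are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Optional


def poll_until_done(
    capture: Callable[[], dict],
    is_done: Callable[[], bool],
    interval_s: float = 1.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Optional[dict]]:
    """Collect readings until is_done() returns True or max_polls is reached."""
    readings: List[Optional[dict]] = []
    for _ in range(max_polls):
        if is_done():
            break
        try:
            readings.append(capture())
        except Exception:
            # Assumption: a failed capture is recorded as None, not re-raised,
            # so one bad sample does not abort the whole benchmark run.
            readings.append(None)
        sleep(interval_s)
    return readings
```

Injecting `sleep` as a parameter lets unit tests run the loop without real delays, and `max_polls` bounds the loop if the completion predicate never becomes true.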
…in zip bundle
Previously the log was written to a tempfile then uploaded to the NFS out_dir; because the zip plugin only bundles the local html report directory, the log never appeared in the run archive. Now it is written directly into the _test_html_dir folder (e.g. vllm_single_html/) so every run archive contains the poll log alongside the per-test HTML files. The NFS upload is kept for cluster-side access. Update gpu-metrics.md integration guide to match the correct log_path pattern and to describe where the log lands.
Jira
AIMVT-245 — GPU metrics polling integration for inference validation suites (DTNI epic AIMVT-202)
Background
CVS inference validation suites (e.g. vllm_single) had no visibility into GPU-level resource utilisation during benchmark runs. There was no record of peak VRAM consumption, memory delta from model load, GPU compute activity, or memory bandwidth utilisation collected alongside throughput and latency metrics.
This PR adds a GPU metrics polling capability that any inference validation suite can integrate. The reference integration is vllm_single.
Changes
New files
- cvs/lib/utils/gpu.py: capture_gpu_metrics, poll_gpu_metrics, agg_readings, GPU_METRICS, GPU_METRIC_UNITS. Pure library, no import-time side-effects.
- cvs/lib/utils/unittests/test_gpu.py
- cvs/lib/utils/docs/gpu-metrics.md: gpu_poll.log format, threshold JSON schema, failure/None handling table, and gotchas.
Modified files
- cvs/lib/utils/AGENTS.md: gpu.py section: public API table, parameter table, 5-metric derivation table, required conftest fixtures, wiring patterns, pytest_generate_tests parametrize branch, and gotchas.
- cvs/lib/inference/vllm_single.py: test_vllm_inference: pre/post-load VRAM snapshots, model load timing, synchronous poll_gpu_metrics call (client is backgrounded), and agg_readings aggregation into inf_res_dict. Added test_gpu_metric: reads gpu.* keys from inf_res_dict, surfaces each as an HTML row, gates against threshold when enforce_thresholds=True.
- cvs/tests/inference/vllm/conftest.py: gpu_metrics_snap module-scoped fixture. Added test_gpu_metric at rank 4 in pytest_collection_modifyitems sort table (omission caused it to run after test_teardown).
- cvs/tests/inference/vllm/_shared.py
- cvs/lib/inference/unittests/test_vllm_orch_parse.py
- cvs/input/config_file/inference/vllm_single/mi300x_vllm-single_llama31-70b_fp8_threshold.json: gpu.* threshold entries per sweep cell. Initial values are loose / enforce_thresholds: false for characterisation runs.
The 5 derived metrics
- gpu.peak_gpu_memory_mb
- gpu.model_load_memory_mb
- gpu.model_load_s
- gpu.gpu_bandwidth_util_pct
- gpu.gpu_compute_util_pct
Validation
- python -m unittest discover -s cvs/lib/utils/unittests -p "test_gpu.py": all pass
- 10.245.135.11 (g21u31): vllm_single with gpu_poll_val config: 5 GPU metric rows visible in HTML report, gpu_poll.log written to run directory
- AIMVT-245 attachment vllm_single_2026-06-25T124216.zip
Out of scope