[CVS] GPU metrics polling integration for inference validation suites by atnair-amd · Pull Request #241 · ROCm/cvs

atnair-amd · 2026-06-25T19:27:39Z

Jira

AIMVT-245 — GPU metrics polling integration for inference validation suites (DTNI epic AIMVT-202)

Background

CVS inference validation suites (e.g. vllm_single) had no visibility into GPU-level resource utilisation during benchmark runs. There was no record of peak VRAM consumption, memory delta from model load, GPU compute activity, or memory bandwidth utilisation collected alongside throughput and latency metrics.

This PR adds a GPU metrics polling capability that any inference validation suite can integrate. The reference integration is vllm_single.

Changes

New files

| File | Description |
| --- | --- |
| `cvs/lib/utils/gpu.py` | GPU metrics library: `capture_gpu_metrics`, `poll_gpu_metrics`, `agg_readings`, `GPU_METRICS`, `GPU_METRIC_UNITS`. Pure library, no import-time side-effects. |
| `cvs/lib/utils/unittests/test_gpu.py` | 66 implementation-blind unit tests covering zero-value guards, N/A degradation, partial entry exclusion, multi-GPU aggregation, multi-host pooling, and failure/recovery cycling. |
| `cvs/lib/utils/docs/gpu-metrics.md` | User-facing integration guide: 5-step walkthrough, two wiring patterns (sync poll / threaded poll), `gpu_poll.log` format, threshold JSON schema, failure/None handling table, and gotchas. |

Modified files

| File | Change |
| --- | --- |
| `cvs/lib/utils/AGENTS.md` | Added `gpu.py` section: public API table, parameter table, 5-metric derivation table, required conftest fixtures, wiring patterns, `pytest_generate_tests` parametrize branch, and gotchas. |
| `cvs/lib/inference/vllm_single.py` | Added GPU polling to `test_vllm_inference`: pre/post-load VRAM snapshots, model load timing, synchronous `poll_gpu_metrics` call (client is backgrounded), and `agg_readings` aggregation into `inf_res_dict`. Added `test_gpu_metric`: reads `gpu.*` keys from `inf_res_dict`, surfaces each as an HTML row, gates against threshold when `enforce_thresholds=True`. |
| `cvs/tests/inference/vllm/conftest.py` | Added `gpu_metrics_snap` module-scoped fixture. Added `test_gpu_metric` at rank 4 in `pytest_collection_modifyitems` sort table (omission caused it to run after `test_teardown`). |
| `cvs/tests/inference/vllm/_shared.py` | Minor additions to support GPU metric surfacing. |
| `cvs/lib/inference/unittests/test_vllm_orch_parse.py` | Updated unit tests to cover new GPU metric keys. |
| `cvs/input/config_file/inference/vllm_single/mi300x_vllm-single_llama31-70b_fp8_threshold.json` | Added 5 `gpu.*` threshold entries per sweep cell. Initial values are loose / `enforce_thresholds: false` for characterisation runs. |

The 5 derived metrics

| Key | Unit | Aggregation |
| --- | --- | --- |
| `gpu.peak_gpu_memory_mb` | MB | max over polls, each poll summed across GPUs |
| `gpu.model_load_memory_mb` | MB | post-load minus pre-load VRAM snapshot |
| `gpu.model_load_s` | s | wall-clock elapsed while server starts |
| `gpu.gpu_bandwidth_util_pct` | % | mean UMC activity over polls, averaged across GPUs |
| `gpu.gpu_compute_util_pct` | % | mean GFX activity over polls, averaged across GPUs |
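The aggregation rules in the table can be illustrated with a minimal sketch. This is not the `agg_readings` implementation from `gpu.py`: the function name `aggregate_polls`, the reading layout (one dict per poll, keyed by GPU id), and the field names `vram_used_mb`, `gfx_pct`, and `umc_pct` are all assumptions made for illustration. It also assumes that "mean over polls, averaged across GPUs" means averaging across GPUs within each poll first, then averaging those per-poll values.

```python
from statistics import mean


def aggregate_polls(polls):
    """Aggregate per-poll, per-GPU samples into cluster-level values.

    `polls` holds one dict per poll, mapping a GPU id to a sample dict.
    The layout and field names are hypothetical, not the gpu.py schema.
    """
    # Drop polls with no GPU samples so max() and mean() never see empty input.
    usable = [poll for poll in polls if poll]
    if not usable:
        return {
            "gpu.peak_gpu_memory_mb": None,
            "gpu.gpu_compute_util_pct": None,
            "gpu.gpu_bandwidth_util_pct": None,
        }
    return {
        # Sum VRAM across GPUs within each poll, then take the max over polls.
        "gpu.peak_gpu_memory_mb": max(
            sum(gpu["vram_used_mb"] for gpu in poll.values()) for poll in usable
        ),
        # Average across GPUs within each poll, then average over polls.
        "gpu.gpu_compute_util_pct": mean(
            mean(gpu["gfx_pct"] for gpu in poll.values()) for poll in usable
        ),
        "gpu.gpu_bandwidth_util_pct": mean(
            mean(gpu["umc_pct"] for gpu in poll.values()) for poll in usable
        ),
    }
```

Returning `None` when no poll produced data mirrors the None handling the PR mentions, although the exact degradation behaviour of the real library is not shown here.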

Validation

- 66 unit tests: `python -m unittest discover -s cvs/lib/utils/unittests -p "test_gpu.py"` (all pass)
- End-to-end run on core42 node 10.245.135.11 (g21u31): `vllm_single` with `gpu_poll_val` config; 5 GPU metric rows visible in HTML report, `gpu_poll.log` written to run directory
- Run artefacts: AIMVT-245 attachment `vllm_single_2026-06-25T124216.zip`

Out of scope

- Multi-node GPU aggregation (single head-node polling only in v1)
- Per-GPU metric breakdown (cluster-level aggregates only)
- Energy / power metrics in the threshold gate (collected in raw snapshots but not surfaced as HTML rows)
- SGLang, InferenceX, or other inference suite integrations (follow-on stories)

- `gpu.py`: `parse_gpu_metrics`, `capture_gpu_metrics`, `_mean`, `agg_readings`, `poll_gpu_metrics`
- `VllmJob.is_client_done()`: non-raising completion predicate
- `vllm_single` test: poll GPU while client runs, write `gpu_poll.log`, derive 5 metrics
- `_shared.py`: Peak VRAM / Compute % / BW % columns in results table
- `test_gpu.py`: `TestMean`, `TestAggReadings`, `TestPollGpuMetrics` unit test classes
- threshold JSON: `gpu.*` placeholder SLO entries for all 5 cells
- `test_vllm_orch_parse`: update threshold path + exclude `gpu.*` from client key guard
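The "poll GPU while client runs" pattern, driven by a non-raising completion predicate, might look roughly like the sketch below. It is not the `poll_gpu_metrics` code from this PR: `poll_until_done`, its parameters, and the choice to record a failed capture as `None` are assumptions made for illustration. The `capture` and `is_done` callables stand in for something like `capture_gpu_metrics` and `VllmJob.is_client_done()`.

```python
import time


def poll_until_done(capture, is_done, interval_s=1.0, timeout_s=60.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `capture()` repeatedly until `is_done()` is true or the timeout expires.

    Hypothetical helper: a capture that raises is recorded as None so that one
    failed poll does not abort the whole run.
    """
    readings = []
    deadline = clock() + timeout_s
    while True:
        try:
            readings.append(capture())
        except Exception:
            # Keep a placeholder; an aggregation step can exclude it later.
            readings.append(None)
        # Check completion after capturing so at least one reading is taken.
        if is_done() or clock() >= deadline:
            return readings
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays, which is one way a suite of unit tests could exercise failure/recovery cycling.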

The fixture was referenced in test_vllm_inference's parameter list but never defined, causing a setup Error before any inference ran.

amd-smi is a host-side tool — running it via orch.exec() sends it into the container where it doesn't exist. Switch capture_gpu_metrics to orch.exec_on_head() so the command runs on the bare-metal node. Also ensure the out_dir exists before poll_gpu_metrics attempts to write gpu_poll.log, since the directory is created lazily by the job setup. Update unit test mocks from exec to exec_on_head to match.

…c_on_head out_dir is an NFS path on the node, not mounted on the devbox. Write the log to a local tempdir, then base64-encode it and push it to the node via exec_on_head so it lands in the bundle.
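As a rough illustration of the base64 push described in this commit, the sketch below builds a shell command that recreates a text file on the remote side. The helper name `build_push_command` is hypothetical, and the exact command used in the PR is not shown in the source. The resulting string would be handed to something like `exec_on_head`.

```python
import base64
import shlex


def build_push_command(local_text: str, remote_path: str) -> str:
    """Return a shell command that writes `local_text` to `remote_path`.

    Hypothetical helper. The payload is base64-encoded so that newlines and
    shell metacharacters in the log survive being passed through a command line.
    """
    encoded = base64.b64encode(local_text.encode("utf-8")).decode("ascii")
    # The base64 alphabet (A-Z, a-z, 0-9, +, /, =) needs no shell quoting;
    # the destination path is quoted because it may contain spaces.
    return f"printf %s {encoded} | base64 -d > {shlex.quote(remote_path)}"
```

A very large log could exceed the command-line length limit on the node, so this approach suits small poll logs; chunking or a file copy would be needed beyond that.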

…rank Move import time/logging/pathlib from inside poll_gpu_metrics body to module top-level. Add test_gpu_metric at rank 4 in conftest sort table so it runs before test_teardown, not after.
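The ordering fix can be pictured with a minimal rank-table sketch for a `conftest.py`. Only the position of `test_gpu_metric` at rank 4 comes from this PR. The other rank values, the `_RANK` and `_UNRANKED` names, and the `rank_of` helper are assumptions, and the real sort table in `cvs/tests/inference/vllm/conftest.py` may differ.

```python
# Hypothetical rank table; only test_gpu_metric at rank 4 is taken from the PR.
_RANK = {
    "test_vllm_inference": 3,
    "test_gpu_metric": 4,
    "test_teardown": 5,
}
# Names missing from the table sort last, which is how an omitted
# test_gpu_metric could end up running after test_teardown.
_UNRANKED = len(_RANK) + 100


def rank_of(name: str) -> int:
    """Return the sort rank for a test function name."""
    return _RANK.get(name, _UNRANKED)


def pytest_collection_modifyitems(items):
    """Reorder collected tests in place according to the rank table."""
    # list.sort is stable, so tests sharing a rank keep their collection order.
    items.sort(key=lambda item: rank_of(getattr(item, "originalname", None) or item.name))
```

Using `originalname` when available lets parametrized tests such as `test_gpu_metric[...]` match the table entry for the bare function name.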

Add gpu.py API reference to cvs/lib/utils/AGENTS.md: public symbols, poll_gpu_metrics parameter table, 5-metric derivation table, required conftest fixtures (gpu_metrics_snap), two wiring patterns (sync poll / threaded poll), pytest_generate_tests parametrize branch, collection sort rank table, and gotchas (threshold key prefix, capture can raise, or-None semantics, full actuals for evaluate_all, GATED_METRICS). Add cvs/lib/utils/docs/gpu-metrics.md: user-facing integration guide covering the 5 derived metrics, polling lifecycle, 5-step integration walkthrough, gpu_poll.log format, failure/None handling table, and cross-references to ADDING_A_SUITE.md and threshold-kinds.md.

…in zip bundle Previously the log was written to a tempfile then uploaded to the NFS out_dir; because the zip plugin only bundles the local html report directory, the log never appeared in the run archive. Now it is written directly into the _test_html_dir folder (e.g. vllm_single_html/) so every run archive contains the poll log alongside the per-test HTML files. The NFS upload is kept for cluster-side access. Update gpu-metrics.md integration guide to match the correct log_path pattern and to describe where the log lands.

atnair-amd added 6 commits June 24, 2026 19:21

fix(vllm_single): add missing gpu_metrics_snap module-scope fixture

c555c51

fix(vllm_single): write gpu_poll.log to tmp then copy to node via exe…

d08063f

fix(gpu): move deferred imports to module level; fix test_gpu_metric …

717284f

atnair-amd self-assigned this Jun 25, 2026

atnair-amd requested review from amd-droy, anujmittal-amd, hnimra-amd, solaiys and sukesh-amd June 25, 2026 19:44

atnair-amd wants to merge 7 commits into dev/dtni from atnair/dtni-gpu-api
