[CVS] GPU metrics polling integration for inference validation suites #241

Open
atnair-amd wants to merge 7 commits into dev/dtni from atnair/dtni-gpu-api

Conversation

@atnair-amd

Collaborator

Jira

AIMVT-245 — GPU metrics polling integration for inference validation suites (DTNI epic AIMVT-202)


Background

CVS inference validation suites (e.g. vllm_single) had no visibility into GPU-level resource utilisation during benchmark runs. There was no record of peak VRAM consumption, memory delta from model load, GPU compute activity, or memory bandwidth utilisation collected alongside throughput and latency metrics.

This PR adds a GPU metrics polling capability that any inference validation suite can integrate. The reference integration is vllm_single.


Changes

New files

| File | Description |
| --- | --- |
| cvs/lib/utils/gpu.py | GPU metrics library: capture_gpu_metrics, poll_gpu_metrics, agg_readings, GPU_METRICS, GPU_METRIC_UNITS. Pure library, no import-time side-effects. |
| cvs/lib/utils/unittests/test_gpu.py | 66 implementation-blind unit tests covering zero-value guards, N/A degradation, partial entry exclusion, multi-GPU aggregation, multi-host pooling, and failure/recovery cycling. |
| cvs/lib/utils/docs/gpu-metrics.md | User-facing integration guide: 5-step walkthrough, two wiring patterns (sync poll / threaded poll), gpu_poll.log format, threshold JSON schema, failure/None handling table, and gotchas. |

Modified files

| File | Change |
| --- | --- |
| cvs/lib/utils/AGENTS.md | Added gpu.py section: public API table, parameter table, 5-metric derivation table, required conftest fixtures, wiring patterns, pytest_generate_tests parametrize branch, and gotchas. |
| cvs/lib/inference/vllm_single.py | Added GPU polling to test_vllm_inference: pre/post-load VRAM snapshots, model load timing, synchronous poll_gpu_metrics call (client is backgrounded), and agg_readings aggregation into inf_res_dict. Added test_gpu_metric: reads gpu.* keys from inf_res_dict, surfaces each as an HTML row, gates against threshold when enforce_thresholds=True. |
| cvs/tests/inference/vllm/conftest.py | Added gpu_metrics_snap module-scoped fixture. Added test_gpu_metric at rank 4 in pytest_collection_modifyitems sort table (omission caused it to run after test_teardown). |
| cvs/tests/inference/vllm/_shared.py | Minor additions to support GPU metric surfacing. |
| cvs/lib/inference/unittests/test_vllm_orch_parse.py | Updated unit tests to cover new GPU metric keys. |
| cvs/input/config_file/inference/vllm_single/mi300x_vllm-single_llama31-70b_fp8_threshold.json | Added 5 gpu.* threshold entries per sweep cell. Initial values are loose / enforce_thresholds: false for characterisation runs. |
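
For reviewers who want a mental model of the gate in test_gpu_metric, here is a minimal sketch. It is not the code in this PR: the function name, the flat `{key: upper_bound}` threshold shape, and the "None is reported but never gated" rule are all assumptions made for illustration. The real threshold schema is described in gpu-metrics.md.

```python
def gate_gpu_metrics_sketch(actuals, thresholds, enforce_thresholds=False):
    """Return the gpu.* keys that exceed their bound (hypothetical upper-bound schema)."""
    failures = []
    for key, upper_bound in thresholds.items():
        if not key.startswith("gpu."):
            continue  # only GPU metrics are considered in this sketch
        value = actuals.get(key)
        if value is None:
            continue  # assumption: an unavailable metric is shown but not gated
        if enforce_thresholds and value > upper_bound:
            failures.append(key)
    return failures
```

With enforce_thresholds left at False, as in the characterisation runs, this sketch never reports a failure and the metrics are only displayed.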

The 5 derived metrics

| Key | Unit | Aggregation |
| --- | --- | --- |
| gpu.peak_gpu_memory_mb | MB | max over polls, each poll summed across GPUs |
| gpu.model_load_memory_mb | MB | post-load minus pre-load VRAM snapshot |
| gpu.model_load_s | s | wall-clock elapsed while server starts |
| gpu.gpu_bandwidth_util_pct | % | mean UMC activity over polls, averaged across GPUs |
| gpu.gpu_compute_util_pct | % | mean GFX activity over polls, averaged across GPUs |
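
The three poll-derived aggregations can be illustrated with a small sketch. It is not agg_readings itself: the function name, the input shape (a list of polls, each a list of per-GPU dicts), and the field names vram_used_mb, umc_activity_pct and gfx_activity_pct are hypothetical. It also assumes "averaged across GPUs" means averaging per poll first, then across polls.

```python
from statistics import mean


def aggregate_polls_sketch(polls):
    """polls: list of polls; each poll is a list of per-GPU dicts (hypothetical shape)."""
    polls = [poll for poll in polls if poll]  # drop empty polls so mean() cannot fail
    if not polls:
        return {
            "gpu.peak_gpu_memory_mb": None,
            "gpu.gpu_bandwidth_util_pct": None,
            "gpu.gpu_compute_util_pct": None,
        }
    return {
        # max over polls, each poll summed across GPUs
        "gpu.peak_gpu_memory_mb": max(
            sum(gpu["vram_used_mb"] for gpu in poll) for poll in polls
        ),
        # mean over polls of the per-poll average across GPUs
        "gpu.gpu_bandwidth_util_pct": mean(
            mean(gpu["umc_activity_pct"] for gpu in poll) for poll in polls
        ),
        "gpu.gpu_compute_util_pct": mean(
            mean(gpu["gfx_activity_pct"] for gpu in poll) for poll in polls
        ),
    }
```

Summing VRAM across GPUs before taking the maximum yields a whole-node peak rather than the busiest single GPU, which matches the "cluster-level aggregates only" scope of v1.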

Validation

  • 66 unit tests: python -m unittest discover -s cvs/lib/utils/unittests -p "test_gpu.py" — all pass
  • End-to-end run on core42 node 10.245.135.11 (g21u31): vllm_single with gpu_poll_val config — 5 GPU metric rows visible in HTML report, gpu_poll.log written to run directory
  • Run artefacts: AIMVT-245 attachment vllm_single_2026-06-25T124216.zip

Out of scope

  • Multi-node GPU aggregation (single head-node polling only in v1)
  • Per-GPU metric breakdown (cluster-level aggregates only)
  • Energy / power metrics in the threshold gate (collected in raw snapshots but not surfaced as HTML rows)
  • SGLang, InferenceX, or other inference suite integrations (follow-on stories)

- gpu.py: parse_gpu_metrics, capture_gpu_metrics, _mean, agg_readings, poll_gpu_metrics
- VllmJob.is_client_done(): non-raising completion predicate
- vllm_single test: poll GPU while client runs, write gpu_poll.log, derive 5 metrics
- _shared.py: Peak VRAM / Compute % / BW % columns in results table
- test_gpu.py: TestMean, TestAggReadings, TestPollGpuMetrics unit test classes
- threshold JSON: gpu.* placeholder SLO entries for all 5 cells
- test_vllm_orch_parse: update threshold path + exclude gpu.* from client key guard
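
A rough sketch of how a non-raising completion predicate can drive the polling loop. The names below are illustrative and not the signatures in gpu.py: capture stands in for something like capture_gpu_metrics, and is_done for something like VllmJob.is_client_done. Recording None for a failed poll is an assumption about how failures might be represented.

```python
import time


def poll_while_running_sketch(capture, is_done, interval_s=5.0,
                              max_polls=1000, sleep=time.sleep):
    """Collect readings until is_done() returns True or max_polls is reached."""
    readings = []
    for _ in range(max_polls):
        if is_done():  # predicate must not raise, or the loop would abort mid-run
            break
        try:
            readings.append(capture())
        except Exception:
            readings.append(None)  # assumption: keep the cadence, mark the gap
        sleep(interval_s)
    return readings
```

The max_polls cap is there so that a client which never reports completion cannot hang the test indefinitely.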
The fixture was referenced in test_vllm_inference's parameter list but
never defined, causing a setup Error before any inference ran.
amd-smi is a host-side tool — running it via orch.exec() sends it into
the container where it doesn't exist. Switch capture_gpu_metrics to
orch.exec_on_head() so the command runs on the bare-metal node.

Also ensure the out_dir exists before poll_gpu_metrics attempts to write
gpu_poll.log, since the directory is created lazily by the job setup.

Update unit test mocks from exec to exec_on_head to match.
…c_on_head

out_dir is an NFS path on the node, not mounted on the devbox.
Write the log to a local tempdir, then base64-encode it and push
it to the node via exec_on_head so it lands in the bundle.
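
The base64 push described above could look roughly like the following. This is a sketch under assumptions, not the committed code: the helper name is hypothetical, and it assumes the node has a POSIX shell with printf and GNU-style base64 -d available.

```python
import base64
import shlex


def build_push_command_sketch(local_bytes: bytes, remote_path: str) -> str:
    """Build a shell command that recreates local_bytes at remote_path."""
    # Base64 keeps newlines and quotes in the log from breaking the shell command.
    payload = base64.b64encode(local_bytes).decode("ascii")
    return (
        f"printf %s {shlex.quote(payload)} | base64 -d > {shlex.quote(remote_path)}"
    )
```

The resulting string would then be handed to exec_on_head. Because the whole payload travels on the command line, this approach only suits small files such as a poll log; very large logs could exceed the shell's argument length limit.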
…rank

Move import time/logging/pathlib from inside poll_gpu_metrics body to
module top-level. Add test_gpu_metric at rank 4 in conftest sort table
so it runs before test_teardown, not after.
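
To show why a missing rank entry pushed test_gpu_metric past test_teardown, here is a simplified sketch of a rank-table sort. Only the rank 4 for test_gpu_metric comes from this PR; the other ranks, the fallback value, and the helper name are hypothetical, and the real conftest may treat unranked tests differently.

```python
# Hypothetical rank table; only test_gpu_metric's rank of 4 is taken from this PR.
RANKS_SKETCH = {
    "test_vllm_inference": 1,
    "test_gpu_metric": 4,
    "test_teardown": 9,
}
UNRANKED = 10_000  # assumption: tests missing from the table sort last


def sort_names_sketch(names):
    """Order test names by rank, ignoring any parametrize suffix like [cell0]."""
    def rank(name):
        base = name.split("[", 1)[0]
        return RANKS_SKETCH.get(base, UNRANKED)
    return sorted(names, key=rank)  # sorted() is stable, so ties keep their order
```

In a conftest, the same key function would typically be applied in place to the items list inside pytest_collection_modifyitems.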
Add gpu.py API reference to cvs/lib/utils/AGENTS.md: public symbols,
poll_gpu_metrics parameter table, 5-metric derivation table, required
conftest fixtures (gpu_metrics_snap), two wiring patterns (sync poll /
threaded poll), pytest_generate_tests parametrize branch, collection
sort rank table, and gotchas (threshold key prefix, capture can raise,
or-None semantics, full actuals for evaluate_all, GATED_METRICS).

Add cvs/lib/utils/docs/gpu-metrics.md: user-facing integration guide
covering the 5 derived metrics, polling lifecycle, 5-step integration
walkthrough, gpu_poll.log format, failure/None handling table, and
cross-references to ADDING_A_SUITE.md and threshold-kinds.md.
…in zip bundle

Previously the log was written to a tempfile then uploaded to the NFS out_dir;
because the zip plugin only bundles the local html report directory, the log
never appeared in the run archive. Now it is written directly into the _test_html_dir
folder (e.g. vllm_single_html/) so every run archive contains the poll log alongside
the per-test HTML files. The NFS upload is kept for cluster-side access.

Update gpu-metrics.md integration guide to match the correct log_path pattern
and to describe where the log lands.