Hnimrama/inferencemax uplift restore by hnimra-amd · Pull Request #229 · ROCm/cvs

hnimra-amd · 2026-06-18T05:01:03Z

Summary

Restores the InferenceMax / DTNI uplift from Hnimrama/inferencemax uplift #225 after it was reverted by Revert "Hnimrama/inferencemax uplift" #228.
Adds a follow-up fix so the vLLM benchmark client runs with the same Python that imports vllm (ROCm images where python3 is not the vLLM interpreter), and hardens docker exec / bash -c quoting.

Test plan

cvs run inferencemax_single (or your usual InferenceMax path) on MI300x + current vLLM container.

…inery Rename cvs/lib/dtni to cvs/lib/utils ("utils" says what it is: pure functions any lib can call; "dtni" was a leftover project codename). verdict.py moves unchanged. config_loader.py is trimmed to the framework-agnostic half: the paths/model/image/container schema, the 3-pass placeholder substitution, the enforce_thresholds gate on a new BaseVariantConfig, and a substitute_config() helper (file read + substitution + sibling-threshold discovery). The inference-only schema moves to a sibling module in a later commit.

The client.* metric parser (to_client_metrics + CLIENT_METRICS surface) is inference-specific and should not sit in the shared utils dir. Move it under a new cvs/lib/inference/utils package. Content unchanged.

The inference half of the old config_loader: GoodputSlo, SeqCombo, Sweep, Params, Roles, and VariantConfig(BaseVariantConfig) with cell_key and the threshold-coverage check. load_variant() delegates the file read and placeholder substitution to the generic substitute_config(). Replaces the sequence_combinations x concurrency_levels cartesian with a named-combo + explicit runs[] selector: each run is a {combo, concurrency} pair, so the config enumerates exactly the cells to run (no NxM explosion). A model_validator rejects duplicate combo names and runs referencing an unknown combo at load time.

…rsing build_server_cmd/start_server assemble a `vllm serve` arg list in Python (mirroring run_client) instead of cloning and running an external .sh, so a run needs no hand-staged script. --max-model-len is derived per cell from isl/osl/random_range_ratio so any sweep change stays self-consistent. parse_results reads the stock extensionless `results` JSON artifact that `vllm bench serve` writes to --result-dir and delegates namespacing + derived-metric math to vllm_parsing.to_client_metrics, replacing the brittle console-log regex table. Missing/empty/unparseable artifacts hard-fail the cell rather than recording a silently-green empty row. Adds the optional --goodput SLO gate (per-cell, omitted when no SLO) and threads goodput_slo through run().

pytest_generate_tests now drives parametrization from the named-combo + runs[] selector instead of the cartesian. test_vllm_inference only runs the benchmark and stashes results; the verdict moves into a new test_metric (one pytest test = one HTML row per metric per cell), with inline Value/Unit columns added via pytest_html hooks in conftest. test_setup_sshd gates its 2224 probe on len(orch.hosts) > 1, mirroring the single-node orchestrator guard: single-node runs skip the in-container sshd (it exists only for inter-node MPI) and must not probe for it. Import paths follow the dtni -> utils / inference.utils moves.

…luster file Rename the config/threshold pair to {model}_{precision}_{config|threshold} .json and convert the sweep to the named-combo + runs[] selector. Pin the image to rocm/vllm-dev:nightly (the previously pinned :nightly-sshd tag does not exist on Docker Hub, and single-node runs skip in-container sshd). Delete cvs/input/cluster_file/mi300x_vllm_single.json: a cluster file only needs node IP + user/key/orchestrator; the variant config supplies the container block, so the bespoke per-suite cluster file is redundant.

…uards Cover to_client_metrics purity + derived metrics, the named-combo/runs sweep selector (expansion, unknown-combo and duplicate-name rejection), the run_client goodput/metric-percentiles flags, table-cell rendering, and the verdict None-guards. Adds JSON fixtures for the stock results artifact.

Add a human reference guide (plans/building-a-cvs-test-suite.md) that walks the six-layer suite architecture using vllm_single as the worked example: the generic <-> framework config seam, the named-combo + runs[] sweep selector, the self-contained Python-built server cmd, lifecycle-as-tests, and a checklist for authoring a new inference or training suite. Add per-package AGENTS.md docs naming the public entry points, the seam, and the non-obvious gotchas: - cvs/lib/utils: substitute_config / BaseVariantConfig / evaluate_all, the 3-pass placeholder order, sibling-glob threshold discovery, parent-first validator ordering. - cvs/lib/inference/utils: load_variant / to_client_metrics / CLIENT_METRICS, the cell_key single-source-of-truth, the coverage check that prevents a silent green, and the validators mirrored in pytest_generate_tests.

… rules Document the dtni -> utils rename and the shared (cvs/lib/utils) vs domain-specific (cvs/lib/inference/utils) split: what lives where, the rule for placing a new helper, and the directory map. Note training is not yet ported and this guide is the blueprint for that port. Remove the manual --- horizontal rules between sections: heading levels already render their own bottom border, so the extra rules produced a double-underline. Minor prose/format cleanups.

The threshold file carried placeholder CONC=64/128/256 entries while the sweep's only run is concurrency 16, so cell_key() matched no threshold. The mismatch was masked by enforce_thresholds=false (warned, not raised) and would have failed load the instant enforcement was flipped on. Re-key to the single CONC=16 cell the runs selector actually enumerates.

MODEL/ISL/OSL/MAX_MODEL_LEN/RANDOM_RANGE_RATIO/TP/CONC/PORT were exported into /tmp/server_env_script.sh but read by nothing after the .sh->Python server command refactor -- both _server_argv and run_client pass these as explicit flags. Keep only the env the vllm process actually consumes (HF token, HF cache pin, AITER flags). Also drops the second _derive_max_model_len call that fed the dead MAX_MODEL_LEN export.

_server_argv hard-coded --kv-cache-dtype fp8, baking a per-model property into the shared orchestrator -- a non-fp8-KV model dropped into the suite would be served wrong with no config recourse (extra_serve_args can only add, so an override would pass the flag twice). Declare it in the W1 config's roles.server.extra_serve_args instead; the driver stays model-agnostic and 'new model = new config' holds.

…nd collection pytest_generate_tests hand-reimplemented the duplicate-name and unknown-run.combo checks that Sweep._check_runs_reference_known_combos already enforces, with divergent semantics (first-failure raise vs all-at-once). Extract validate_sweep_selector() as the single home and call it from both the typed validator (load time) and the collection-time raw-JSON path so the rule can't drift.

out_dir was fixed per job, so a multi-cell sweep would overwrite each cell's `results` and client.log, and parse_results could cat a prior cell's stale artifact if the current cell's client failed to write one. Key it by cell. Latent today (the shipped sweep has one cell).

run_client accepted goodput_slo as either a dict (.get) or an object (getattr) via a per-key hasattr branch, but the only production caller passes a raw dict -- the object path existed solely for a unit test, and the dual path meant the typed GoodputSlo's validation never reached the command builder. Consume the dict only and drop the object-form test.

goodput_slo_unused was never read (goodput is threaded through _make_job).

The committed vllm_single example config carried a personal hf-token path and a concrete image tag (duplicated in image.tag and container.image). Replace all three with <changeme> so the file is a template a new user must fill in -- it still loads (collection works) and only fails at run time when an unedited value is read, which is the intended signal.

The per-model server knobs were a flat [flag, value, flag, value] list (roles.server.extra_serve_args), which reads poorly. Replace with a roles.server.serve_args {flag: value} map (flag without the leading --): a scalar renders --flag value, True a bare --flag, a list the flag once per element -- so it stays readable while still covering vllm bare/repeatable flags. _server_argv flattens the map via a new _flatten_serve_args helper; the derived flags (tp/max-model-len/port) stay code-built. Also repoints a stale unit test that asserted MAX_MODEL_LEN in the env script (it moved to the --max-model-len flag in an earlier commit) to assert against the server argv instead.

The only ceiling kind was max_ms, whose message hard-codes ms. A count metric like client.failed needs an upper bound without the unit lie; add a plain max with the same comparison and an honest message.

Previously only cell-presence was validated: a cell could exist while a given metric had no spec, and test_metric (spec is None -> return) would silently report a green record-only row even under enforce_thresholds=true. A new perf metric was thus unvalidated by default. Declare GATED_METRICS beside CLIENT_METRICS -- the perf+health subset that must assert (throughput, mean+p99 latency, success_rate/failed) -- and extend _check_thresholds_cover_sweep to require a spec for every gated metric in every present cell, reusing the same enforce-vs-warn path. A new metric is record-only until added to the set; once gated, the loader forces a spec in every cell before the suite can run green. Inputs, totals, and derived diagnostics stay record-only by design.

The image was declared twice: top-level image.tag (live -- conftest copied it onto the container block) and container.image (dead -- overwritten by that copy). The duplicate forced a top-level image block whose remote field was unused and whose tag silently shadowed container.image. Make container.image the single source: drop the top-level ImageSpec block from the generic BaseVariantConfig, drop the conftest overwrite so the merged container.image is used as-is, and remove the now-schemaless image block from the example config.

Expand GATED_METRICS from the mean+p99 subset to every emitted latency quantile (mean/median/p90/p95/p99) for ttft, tpot, itl, and e2el -- itl omits p90 as CLIENT_METRICS has no producer for it. Throughput and success_rate/failed health are unchanged. Inputs, totals, secondary throughputs, and derived diagnostics stay record-only. The example threshold file gains a placeholder spec for each newly gated metric (23 total) so the loader gated-coverage check passes.

- utils/AGENTS.md: drop top-level image (now container.image); add max verdict kind - inference/utils/AGENTS.md: document GATED_METRICS contract + dual-axis coverage check - building-a-cvs-test-suite.md: container.image, serve_args map, max kind, GATED_METRICS - dtni-dev-guide.md: SUPERSEDED banner pointing to building guide + AGENTS.md

Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.

Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.

Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.

Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.

Wheels often omit vllm/benchmarks; resolve the driver via eval exports, run python -m vllm.entrypoints.cli.main bench serve when needed, and fail fast on missing-script log patterns in InferenceMax and base polling.

vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.

Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.

…st polling Gate benchmark success on Failed requests only after the summary is present; tail more client log lines for InferenceMax. Variant and benchmark_params accept bench_max_failed_requests (default 0 remains strict for CI).

Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.

…ase 2) Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.

…Phase 2) Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.

…ion 1 (Phase 2) Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.

Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.

Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.

…_args (Phase 3) MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.

… (Phase 3) Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.

…ase 3) Document Python serve, client.* metrics, and expanded lifecycle test stages.

Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.

InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.

…(Phase 5) Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.

Verify stock results artifact maps to client.* metrics via FakeOrch.

amd-droy · 2026-06-23T19:51:42Z

+    if not orch.verify_containers_running(name):
+        lifecycle.failed = True
+        pytest.fail(f"container {name} not running after setup_containers()")
+    time.sleep(30)


will you need this gap of 30 secs after the testcase?

amd-droy · 2026-06-23T19:54:11Z

+
+
+def _du_bytes(orch, path):
+    out = orch.exec(f"bash -c {shlex.quote(f'du -sb {shlex.quote(path)} 2>/dev/null | cut -f1')}")


The 2>/dev/null silently swallows errors, and the function returns 0 on any failure. Since test_model_fetch uses the return value of 0 to detect "no model present," a network failure, permission error, or du not being available would be indistinguishable from a genuinely missing model

amd-droy · 2026-06-23T19:56:57Z

+        lifecycle.record(request.node.nodeid, "server_ready", time.monotonic() - t)
+        t_client = time.monotonic()
+        job.run_client()
+        job.wait_client_complete()


better to have a timeout here. If the inference client hangs, this will block indefinitely with no recovery path.

amd-droy · 2026-06-23T19:59:49Z

+        return fp.read().strip()
+
+
+@pytest.fixture(scope="session")


will the scope be session or module?

anujmittal-amd

Can we add your changed under new folder structure? Inferencemax has been depreciated, you should create folder under inference as atom or reuse the vllm/sglang structure and add atom in that location. Lets discuss this.
Same comment for single/distributed; do we need a separate folder for single/multi?
dtni folder should be renamed based on vLLM structure.

Once you are ready for final merge, please add your end to end test results to the PR along with original inferenceMax test run to identify there are no regressions.

Thanks

hnimra-amd · 2026-06-24T00:57:28Z

Can we add your changed under new folder structure? Inferencemax has been depreciated, you should create folder under inference as atom or reuse the vllm/sglang structure and add atom in that location. Lets discuss this. Same comment for single/distributed; do we need a separate folder for single/multi? dtni folder should be renamed based on vLLM structure.

Once you are ready for final merge, please add your end to end test results to the PR along with original inferenceMax test run to identify there are no regressions.

Thanks

@anujmittal-amd I addressed the structural feedback on the IX-atom branch, this is the PR for it: #238
and it builds on the current PR

Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.

hnimra-amd requested review from amd-droy, anujmittal-amd and atnair-amd June 18, 2026 06:55

atnair-amd and others added 27 commits June 18, 2026 21:07

refactor(inference): move vllm_parsing into cvs/lib/inference/utils

c0e7413

The client.* metric parser (to_client_metrics + CLIENT_METRICS surface) is inference-specific and should not sit in the shared utils dir. Move it under a new cvs/lib/inference/utils package. Content unchanged.

docs: demote headings so GitHub stops underlining sections

00c9cdc

removing old plan

94d2081

test(inference): drop unused _fake_variant parameter

6433f86

goodput_slo_unused was never read (goodput is threaded through _make_job).

feat(verdict): add unit-agnostic max threshold kind

f217c9d

The only ceiling kind was max_ms, whose message hard-codes ms. A count metric like client.failed needs an upper bound without the unit lie; add a plain max with the same comparison and an honest message.

Restore InferenceMax uplift reverted by #228

b553fcc

Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.

fix(inference): run vLLM bench client with vLLM interpreter

1f1ffdc

Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.

hnimra-amd added 6 commits June 23, 2026 10:55

fix(dtni): broaden vLLM benchmark script discovery

81cadd3

Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.

fix(inference): harden InferenceMax server startup and GPU mem env

abe838c

Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.

fix(dtni): clamp bench random-range to max_model_length

2ac3957

vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.

fix(inference): extend InferenceMax bench client poll budget

7203562

Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.

hnimra-amd force-pushed the hnimrama/inferencemax-uplift-restore branch from 7e9bc30 to 730b408 Compare June 23, 2026 18:36

hnimra-amd added 13 commits June 23, 2026 11:45

feat(inference): add typed InferenceMax config loader (Phase 1)

65b5ec4

Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.

feat(inference): migrate InferenceMax configs to schema_version 1 (Ph…

718e066

…ase 2) Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.

test(inference): wire inferencemax_single to typed config and sweep (…

6062312

…Phase 2) Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.

docs(inference): update InferenceMax config reference for schema_vers…

26aba03

…ion 1 (Phase 2) Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.

docs: fix stale dtni.config_loader references (Phase 1 tail)

44bb053

Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.

feat(inference): rewrite InferenceMaxJob like VllmJob (Phase 3)

246250b

Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.

feat(inference): move InferenceMax server flags to roles.server.serve…

79969ba

…_args (Phase 3) MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.

test(inference): align inferencemax_single suite with VllmJob pattern…

8bf942d

… (Phase 3) Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.

docs(inference): update InferenceMax reference for Phase 3 driver (Ph…

17b902f

…ase 3) Document Python serve, client.* metrics, and expanded lifecycle test stages.

chore(inference): remove unused inferencemax_host_scripts (Phase 5)

3838d98

Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.

docs: clarify vllm_benchmark_scripts are legacy-only (Phase 5)

cdeef6a

InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.

docs(inference): rewrite InferenceMax reference for schema_version 1 …

ba93dfb

…(Phase 5) Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.

test(inference): add InferenceMaxJob parse_results unit test (Phase 5)

0e39316

Verify stock results artifact maps to client.* metrics via FakeOrch.

amd-droy reviewed Jun 23, 2026

View reviewed changes

anujmittal-amd requested changes Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hnimrama/inferencemax uplift restore#229

Hnimrama/inferencemax uplift restore#229
hnimra-amd wants to merge 46 commits into
dev/dtnifrom
hnimrama/inferencemax-uplift-restore

hnimra-amd commented Jun 18, 2026

Uh oh!

amd-droy Jun 23, 2026

Uh oh!

amd-droy Jun 23, 2026

Uh oh!

amd-droy Jun 23, 2026

Uh oh!

amd-droy Jun 23, 2026

Uh oh!

anujmittal-amd left a comment

Uh oh!

hnimra-amd commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants



		def _du_bytes(orch, path):
		out = orch.exec(f"bash -c {shlex.quote(f'du -sb {shlex.quote(path)} 2>/dev/null \| cut -f1')}")

Uh oh!

Conversation

hnimra-amd commented Jun 18, 2026

Summary

Test plan

Uh oh!

amd-droy Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

amd-droy Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

amd-droy Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

amd-droy Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

anujmittal-amd left a comment

Choose a reason for hiding this comment

Uh oh!

hnimra-amd commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants