Hnimrama/inferencemax uplift restore#229
Conversation
…inery
Rename cvs/lib/dtni to cvs/lib/utils ("utils" says what it is: pure
functions any lib can call; "dtni" was a leftover project codename).
verdict.py moves unchanged.
config_loader.py is trimmed to the framework-agnostic half: the
paths/model/image/container schema, the 3-pass placeholder substitution,
the enforce_thresholds gate on a new BaseVariantConfig, and a
substitute_config() helper (file read + substitution + sibling-threshold
discovery). The inference-only schema moves to a sibling module in a
later commit.
The client.* metric parser (to_client_metrics + CLIENT_METRICS surface) is inference-specific and should not sit in the shared utils dir. Move it under a new cvs/lib/inference/utils package. Content unchanged.
The inference half of the old config_loader: GoodputSlo, SeqCombo, Sweep,
Params, Roles, and VariantConfig(BaseVariantConfig) with cell_key and the
threshold-coverage check. load_variant() delegates the file read and
placeholder substitution to the generic substitute_config().
Replaces the sequence_combinations x concurrency_levels cartesian with a
named-combo + explicit runs[] selector: each run is a {combo, concurrency}
pair, so the config enumerates exactly the cells to run (no NxM explosion).
A model_validator rejects duplicate combo names and runs referencing an
unknown combo at load time.
…rsing build_server_cmd/start_server assemble a `vllm serve` arg list in Python (mirroring run_client) instead of cloning and running an external .sh, so a run needs no hand-staged script. --max-model-len is derived per cell from isl/osl/random_range_ratio so any sweep change stays self-consistent. parse_results reads the stock extensionless `results` JSON artifact that `vllm bench serve` writes to --result-dir and delegates namespacing + derived-metric math to vllm_parsing.to_client_metrics, replacing the brittle console-log regex table. Missing/empty/unparseable artifacts hard-fail the cell rather than recording a silently-green empty row. Adds the optional --goodput SLO gate (per-cell, omitted when no SLO) and threads goodput_slo through run().
pytest_generate_tests now drives parametrization from the named-combo + runs[] selector instead of the cartesian. test_vllm_inference only runs the benchmark and stashes results; the verdict moves into a new test_metric (one pytest test = one HTML row per metric per cell), with inline Value/Unit columns added via pytest_html hooks in conftest. test_setup_sshd gates its 2224 probe on len(orch.hosts) > 1, mirroring the single-node orchestrator guard: single-node runs skip the in-container sshd (it exists only for inter-node MPI) and must not probe for it. Import paths follow the dtni -> utils / inference.utils moves.
…luster file
Rename the config/threshold pair to {model}_{precision}_{config|threshold}
.json and convert the sweep to the named-combo + runs[] selector. Pin the
image to rocm/vllm-dev:nightly (the previously pinned :nightly-sshd tag
does not exist on Docker Hub, and single-node runs skip in-container sshd).
Delete cvs/input/cluster_file/mi300x_vllm_single.json: a cluster file only
needs node IP + user/key/orchestrator; the variant config supplies the
container block, so the bespoke per-suite cluster file is redundant.
…uards Cover to_client_metrics purity + derived metrics, the named-combo/runs sweep selector (expansion, unknown-combo and duplicate-name rejection), the run_client goodput/metric-percentiles flags, table-cell rendering, and the verdict None-guards. Adds JSON fixtures for the stock results artifact.
Add a human reference guide (plans/building-a-cvs-test-suite.md) that walks
the six-layer suite architecture using vllm_single as the worked example:
the generic <-> framework config seam, the named-combo + runs[] sweep
selector, the self-contained Python-built server cmd, lifecycle-as-tests,
and a checklist for authoring a new inference or training suite.
Add per-package AGENTS.md docs naming the public entry points, the seam,
and the non-obvious gotchas:
- cvs/lib/utils: substitute_config / BaseVariantConfig / evaluate_all,
the 3-pass placeholder order, sibling-glob threshold discovery,
parent-first validator ordering.
- cvs/lib/inference/utils: load_variant / to_client_metrics / CLIENT_METRICS,
the cell_key single-source-of-truth, the coverage check that prevents a
silent green, and the validators mirrored in pytest_generate_tests.
… rules Document the dtni -> utils rename and the shared (cvs/lib/utils) vs domain-specific (cvs/lib/inference/utils) split: what lives where, the rule for placing a new helper, and the directory map. Note training is not yet ported and this guide is the blueprint for that port. Remove the manual --- horizontal rules between sections: heading levels already render their own bottom border, so the extra rules produced a double-underline. Minor prose/format cleanups.
The threshold file carried placeholder CONC=64/128/256 entries while the sweep's only run is concurrency 16, so cell_key() matched no threshold. The mismatch was masked by enforce_thresholds=false (warned, not raised) and would have failed load the instant enforcement was flipped on. Re-key to the single CONC=16 cell the runs selector actually enumerates.
MODEL/ISL/OSL/MAX_MODEL_LEN/RANDOM_RANGE_RATIO/TP/CONC/PORT were exported into /tmp/server_env_script.sh but read by nothing after the .sh->Python server command refactor -- both _server_argv and run_client pass these as explicit flags. Keep only the env the vllm process actually consumes (HF token, HF cache pin, AITER flags). Also drops the second _derive_max_model_len call that fed the dead MAX_MODEL_LEN export.
_server_argv hard-coded --kv-cache-dtype fp8, baking a per-model property into the shared orchestrator -- a non-fp8-KV model dropped into the suite would be served wrong with no config recourse (extra_serve_args can only add, so an override would pass the flag twice). Declare it in the W1 config's roles.server.extra_serve_args instead; the driver stays model-agnostic and 'new model = new config' holds.
…nd collection pytest_generate_tests hand-reimplemented the duplicate-name and unknown-run.combo checks that Sweep._check_runs_reference_known_combos already enforces, with divergent semantics (first-failure raise vs all-at-once). Extract validate_sweep_selector() as the single home and call it from both the typed validator (load time) and the collection-time raw-JSON path so the rule can't drift.
out_dir was fixed per job, so a multi-cell sweep would overwrite each cell's `results` and client.log, and parse_results could cat a prior cell's stale artifact if the current cell's client failed to write one. Key it by cell. Latent today (the shipped sweep has one cell).
run_client accepted goodput_slo as either a dict (.get) or an object (getattr) via a per-key hasattr branch, but the only production caller passes a raw dict -- the object path existed solely for a unit test, and the dual path meant the typed GoodputSlo's validation never reached the command builder. Consume the dict only and drop the object-form test.
goodput_slo_unused was never read (goodput is threaded through _make_job).
The committed vllm_single example config carried a personal hf-token path and a concrete image tag (duplicated in image.tag and container.image). Replace all three with <changeme> so the file is a template a new user must fill in -- it still loads (collection works) and only fails at run time when an unedited value is read, which is the intended signal.
The per-model server knobs were a flat [flag, value, flag, value] list
(roles.server.extra_serve_args), which reads poorly. Replace with a
roles.server.serve_args {flag: value} map (flag without the leading --):
a scalar renders --flag value, True a bare --flag, a list the flag once per
element -- so it stays readable while still covering vllm bare/repeatable
flags. _server_argv flattens the map via a new _flatten_serve_args helper;
the derived flags (tp/max-model-len/port) stay code-built.
Also repoints a stale unit test that asserted MAX_MODEL_LEN in the env
script (it moved to the --max-model-len flag in an earlier commit) to assert
against the server argv instead.
The only ceiling kind was max_ms, whose message hard-codes ms. A count metric like client.failed needs an upper bound without the unit lie; add a plain max with the same comparison and an honest message.
Previously only cell-presence was validated: a cell could exist while a given metric had no spec, and test_metric (spec is None -> return) would silently report a green record-only row even under enforce_thresholds=true. A new perf metric was thus unvalidated by default. Declare GATED_METRICS beside CLIENT_METRICS -- the perf+health subset that must assert (throughput, mean+p99 latency, success_rate/failed) -- and extend _check_thresholds_cover_sweep to require a spec for every gated metric in every present cell, reusing the same enforce-vs-warn path. A new metric is record-only until added to the set; once gated, the loader forces a spec in every cell before the suite can run green. Inputs, totals, and derived diagnostics stay record-only by design.
The image was declared twice: top-level image.tag (live -- conftest copied it onto the container block) and container.image (dead -- overwritten by that copy). The duplicate forced a top-level image block whose remote field was unused and whose tag silently shadowed container.image. Make container.image the single source: drop the top-level ImageSpec block from the generic BaseVariantConfig, drop the conftest overwrite so the merged container.image is used as-is, and remove the now-schemaless image block from the example config.
Expand GATED_METRICS from the mean+p99 subset to every emitted latency quantile (mean/median/p90/p95/p99) for ttft, tpot, itl, and e2el -- itl omits p90 as CLIENT_METRICS has no producer for it. Throughput and success_rate/failed health are unchanged. Inputs, totals, secondary throughputs, and derived diagnostics stay record-only. The example threshold file gains a placeholder spec for each newly gated metric (23 total) so the loader gated-coverage check passes.
- utils/AGENTS.md: drop top-level image (now container.image); add max verdict kind - inference/utils/AGENTS.md: document GATED_METRICS contract + dual-axis coverage check - building-a-cvs-test-suite.md: container.image, serve_args map, max kind, GATED_METRICS - dtni-dev-guide.md: SUPERSEDED banner pointing to building guide + AGENTS.md
Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.
Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.
Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.
Wheels often omit vllm/benchmarks; resolve the driver via eval exports, run python -m vllm.entrypoints.cli.main bench serve when needed, and fail fast on missing-script log patterns in InferenceMax and base polling.
vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.
Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.
…st polling Gate benchmark success on Failed requests only after the summary is present; tail more client log lines for InferenceMax. Variant and benchmark_params accept bench_max_failed_requests (default 0 remains strict for CI).
7e9bc30 to
730b408
Compare
Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.
…ase 2) Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.
…Phase 2) Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.
…ion 1 (Phase 2) Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.
Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.
Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.
…_args (Phase 3) MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.
… (Phase 3) Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.
…ase 3) Document Python serve, client.* metrics, and expanded lifecycle test stages.
Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.
InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.
…(Phase 5) Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.
Verify stock results artifact maps to client.* metrics via FakeOrch.
| if not orch.verify_containers_running(name): | ||
| lifecycle.failed = True | ||
| pytest.fail(f"container {name} not running after setup_containers()") | ||
| time.sleep(30) |
There was a problem hiding this comment.
will you need this gap of 30 secs after the testcase?
|
|
||
|
|
||
| def _du_bytes(orch, path): | ||
| out = orch.exec(f"bash -c {shlex.quote(f'du -sb {shlex.quote(path)} 2>/dev/null | cut -f1')}") |
There was a problem hiding this comment.
The 2>/dev/null silently swallows errors, and the function returns 0 on any failure. Since test_model_fetch uses the return value of 0 to detect "no model present," a network failure, permission error, or du not being available would be indistinguishable from a genuinely missing model
| lifecycle.record(request.node.nodeid, "server_ready", time.monotonic() - t) | ||
| t_client = time.monotonic() | ||
| job.run_client() | ||
| job.wait_client_complete() |
There was a problem hiding this comment.
better to have a timeout here. If the inference client hangs, this will block indefinitely with no recovery path.
| return fp.read().strip() | ||
|
|
||
|
|
||
| @pytest.fixture(scope="session") |
There was a problem hiding this comment.
will the scope be session or module?
anujmittal-amd
left a comment
There was a problem hiding this comment.
Can we add your changed under new folder structure? Inferencemax has been depreciated, you should create folder under inference as atom or reuse the vllm/sglang structure and add atom in that location. Lets discuss this.
Same comment for single/distributed; do we need a separate folder for single/multi?
dtni folder should be renamed based on vLLM structure.
Once you are ready for final merge, please add your end to end test results to the PR along with original inferenceMax test run to identify there are no regressions.
Thanks
@anujmittal-amd I addressed the structural feedback on the IX-atom branch, this is the PR for it: #238 |
Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.
Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.
Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.
Summary
vllm(ROCm images wherepython3is not the vLLM interpreter), and hardensdocker exec/bash -cquoting.Test plan
cvs run inferencemax_single(or your usual InferenceMax path) on MI300x + current vLLM container.