Skip to content

Hnimrama/inferencemax uplift restore#229

Open
hnimra-amd wants to merge 46 commits into
dev/dtnifrom
hnimrama/inferencemax-uplift-restore
Open

Hnimrama/inferencemax uplift restore#229
hnimra-amd wants to merge 46 commits into
dev/dtnifrom
hnimrama/inferencemax-uplift-restore

Conversation

@hnimra-amd

Copy link
Copy Markdown

Summary

Test plan

  • cvs run inferencemax_single (or your usual InferenceMax path) on MI300x + current vLLM container.

atnair-amd and others added 27 commits June 18, 2026 21:07
…inery

Rename cvs/lib/dtni to cvs/lib/utils ("utils" says what it is: pure
functions any lib can call; "dtni" was a leftover project codename).
verdict.py moves unchanged.

config_loader.py is trimmed to the framework-agnostic half: the
paths/model/image/container schema, the 3-pass placeholder substitution,
the enforce_thresholds gate on a new BaseVariantConfig, and a
substitute_config() helper (file read + substitution + sibling-threshold
discovery). The inference-only schema moves to a sibling module in a
later commit.
The client.* metric parser (to_client_metrics + CLIENT_METRICS surface)
is inference-specific and should not sit in the shared utils dir. Move it
under a new cvs/lib/inference/utils package. Content unchanged.
The inference half of the old config_loader: GoodputSlo, SeqCombo, Sweep,
Params, Roles, and VariantConfig(BaseVariantConfig) with cell_key and the
threshold-coverage check. load_variant() delegates the file read and
placeholder substitution to the generic substitute_config().

Replaces the sequence_combinations x concurrency_levels cartesian with a
named-combo + explicit runs[] selector: each run is a {combo, concurrency}
pair, so the config enumerates exactly the cells to run (no NxM explosion).
A model_validator rejects duplicate combo names and runs referencing an
unknown combo at load time.
…rsing

build_server_cmd/start_server assemble a `vllm serve` arg list in Python
(mirroring run_client) instead of cloning and running an external .sh, so
a run needs no hand-staged script. --max-model-len is derived per cell
from isl/osl/random_range_ratio so any sweep change stays self-consistent.

parse_results reads the stock extensionless `results` JSON artifact that
`vllm bench serve` writes to --result-dir and delegates namespacing +
derived-metric math to vllm_parsing.to_client_metrics, replacing the
brittle console-log regex table. Missing/empty/unparseable artifacts
hard-fail the cell rather than recording a silently-green empty row.

Adds the optional --goodput SLO gate (per-cell, omitted when no SLO) and
threads goodput_slo through run().
pytest_generate_tests now drives parametrization from the named-combo +
runs[] selector instead of the cartesian. test_vllm_inference only runs
the benchmark and stashes results; the verdict moves into a new
test_metric (one pytest test = one HTML row per metric per cell), with
inline Value/Unit columns added via pytest_html hooks in conftest.

test_setup_sshd gates its 2224 probe on len(orch.hosts) > 1, mirroring
the single-node orchestrator guard: single-node runs skip the in-container
sshd (it exists only for inter-node MPI) and must not probe for it.

Import paths follow the dtni -> utils / inference.utils moves.
…luster file

Rename the config/threshold pair to {model}_{precision}_{config|threshold}
.json and convert the sweep to the named-combo + runs[] selector. Pin the
image to rocm/vllm-dev:nightly (the previously pinned :nightly-sshd tag
does not exist on Docker Hub, and single-node runs skip in-container sshd).

Delete cvs/input/cluster_file/mi300x_vllm_single.json: a cluster file only
needs node IP + user/key/orchestrator; the variant config supplies the
container block, so the bespoke per-suite cluster file is redundant.
…uards

Cover to_client_metrics purity + derived metrics, the named-combo/runs
sweep selector (expansion, unknown-combo and duplicate-name rejection),
the run_client goodput/metric-percentiles flags, table-cell rendering, and
the verdict None-guards. Adds JSON fixtures for the stock results artifact.
Add a human reference guide (plans/building-a-cvs-test-suite.md) that walks
the six-layer suite architecture using vllm_single as the worked example:
the generic <-> framework config seam, the named-combo + runs[] sweep
selector, the self-contained Python-built server cmd, lifecycle-as-tests,
and a checklist for authoring a new inference or training suite.

Add per-package AGENTS.md docs naming the public entry points, the seam,
and the non-obvious gotchas:
  - cvs/lib/utils: substitute_config / BaseVariantConfig / evaluate_all,
    the 3-pass placeholder order, sibling-glob threshold discovery,
    parent-first validator ordering.
  - cvs/lib/inference/utils: load_variant / to_client_metrics / CLIENT_METRICS,
    the cell_key single-source-of-truth, the coverage check that prevents a
    silent green, and the validators mirrored in pytest_generate_tests.
… rules

Document the dtni -> utils rename and the shared (cvs/lib/utils) vs
domain-specific (cvs/lib/inference/utils) split: what lives where, the rule
for placing a new helper, and the directory map. Note training is not yet
ported and this guide is the blueprint for that port.

Remove the manual --- horizontal rules between sections: heading levels
already render their own bottom border, so the extra rules produced a
double-underline. Minor prose/format cleanups.
The threshold file carried placeholder CONC=64/128/256 entries while the
sweep's only run is concurrency 16, so cell_key() matched no threshold.
The mismatch was masked by enforce_thresholds=false (warned, not raised)
and would have failed load the instant enforcement was flipped on. Re-key
to the single CONC=16 cell the runs selector actually enumerates.
MODEL/ISL/OSL/MAX_MODEL_LEN/RANDOM_RANGE_RATIO/TP/CONC/PORT were exported
into /tmp/server_env_script.sh but read by nothing after the .sh->Python
server command refactor -- both _server_argv and run_client pass these as
explicit flags. Keep only the env the vllm process actually consumes
(HF token, HF cache pin, AITER flags). Also drops the second
_derive_max_model_len call that fed the dead MAX_MODEL_LEN export.
_server_argv hard-coded --kv-cache-dtype fp8, baking a per-model property
into the shared orchestrator -- a non-fp8-KV model dropped into the suite
would be served wrong with no config recourse (extra_serve_args can only
add, so an override would pass the flag twice). Declare it in the W1
config's roles.server.extra_serve_args instead; the driver stays
model-agnostic and 'new model = new config' holds.
…nd collection

pytest_generate_tests hand-reimplemented the duplicate-name and
unknown-run.combo checks that Sweep._check_runs_reference_known_combos
already enforces, with divergent semantics (first-failure raise vs
all-at-once). Extract validate_sweep_selector() as the single home and
call it from both the typed validator (load time) and the collection-time
raw-JSON path so the rule can't drift.
out_dir was fixed per job, so a multi-cell sweep would overwrite each
cell's `results` and client.log, and parse_results could cat a prior
cell's stale artifact if the current cell's client failed to write one.
Key it by cell. Latent today (the shipped sweep has one cell).
run_client accepted goodput_slo as either a dict (.get) or an object
(getattr) via a per-key hasattr branch, but the only production caller
passes a raw dict -- the object path existed solely for a unit test, and
the dual path meant the typed GoodputSlo's validation never reached the
command builder. Consume the dict only and drop the object-form test.
goodput_slo_unused was never read (goodput is threaded through _make_job).
The committed vllm_single example config carried a personal hf-token path
and a concrete image tag (duplicated in image.tag and container.image).
Replace all three with <changeme> so the file is a template a new user must
fill in -- it still loads (collection works) and only fails at run time when
an unedited value is read, which is the intended signal.
The per-model server knobs were a flat [flag, value, flag, value] list
(roles.server.extra_serve_args), which reads poorly. Replace with a
roles.server.serve_args {flag: value} map (flag without the leading --):
a scalar renders --flag value, True a bare --flag, a list the flag once per
element -- so it stays readable while still covering vllm bare/repeatable
flags. _server_argv flattens the map via a new _flatten_serve_args helper;
the derived flags (tp/max-model-len/port) stay code-built.

Also repoints a stale unit test that asserted MAX_MODEL_LEN in the env
script (it moved to the --max-model-len flag in an earlier commit) to assert
against the server argv instead.
The only ceiling kind was max_ms, whose message hard-codes ms. A count
metric like client.failed needs an upper bound without the unit lie; add
a plain max with the same comparison and an honest message.
Previously only cell-presence was validated: a cell could exist while a
given metric had no spec, and test_metric (spec is None -> return) would
silently report a green record-only row even under enforce_thresholds=true.
A new perf metric was thus unvalidated by default.

Declare GATED_METRICS beside CLIENT_METRICS -- the perf+health subset that
must assert (throughput, mean+p99 latency, success_rate/failed) -- and
extend _check_thresholds_cover_sweep to require a spec for every gated
metric in every present cell, reusing the same enforce-vs-warn path. A new
metric is record-only until added to the set; once gated, the loader forces
a spec in every cell before the suite can run green. Inputs, totals, and
derived diagnostics stay record-only by design.
The image was declared twice: top-level image.tag (live -- conftest copied
it onto the container block) and container.image (dead -- overwritten by that
copy). The duplicate forced a top-level image block whose remote field was
unused and whose tag silently shadowed container.image.

Make container.image the single source: drop the top-level ImageSpec block
from the generic BaseVariantConfig, drop the conftest overwrite so the merged
container.image is used as-is, and remove the now-schemaless image block from
the example config.
Expand GATED_METRICS from the mean+p99 subset to every emitted latency
quantile (mean/median/p90/p95/p99) for ttft, tpot, itl, and e2el -- itl
omits p90 as CLIENT_METRICS has no producer for it. Throughput and
success_rate/failed health are unchanged. Inputs, totals, secondary
throughputs, and derived diagnostics stay record-only.

The example threshold file gains a placeholder spec for each newly gated
metric (23 total) so the loader gated-coverage check passes.
- utils/AGENTS.md: drop top-level image (now container.image); add max verdict kind
- inference/utils/AGENTS.md: document GATED_METRICS contract + dual-axis coverage check
- building-a-cvs-test-suite.md: container.image, serve_args map, max kind, GATED_METRICS
- dtni-dev-guide.md: SUPERSEDED banner pointing to building guide + AGENTS.md
Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.
Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.
Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.
Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.
Wheels often omit vllm/benchmarks; resolve the driver via eval exports, run python -m vllm.entrypoints.cli.main bench serve when needed, and fail fast on missing-script log patterns in InferenceMax and base polling.
vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.
Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.
…st polling

Gate benchmark success on Failed requests only after the summary is present;
tail more client log lines for InferenceMax. Variant and benchmark_params accept
bench_max_failed_requests (default 0 remains strict for CI).
@hnimra-amd hnimra-amd force-pushed the hnimrama/inferencemax-uplift-restore branch from 7e9bc30 to 730b408 Compare June 23, 2026 18:36
Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.
…ase 2)

Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.
…Phase 2)

Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.
…ion 1 (Phase 2)

Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.
Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.
Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.
…_args (Phase 3)

MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.
… (Phase 3)

Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.
…ase 3)

Document Python serve, client.* metrics, and expanded lifecycle test stages.
Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.
InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.
…(Phase 5)

Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.
Verify stock results artifact maps to client.* metrics via FakeOrch.
if not orch.verify_containers_running(name):
lifecycle.failed = True
pytest.fail(f"container {name} not running after setup_containers()")
time.sleep(30)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will you need this gap of 30 secs after the testcase?



def _du_bytes(orch, path):
out = orch.exec(f"bash -c {shlex.quote(f'du -sb {shlex.quote(path)} 2>/dev/null | cut -f1')}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2>/dev/null silently swallows errors, and the function returns 0 on any failure. Since test_model_fetch uses the return value of 0 to detect "no model present," a network failure, permission error, or du not being available would be indistinguishable from a genuinely missing model

lifecycle.record(request.node.nodeid, "server_ready", time.monotonic() - t)
t_client = time.monotonic()
job.run_client()
job.wait_client_complete()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to have a timeout here. If the inference client hangs, this will block indefinitely with no recovery path.

return fp.read().strip()


@pytest.fixture(scope="session")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the scope be session or module?

@anujmittal-amd anujmittal-amd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add your changed under new folder structure? Inferencemax has been depreciated, you should create folder under inference as atom or reuse the vllm/sglang structure and add atom in that location. Lets discuss this.
Same comment for single/distributed; do we need a separate folder for single/multi?
dtni folder should be renamed based on vLLM structure.

Once you are ready for final merge, please add your end to end test results to the PR along with original inferenceMax test run to identify there are no regressions.

Thanks

@hnimra-amd

Copy link
Copy Markdown
Author

Can we add your changed under new folder structure? Inferencemax has been depreciated, you should create folder under inference as atom or reuse the vllm/sglang structure and add atom in that location. Lets discuss this. Same comment for single/distributed; do we need a separate folder for single/multi? dtni folder should be renamed based on vLLM structure.

Once you are ready for final merge, please add your end to end test results to the PR along with original inferenceMax test run to identify there are no regressions.

Thanks

@anujmittal-amd I addressed the structural feedback on the IX-atom branch, this is the PR for it: #238
and it builds on the current PR

hnimra-amd added a commit that referenced this pull request Jun 24, 2026
Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.
hnimra-amd added a commit that referenced this pull request Jun 25, 2026
Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.
hnimra-amd added a commit that referenced this pull request Jun 25, 2026
Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants