Hnimrama/ix atom by hnimra-amd · Pull Request #238 · ROCm/cvs

hnimra-amd · 2026-06-23T23:27:31Z

PR #238 — InferenceX ATOM W1 (MI300X perf gates)

PR: #238
Head: hnimrama/IX-atom → Base: dev/dtni

Motivation

CVS needs a first-class InferenceX ATOM automation path aligned with the DTNI Validation Tracker (IX ATOM) — not the legacy inferencemax_single uplift. This PR:

Introduces the inferencex_atom_single suite with an ATOM-native driver (atom.entrypoints.openai_server + atom.benchmarks.benchmark_serving).
Ships W1 DeepSeek R1 FP8 variant configs for MI300X and MI355X (perf, smoke, MTP3) with calibrated / CI-seeded thresholds.
Closes M1 / Phase A on MI300X: W1 perf with enforce_thresholds: true after lab confirmation.
Documents the IX-atom roadmap (W1–W18, accuracy, metric tiers, parity frameworks) in plans/inferencex-atom-cvs-automation-plan.md.

MI355X variant configs ship with enforce_thresholds: false until hardware is available — they do not block merge or MI300X milestone work.

Base branch: dev/dtni.

Technical Details

Suite and orchestration

Rename inferencemax_single → inferencex_atom_single; legacy inferencemax/ configs removed.
New InferenceXAtomJob (inferencex_atom_orch.py):
- params.driver=atom → ATOM server + benchmark_serving; JSON artifacts parsed via to_client_metrics.
- params.driver=vllm retained for interim GPT-OSS uplift variants only.
Canonical config layout: flat sibling pairs under cvs/input/config_file/inference/inferencex_atom_single/:
- {gpu}_inferencex-atom-single_{model}_{precision}[_{mode}]_config.json
- {gpu}_inferencex-atom-single_{model}_{precision}[_{mode}]_threshold.json
schema_version: 1, typed loader, and ix_recipes.json recipe pins (dsr1-fp8-mi300x-atom, etc.).
Cluster examples: mi300x_atom_single.json, mi355x_atom_single.json; container names pinned (inferencex_atom_mi300x / inferencex_atom_mi355x).
Shared suite helpers: inference_suite_lifecycle.py, inference_suite_results_table.py.

W1 variants shipped

Variant stem	Arch	`enforce_thresholds`	Notes
`mi300x_inferencex-atom-single_deepseek-r1_fp8_perf`	MI300X	`true`	M1 gate — ISL=OSL=1024, CONC=128/256, 1000 prompts
`mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke`	MI300X	`false`	128-prompt pre-gate
`mi300x_inferencex-atom-single_deepseek-r1_fp8_mtp3`	MI300X	`false`	MTP3 recipe
`mi355x_inferencex-atom-single_deepseek-r1_fp8_perf`	MI355X	`false`	CI-seeded thresholds (plan Section 4.3)
`mi355x_inferencex-atom-single_deepseek-r1_fp8_mtp3`	MI355X	`false`	CI-seeded

Threshold / metrics plumbing

Tiered gates via test_cell_metrics (METRIC_TIERS: throughput, ttft, tpot, health, record) — one pytest row per tier per sweep cell.
to_client_metrics: derive client.failed when ATOM omits it (num_prompts - completed), then compute client.success_rate; add client.output_tput_per_gpu.
Tpot tier gates p99_tpot_ms (ATOM benchmark_serving emits p99 tails; p95_tpot_ms may be absent even with metric_percentiles: "95,99"). Tier enforcement skips metrics missing from the artifact.
MI300X W1 perf thresholds recalibrated from 2026-06-25 lab run (throughput mins = measured × 0.9; latency maxes = measured × 1.1).

Platform / shared infra (supporting changes)

cvs/lib/utils/ — shared config_loader, verdict, sweep selector.
vllm_single suite refactor to the same lifecycle / metric pattern.
Runbook: cvs/input/config_file/inference/inferencex_atom_single/README.md.

Out of scope (follow-up on `dev/dtni`)

M2 gsm8k accuracy
M3 P1 workloads (W2, W3, W13, W17)
M4 inferencex_atom_vllm_single / inferencex_atom_sglang_single parity frameworks
M5 multi-node (prioritized immediately after M4 parity)

Test Plan

CI / unit (no GPU)

pytest cvs/tests/inference/inferencex_atom/ -q
pytest cvs/lib/inference/unittests/test_inferencex_atom_parsing.py -q
pytest cvs/lib/inference/unittests/test_inferencex_atom_config_loader.py -q
Config loader / sweep selector / orch parse tests pass

Lab — MI300X smoke

Launcher: CTR-SVDT-L005 (10.7.54.167) · GPU node: 10.245.135.75

cd ~/cvs && make install && source .cvs_venv/bin/activate

SMOKE_DIR=~/input/config_file/inference/inferencex_atom_single/smoke
mkdir -p "$SMOKE_DIR"
cvs copy-config inference/inferencex_atom_single/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_config.json \
  --output "$SMOKE_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_config.json"
cvs copy-config inference/inferencex_atom_single/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_threshold.json \
  --output "$SMOKE_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_threshold.json"

TS=$(date +%Y%m%d_%H%M%S)
HTML="$HOME/cvs_results/${TS}_ix-atom-smoke_mi300x.html"
LOG="$HOME/cvs_results/${TS}_ix-atom-smoke_mi300x.log"

cvs run inferencex_atom_single \
  --cluster_file ~/input/cluster_file/mi300x_atom_single.json \
  --config_file "$SMOKE_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_config.json" \
  --html="$HTML" --self-contained-html --log-file="$LOG" -vvv -s

11 passed, 0 failed (~13 min)

Lab — MI300X W1 perf (M1 gate)

cd ~/cvs && make install && source .cvs_venv/bin/activate

PERF_DIR=~/input/config_file/inference/inferencex_atom_single/perf
mkdir -p "$PERF_DIR"
cvs copy-config inference/inferencex_atom_single/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_config.json \
  --output "$PERF_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_config.json"
cvs copy-config inference/inferencex_atom_single/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_threshold.json \
  --output "$PERF_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_threshold.json"

TS=$(date +%Y%m%d_%H%M%S)
HTML="$HOME/cvs_results/${TS}_ix-atom-w1-perf_mi300x.html"
LOG="$HOME/cvs_results/${TS}_ix-atom-w1-perf_mi300x.log"

cvs run inferencex_atom_single \
  --cluster_file ~/input/cluster_file/mi300x_atom_single.json \
  --config_file "$PERF_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_config.json" \
  --html="$HTML" --self-contained-html --log-file="$LOG" -vvv -s

Lifecycle stages pass (container, model fetch, server start, benchmark client, teardown).
Both sweep cells: CONC=128, CONC=256 (ISL=OSL=1024, TP=8); server reused across cells.
All metric tiers pass under enforce_thresholds: true.
HTML report attached to PR.

MI355X

Not required for merge — configs ship; flip enforce_thresholds when hardware is available.

Test Result

MI300X lab — `mi300x_inferencex-atom-single_deepseek-r1_fp8_perf`

Branch: hnimrama/IX-atom @ 07c90a7
Launcher: CTR-SVDT-L005 (10.7.54.167) · GPU node: 10.245.135.75
Install: make install from repo root (.cvs_venv)
Image: rocm/atom-dev:latest
Model: deepseek-ai/DeepSeek-R1-0528 (FP8, TP=8)
Outcome: 17 passed, 0 failed (~22 min)

Cell	`client.output_throughput` (measured)	Threshold (min)	Result
CONC=128	2867 tok/s	2580 tok/s	PASS
CONC=256	4697 tok/s	4227 tok/s	PASS

Cell	TTFT / TPOT gates	Result
CONC=128	mean TTFT 811 ms (max 892); p99 TTFT 6511 ms (max 7162); mean TPOT 42.5 ms; p99 TPOT 46.7 ms (max 51.4)	PASS
CONC=256	mean TTFT 728 ms; p99 TPOT 59.7 ms (max 65.6)	PASS

MI300X lab — smoke

Outcome: 11 passed, 0 failed (~13 min)

Artifacts (attach to PR)

Smoke HTML: 20260625_154318_ix-atom-smoke_mi300x.htmll
Smoke zip: inferencex_atom_single_2026-06-25T155606.zip
W1 perf HTML:[
20260625_173442_ix-atom-w1-perf_mi300x.html
W1 perf zip:
inferencex_atom_single_2026-06-25T175704.zip

Unit tests

CI / local unit suite green on PR branch.

Submission Checklist

PR open: #238
Jira linked: AIMVT-236, AIMVT-244
Base branch is dev/dtni
Lab run used make install from this branch (not a stale site-packages install)
MI300X W1 perf + smoke HTML reports attached
enforce_thresholds: true only on MI300X W1 perf — confirmed in lab
MI355X variants left at enforce_thresholds: false (pending hardware)
Plan doc reviewed for milestone alignment
Reviewer aware M2 (gsm8k), M3 (W2/W3/W13/W17), and M5 (multi-node) are follow-ups on dev/dtni

Stacked PR body covering motivation, technical details, test plan, lab results, and checklist targeting dev/dtni.

Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.

Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.

Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.

Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.

Wheels often omit vllm/benchmarks; resolve the driver via eval exports, run python -m vllm.entrypoints.cli.main bench serve when needed, and fail fast on missing-script log patterns in InferenceMax and base polling.

vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.

Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.

…st polling Gate benchmark success on Failed requests only after the summary is present; tail more client log lines for InferenceMax. Variant and benchmark_params accept bench_max_failed_requests (default 0 remains strict for CI).

Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.

…ase 2) Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.

…Phase 2) Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.

…ion 1 (Phase 2) Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.

Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.

Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.

…_args (Phase 3) MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.

… (Phase 3) Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.

…ase 3) Document Python serve, client.* metrics, and expanded lifecycle test stages.

Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.

InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.

…(Phase 5) Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.

Verify stock results artifact maps to client.* metrics via FakeOrch.

…ngle Adopt InferenceX ATOM as the framework identity while the suite is still internal. Renames the driver, config loader, pytest suite, variant configs, and documentation to inferencex_atom_single.

Add Section 1.2 hardware policy, revise M1/Phase A exit criteria, and update milestone diagrams so MI355X confirmation is optional until nodes are available.

…log. Align branch state and phases with ATOM driver reality, add accuracy test catalog, metric tiers, and platform enhancements; update W1 README lab notes.

…and metrics. Document perf variant modes, workload-specific accuracy tests, inferencex_atom_vllm/sglang parity suites, supplemental and MTP metrics, and CI compare keys.

Re-apply merge resolutions from PR #233 alignment (dual threshold layout, vllm_single imports) that were lost when replaying commits onto e47df5a.

Route inferencemax_repo and framework=inferencemax to inferencex_atom with a warning. InferenceMax host jobs remain a placeholder that points callers at inferencex_atom_single.

Add validate_thresholds_cover_sweep() for reuse by vllm_single and inferencex_atom loaders. Optional gated_metrics parameter allows framework-specific SLO sets.

Rename W1 and GPT-OSS variant stems to {gpu}_inferencex-atom-single_{model}_{precision}[_{mode}] and remove per-variant subfolders. Update README and docs with new copy-config paths and W1 threshold gates for per-GPU throughput and tail latencies.

Add inferencex_atom_parsing with IX-specific GATED_METRICS (per_gpu_throughput, output_tput_per_gpu) without changing vllm_single. Wire InferenceXAtomJob and the suite to atom parsing; default metric_percentiles to 95,99 for p95 TPOT and p99 TTFT.

Clarify that W1 GATED_METRICS and output_tput_per_gpu live in inferencex_atom_parsing so vllm_single stays untouched until parity.

Drop vllm_single comparisons and internal threshold tables; keep naming pattern, variant list, and copy/run commands.

…aits

…oke waits

…ests

…ates

Document per-variant ~/input subdirs to avoid ambiguous threshold discovery, remote launcher vs GPU node prerequisites, and ~/cvs_results output paths.

Elevate scaling to P1 milestone M5 immediately after M4 parity when hardware and suite recipes support nnodes>1; defer MTP+P2 widen to M6.

Gate p99_tpot_ms instead of absent p95_tpot_ms, skip missing tier metrics in actuals, and recalibrate MI300X perf thresholds from the 2026-06-25 lab run.

Replace per-node calibrated gates with conservative throughput floors and loose latency caps so healthy runs pass across lab nodes without recalibration.

atnair-amd

Automated review from five-pass analysis (structure · duplication · unit tests · code quality · live validation run).

Blockers (must fix before merge): false-green on missing config_file · parse_results error paths untested · _client_log_failures untested
Majors: driver default guard · config/threshold structure diverges from vllm · server reuse helpers untested · build_server_cmd suppression untested · _merged_serve_args untested · tier-explosion untested · reuse_server_across_sweep default · no ATOM early-failure detection · wheel not in shared venv
Minors/NITs: see inline comments

No changes are requested without reviewer approval — comments only.

atnair-amd · 2026-06-25T22:33:55Z

+class InferenceXAtomParams(_Forbid):
+    # ``atom`` uses ATOM openai_server + benchmark_serving; ``vllm`` keeps the
+    # interim uplift path (vllm serve + vllm bench serve).
+    driver: Literal["atom", "vllm"] = "vllm"


[MAJOR] driver="vllm" default has no guard — silent misrouting for new ATOM configs

The two gpt-oss-120b configs intentionally omit driver and rely on this default (their serve args are vllm-specific — that is correct). But there is no safeguard for a new config author who adds an ATOM config and forgets the field: they silently get a different binary, different log paths, and a different client — all with exit 0. Suggest adding a Pydantic validator or a conftest-level log warning that fires when framework == "inferencex_atom_single" and driver == "vllm", so the mismatch is visible at run start rather than discovered in logs.

atnair-amd · 2026-06-25T22:33:55Z

@@ -0,0 +1,80 @@
+{


[MAJOR] Config files use ix_recipe_id indirection instead of the flat inline structure used by vllm — and threshold_json is absent

vllm configs are fully self-contained: model id, tensor parallelism, dtype, serve args, sweep, and threshold_json are all directly embedded. The ix-atom configs instead carry a single pointer (ix_recipe_id) and defer atom_args (e.g. -tp 8 --kv_cache_dtype fp8 --trust-remote-code), model id, and bench_extra_args to ix_recipes.json, resolved at load time by apply_ix_recipe. This means you cannot understand what a config will actually do without reading a second file.

Additionally, none of the ix-atom configs populate the threshold_json field (inherited from BaseVariantConfig). They rely on the proximity-glob fallback in substitute_config() which raises ValueError if more than one *threshold.json ever exists in the directory. The vllm pattern is explicit: each config carries "threshold_json": "<path>".

Suggested fix: (1) flatten atom_args and key recipe fields inline into the config (as vllm does with roles.server.serve_args), and (2) add "threshold_json": "<path>" to each config file pointing at its paired threshold.

atnair-amd · 2026-06-25T22:33:55Z

+def orch(cluster_dict, variant_config, lifecycle):
+    container_block = _deep_merge(
+        cluster_dict.get("container", {}),
+        orchestrator_container_from_variant(variant_config),


[NIT] orchestrator_container_from_variant call site has no comment explaining its side-effect

The function now has a docstring (includes server env), but this call site has no annotation. A future author who copies the conftest and swaps back to variant_config.container.model_dump() silently drops server env vars. Suggest adding: # also injects roles.server.env — do not replace with .container.model_dump()

atnair-amd · 2026-06-25T22:33:56Z

+    lifecycle,
+    request,
+):
+    """One pytest row per metric tier per sweep cell (W1 gate batches)."""


[MINOR] test_cell_metrics failure modes are undocumented in the docstring

There are now two pytest.fail paths (lines 185 and 192–195) that are stricter than vllm's silent-return behaviour: (1) no threshold specs exist for a gated tier, and (2) specs exist but none of the expected metrics are present in the actuals. Neither is documented in the docstring, making it hard for future authors to understand the contract. Suggest updating the docstring to describe both failure conditions.

atnair-amd · 2026-06-25T22:33:56Z

+
+
+@pytest.fixture(scope="module")
+def server_session():


[MINOR] server_session fixture has no docstring

This is the only fixture in conftest.py with no vllm analogue and its purpose is non-obvious. Suggest adding: """Tracks the active server session key to allow reuse across sweep cells when reuse_server_across_sweep=true."""

atnair-amd · 2026-06-25T22:33:56Z

+        self.orch.exec(f"mkdir -p {shlex.quote(self.out_dir)}")
+
+    @classmethod
+    def _merged_serve_args(cls, variant):


[MAJOR] _merged_serve_args gpu-memory-util promotion is untested

The promotion of CVS_GPU_MEMORY_UTIL from server_env into gpu-memory-utilization in the serve-args dict (lines 172–174) is a separate code path from orchestrator_container_from_variant and has no test. Please add a test asserting the promotion happens when the key is absent from merged, and is a no-op when already present.

atnair-amd · 2026-06-25T22:33:56Z

+    with open(config_file) as fp:
+        raw = json.load(fp)
+    cases, ids = expand_sweep(raw.get("sweep", {}))
+    if "metric_tier" in metafunc.fixturenames:


[MAJOR] pytest_generate_tests tier-explosion path is untested

The "metric_tier" in metafunc.fixturenames branch (line 84) and the tier × sweep multiplication (lines 85–92) are a separate code path from expand_sweep (which is tested). The len(cases) * len(METRIC_TIER_ORDER) product is not pinned anywhere. Please add a test using a SimpleNamespace fake metafunc (with a recording .parametrize) asserting the correct number of parametrize calls and IDs for a known sweep config.

atnair-amd · 2026-06-25T22:33:56Z

+
+
+def pytest_generate_tests(metafunc):
+    config_file = metafunc.config.getoption("config_file")


[BLOCKER] Missing config_file produces a false-green zero-test run

Lines 78–79: if not config_file or not os.path.isfile(config_file): return causes pytest to collect 0 tests and exit 0 — indistinguishable from a fully passing run. This is a silent failure mode: a typo in --config_file or a missing file will look like success in CI. Please replace the return with pytest.exit("--config_file not found or not specified", returncode=1) or raise pytest.UsageError.

atnair-amd · 2026-06-25T22:33:56Z

+
+
+def _reuse_server_flag(params) -> bool:
+    raw = str(getattr(params, "reuse_server_across_sweep", "true")).strip().lower()


[MAJOR] _reuse_server_flag defaults to "true" when reuse_server_across_sweep is absent from config

getattr(params, "reuse_server_across_sweep", "true") — a config that omits the field silently enables server reuse, which is a surprising default (especially for smoke runs). The same default is set at inferencex_atom_config_loader.py line 68. Suggest changing to "false" (safe: always restart) or making the field required in the schema.

atnair-amd · 2026-06-25T22:33:56Z

+        log.info("waiting %ds for server log to materialise", self._precheck_wait)
+        time.sleep(self._precheck_wait)
+
+        if self.driver != "atom":


[MAJOR] No early-failure detection for the ATOM driver during the wait_ready poll loop

Lines 294–298: the EARLY_FAILURE_RE log scan is gated on if self.driver != "atom" — the atom driver skips it entirely. During the smoke run the server was silent for 29 minutes with 28 consecutive health-poll NO responses; a crashed server would be indistinguishable from a loading one until timeout. start_server does check EARLY_FAILURE_RE on the immediate nohup shell output (line 257) but that only catches shell-level launch failures, not server-side startup failures written to the log file during the load window. Suggest adding EARLY_FAILURE_RE pattern matching against the ATOM server log inside the wait_ready poll loop, analogous to the vllm path.

amd-droy · 2026-06-26T14:20:54Z

+        '_cvs_run_bench() { case "$BENCH_KIND" in cli_module) '
+        '"$BENCH_PY" -m vllm.entrypoints.cli.main bench serve "$@" '
+        ';; *) "$BENCH_PY" "$BENCH_SCRIPT" "$@" ;; esac; }; '
+        "export BENCH_PY BENCH_KIND BENCH_SCRIPT"


This function returns a long shell string containing embedded Python and may create several problems.

RECOMMENDED

PYTHON_CANDIDATES = ["python3.13", "python3.12", "python3.11", "python3.10", "python3"]

def _build_interpreter_loop(py_q: str, candidates: list[str]) -> str:
"""Build the shell interpreter-discovery loop separately for testability."""
candidate_str = " ".join(candidates)
return (
f"for _cvs_py in {candidate_str}; do\n"
f" _cvs_eval=$($_cvs_py -c {py_q}) || continue\n"
...
)

amd-droy · 2026-06-26T14:37:17Z

+        _inserted = True
+if not _inserted:
+    CLIENT_METRICS.append(_METRIC_INSERT)
+CLIENT_METRIC_UNITS = dict(CLIENT_METRICS)


Alternative:

idx = next((i for i, (n, _) in enumerate(_VLLM_CLIENT_METRICS) if n == "per_gpu_throughput"), None)
CLIENT_METRICS = list(_VLLM_CLIENT_METRICS)
CLIENT_METRICS.insert((idx + 1) if idx is not None else len(CLIENT_METRICS), _METRIC_INSERT)

amd-droy · 2026-06-26T14:41:30Z

+            saw_marker = True
+            continue
+        if text.isdigit():
+            total = max(total, int(text))


Taking max across hosts means the return value is the largest single-host measurement, not the total bytes across the distributed model cache. I think it should be
if text.isdigit():
total += int(text) # sum shards, not max shard

amd-droy · 2026-06-26T14:43:24Z

+    if not rows:
+        return
+    try:
+        import pytest_html


you can move this to the top in the import section

hnimra-amd marked this pull request as draft June 23, 2026 23:28

hnimra-amd mentioned this pull request Jun 24, 2026

Hnimrama/inferencemax uplift restore #229

Open

1 task

hnimra-amd marked this pull request as ready for review June 24, 2026 16:28

hnimra-amd requested review from amd-droy, anujmittal-amd and atnair-amd June 24, 2026 16:28

hnimra-amd added a commit that referenced this pull request Jun 24, 2026

docs: add PR #238 description draft for IX-atom M1 merge.

773ed14

Stacked PR body covering motivation, technical details, test plan, lab results, and checklist targeting dev/dtni.

hnimra-amd force-pushed the hnimrama/IX-atom branch from d62122b to e17ab7f Compare June 25, 2026 17:30

hnimra-amd added 22 commits June 25, 2026 13:09

Restore InferenceMax uplift reverted by #228

34487f9

Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.

fix(inference): run vLLM bench client with vLLM interpreter

dfe6ac4

Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.

fix(dtni): broaden vLLM benchmark script discovery

870eac2

Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.

fix(inference): harden InferenceMax server startup and GPU mem env

e933d5e

Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.

fix(dtni): clamp bench random-range to max_model_length

8e9ba40

vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.

fix(inference): extend InferenceMax bench client poll budget

e2f6aeb

Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.

feat(inference): add typed InferenceMax config loader (Phase 1)

f32ba0a

Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.

feat(inference): migrate InferenceMax configs to schema_version 1 (Ph…

c1c479b

…ase 2) Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.

test(inference): wire inferencemax_single to typed config and sweep (…

903c071

…Phase 2) Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.

docs(inference): update InferenceMax config reference for schema_vers…

8975eb8

…ion 1 (Phase 2) Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.

docs: fix stale dtni.config_loader references (Phase 1 tail)

0269579

Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.

feat(inference): rewrite InferenceMaxJob like VllmJob (Phase 3)

35542b9

Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.

feat(inference): move InferenceMax server flags to roles.server.serve…

73171d6

…_args (Phase 3) MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.

test(inference): align inferencemax_single suite with VllmJob pattern…

c310154

… (Phase 3) Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.

docs(inference): update InferenceMax reference for Phase 3 driver (Ph…

d036290

…ase 3) Document Python serve, client.* metrics, and expanded lifecycle test stages.

chore(inference): remove unused inferencemax_host_scripts (Phase 5)

5ed8583

Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.

docs: clarify vllm_benchmark_scripts are legacy-only (Phase 5)

db72ca8

InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.

docs(inference): rewrite InferenceMax reference for schema_version 1 …

13b8f10

…(Phase 5) Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.

test(inference): add InferenceMaxJob parse_results unit test (Phase 5)

414044c

Verify stock results artifact maps to client.* metrics via FakeOrch.

refactor(inference): rename inferencemax_single to inferencex_atom_si…

9b00085

…ngle Adopt InferenceX ATOM as the framework identity while the suite is still internal. Renames the driver, config loader, pytest suite, variant configs, and documentation to inferencex_atom_single.

hnimra-amd added 25 commits June 25, 2026 13:09

docs(plan): MI355X lab pending without blocking MI300X spine.

a2e2537

Add Section 1.2 hardware policy, revise M1/Phase A exit criteria, and update milestone diagrams so MI355X confirmation is optional until nodes are available.

docs(plan): refresh IX-atom plan with accuracy, metrics, and CVS back…

5d9f630

…log. Align branch state and phases with ATOM driver reality, add accuracy test catalog, metric tiers, and platform enhancements; update W1 README lab notes.

docs(plan): add Section 12 coverage for variants, parity frameworks, …

eef5da8

…and metrics. Document perf variant modes, workload-specific accuracy tests, inferencex_atom_vllm/sglang parity suites, supplemental and MTP metrics, and CI compare keys.

align: restore config_loader threshold discovery after dev/dtni rebase

f1a3c9b

Re-apply merge resolutions from PR #233 alignment (dual threshold layout, vllm_single imports) that were lost when replaying commits onto e47df5a.

refactor(inference): deprecate InferenceMax legacy factory paths

fcd68b1

Route inferencemax_repo and framework=inferencemax to inferencex_atom with a warning. InferenceMax host jobs remain a placeholder that points callers at inferencex_atom_single.

refactor(inference): extract shared threshold sweep validation

56a1808

Add validate_thresholds_cover_sweep() for reuse by vllm_single and inferencex_atom loaders. Optional gated_metrics parameter allows framework-specific SLO sets.

docs(inferencex): document ATOM-specific parsing vs vllm_single

e2a833c

Clarify that W1 GATED_METRICS and output_tput_per_gpu live in inferencex_atom_parsing so vllm_single stays untouched until parity.

docs(inferencex): simplify variant README for lab users

2e50943

Drop vllm_single comparisons and internal threshold tables; keep naming pattern, variant list, and copy/run commands.

chore(utils): silence cluster placeholder resolution logs

a4df2e2

test(config): align config_loader tests with sibling threshold discovery

1ebfda1

feat(inferencex): add W1 metric tiers for tiered threshold gates

5b22b78

feat(inferencex): reuse server across sweep cells and config-driven w…

d37b81c

…aits

feat(inferencex): replace per-metric tests with tiered test_cell_metrics

721df2f

chore(inferencex): enable server reuse on perf configs and shorter sm…

c1c4f80

…oke waits

fix(inferencex): align cluster container names with variant configs

af8435b

fix(inferencex): tighten W1 perf health gates when enforcing thresholds

e3d11d9

test(inferencex): add GATED_METRICS parity and health gate coverage t…

c5ecfe3

…ests

test(inferencex): add GATED_METRICS parity and health gate coverage t…

e310d6e

…ests

docs(inferencex): update plan and README for flat layout and tiered g…

47f8eb0

…ates

docs(inferencex): update plan and README for flat layout and tiered g…

c334aae

…ates

chore(inferencex): trim expand_sweep docstring

d742557

docs(inferencex): clarify lab layout, launcher host, and results paths

9d4f33a

Document per-variant ~/input subdirs to avoid ambiguous threshold discovery, remote launcher vs GPU node prerequisites, and ~/cvs_results output paths.

docs(plan): prioritize multi-node as M5 after framework parity

73dfb65

Elevate scaling to P1 milestone M5 immediately after M4 parity when hardware and suite recipes support nnodes>1; defer MTP+P2 widen to M6.

hnimra-amd force-pushed the hnimrama/IX-atom branch from 544f8ad to 73dfb65 Compare June 25, 2026 20:15

hnimra-amd added 2 commits June 25, 2026 13:29

fix(inferencex): align W1 tpot tier with ATOM bench output

07c90a7

Gate p99_tpot_ms instead of absent p95_tpot_ms, skip missing tier metrics in actuals, and recalibrate MI300X perf thresholds from the 2026-06-25 lab run.

chore(inferencex): use portable W1 perf thresholds on MI300X

0663460

Replace per-node calibrated gates with conservative throughput floors and loose latency caps so healthy runs pass across lab nodes without recalibration.

atnair-amd reviewed Jun 25, 2026

View reviewed changes

amd-droy reviewed Jun 26, 2026

View reviewed changes



		def pytest_generate_tests(metafunc):
		config_file = metafunc.config.getoption("config_file")



		def _reuse_server_flag(params) -> bool:
		raw = str(getattr(params, "reuse_server_across_sweep", "true")).strip().lower()

Uh oh!

Conversation

hnimra-amd commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR #238 — InferenceX ATOM W1 (MI300X perf gates)

Motivation

Technical Details

Suite and orchestration

W1 variants shipped

Threshold / metrics plumbing

Platform / shared infra (supporting changes)

Out of scope (follow-up on dev/dtni)

Test Plan

CI / unit (no GPU)

Lab — MI300X smoke

Lab — MI300X W1 perf (M1 gate)

MI355X

Test Result

MI300X lab — mi300x_inferencex-atom-single_deepseek-r1_fp8_perf

MI300X lab — smoke

Artifacts (attach to PR)

Unit tests

Submission Checklist

Uh oh!

atnair-amd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

RECOMMENDED

Uh oh!

Choose a reason for hiding this comment

Alternative:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hnimra-amd commented Jun 23, 2026 •

edited

Loading

Out of scope (follow-up on `dev/dtni`)

MI300X lab — `mi300x_inferencex-atom-single_deepseek-r1_fp8_perf`