Skip to content

Hnimrama/ix atom#238

Open
hnimra-amd wants to merge 60 commits into
dev/dtnifrom
hnimrama/IX-atom
Open

Hnimrama/ix atom#238
hnimra-amd wants to merge 60 commits into
dev/dtnifrom
hnimrama/IX-atom

Conversation

@hnimra-amd

@hnimra-amd hnimra-amd commented Jun 23, 2026

Copy link
Copy Markdown

PR #238 — InferenceX ATOM W1 (MI300X perf gates)

PR: #238
Head: hnimrama/IX-atomBase: dev/dtni

Jira: AIMVT-236 · AIMVT-244


Motivation

CVS needs a first-class InferenceX ATOM automation path aligned with the DTNI Validation Tracker (IX ATOM) — not the legacy inferencemax_single uplift. This PR:

  • Introduces the inferencex_atom_single suite with an ATOM-native driver (atom.entrypoints.openai_server + atom.benchmarks.benchmark_serving).
  • Ships W1 DeepSeek R1 FP8 variant configs for MI300X and MI355X (perf, smoke, MTP3) with calibrated / CI-seeded thresholds.
  • Closes M1 / Phase A on MI300X: W1 perf with enforce_thresholds: true after lab confirmation.
  • Documents the IX-atom roadmap (W1–W18, accuracy, metric tiers, parity frameworks) in plans/inferencex-atom-cvs-automation-plan.md.

MI355X variant configs ship with enforce_thresholds: false until hardware is available — they do not block merge or MI300X milestone work.

Base branch: dev/dtni.


Technical Details

Suite and orchestration

  • Rename inferencemax_singleinferencex_atom_single; legacy inferencemax/ configs removed.
  • New InferenceXAtomJob (inferencex_atom_orch.py):
    • params.driver=atom → ATOM server + benchmark_serving; JSON artifacts parsed via to_client_metrics.
    • params.driver=vllm retained for interim GPT-OSS uplift variants only.
  • Canonical config layout: flat sibling pairs under cvs/input/config_file/inference/inferencex_atom_single/:
    • {gpu}_inferencex-atom-single_{model}_{precision}[_{mode}]_config.json
    • {gpu}_inferencex-atom-single_{model}_{precision}[_{mode}]_threshold.json
  • schema_version: 1, typed loader, and ix_recipes.json recipe pins (dsr1-fp8-mi300x-atom, etc.).
  • Cluster examples: mi300x_atom_single.json, mi355x_atom_single.json; container names pinned (inferencex_atom_mi300x / inferencex_atom_mi355x).
  • Shared suite helpers: inference_suite_lifecycle.py, inference_suite_results_table.py.

W1 variants shipped

Variant stem Arch enforce_thresholds Notes
mi300x_inferencex-atom-single_deepseek-r1_fp8_perf MI300X true M1 gate — ISL=OSL=1024, CONC=128/256, 1000 prompts
mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke MI300X false 128-prompt pre-gate
mi300x_inferencex-atom-single_deepseek-r1_fp8_mtp3 MI300X false MTP3 recipe
mi355x_inferencex-atom-single_deepseek-r1_fp8_perf MI355X false CI-seeded thresholds (plan Section 4.3)
mi355x_inferencex-atom-single_deepseek-r1_fp8_mtp3 MI355X false CI-seeded

Threshold / metrics plumbing

  • Tiered gates via test_cell_metrics (METRIC_TIERS: throughput, ttft, tpot, health, record) — one pytest row per tier per sweep cell.
  • to_client_metrics: derive client.failed when ATOM omits it (num_prompts - completed), then compute client.success_rate; add client.output_tput_per_gpu.
  • Tpot tier gates p99_tpot_ms (ATOM benchmark_serving emits p99 tails; p95_tpot_ms may be absent even with metric_percentiles: "95,99"). Tier enforcement skips metrics missing from the artifact.
  • MI300X W1 perf thresholds recalibrated from 2026-06-25 lab run (throughput mins = measured × 0.9; latency maxes = measured × 1.1).

Platform / shared infra (supporting changes)

  • cvs/lib/utils/ — shared config_loader, verdict, sweep selector.
  • vllm_single suite refactor to the same lifecycle / metric pattern.
  • Runbook: cvs/input/config_file/inference/inferencex_atom_single/README.md.

Out of scope (follow-up on dev/dtni)

  • M2 gsm8k accuracy
  • M3 P1 workloads (W2, W3, W13, W17)
  • M4 inferencex_atom_vllm_single / inferencex_atom_sglang_single parity frameworks
  • M5 multi-node (prioritized immediately after M4 parity)

Test Plan

CI / unit (no GPU)

  • pytest cvs/tests/inference/inferencex_atom/ -q
  • pytest cvs/lib/inference/unittests/test_inferencex_atom_parsing.py -q
  • pytest cvs/lib/inference/unittests/test_inferencex_atom_config_loader.py -q
  • Config loader / sweep selector / orch parse tests pass

Lab — MI300X smoke

Launcher: CTR-SVDT-L005 (10.7.54.167) · GPU node: 10.245.135.75

cd ~/cvs && make install && source .cvs_venv/bin/activate

SMOKE_DIR=~/input/config_file/inference/inferencex_atom_single/smoke
mkdir -p "$SMOKE_DIR"
cvs copy-config inference/inferencex_atom_single/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_config.json \
  --output "$SMOKE_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_config.json"
cvs copy-config inference/inferencex_atom_single/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_threshold.json \
  --output "$SMOKE_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_threshold.json"

TS=$(date +%Y%m%d_%H%M%S)
HTML="$HOME/cvs_results/${TS}_ix-atom-smoke_mi300x.html"
LOG="$HOME/cvs_results/${TS}_ix-atom-smoke_mi300x.log"

cvs run inferencex_atom_single \
  --cluster_file ~/input/cluster_file/mi300x_atom_single.json \
  --config_file "$SMOKE_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_smoke_config.json" \
  --html="$HTML" --self-contained-html --log-file="$LOG" -vvv -s
  • 11 passed, 0 failed (~13 min)

Lab — MI300X W1 perf (M1 gate)

cd ~/cvs && make install && source .cvs_venv/bin/activate

PERF_DIR=~/input/config_file/inference/inferencex_atom_single/perf
mkdir -p "$PERF_DIR"
cvs copy-config inference/inferencex_atom_single/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_config.json \
  --output "$PERF_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_config.json"
cvs copy-config inference/inferencex_atom_single/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_threshold.json \
  --output "$PERF_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_threshold.json"

TS=$(date +%Y%m%d_%H%M%S)
HTML="$HOME/cvs_results/${TS}_ix-atom-w1-perf_mi300x.html"
LOG="$HOME/cvs_results/${TS}_ix-atom-w1-perf_mi300x.log"

cvs run inferencex_atom_single \
  --cluster_file ~/input/cluster_file/mi300x_atom_single.json \
  --config_file "$PERF_DIR/mi300x_inferencex-atom-single_deepseek-r1_fp8_perf_config.json" \
  --html="$HTML" --self-contained-html --log-file="$LOG" -vvv -s
  • Lifecycle stages pass (container, model fetch, server start, benchmark client, teardown).
  • Both sweep cells: CONC=128, CONC=256 (ISL=OSL=1024, TP=8); server reused across cells.
  • All metric tiers pass under enforce_thresholds: true.
  • HTML report attached to PR.

MI355X

  • Not required for merge — configs ship; flip enforce_thresholds when hardware is available.

Test Result

MI300X lab — mi300x_inferencex-atom-single_deepseek-r1_fp8_perf

  • Branch: hnimrama/IX-atom @ 07c90a7
  • Launcher: CTR-SVDT-L005 (10.7.54.167) · GPU node: 10.245.135.75
  • Install: make install from repo root (.cvs_venv)
  • Image: rocm/atom-dev:latest
  • Model: deepseek-ai/DeepSeek-R1-0528 (FP8, TP=8)
  • Outcome: 17 passed, 0 failed (~22 min)
Cell client.output_throughput (measured) Threshold (min) Result
CONC=128 2867 tok/s 2580 tok/s PASS
CONC=256 4697 tok/s 4227 tok/s PASS
Cell TTFT / TPOT gates Result
CONC=128 mean TTFT 811 ms (max 892); p99 TTFT 6511 ms (max 7162); mean TPOT 42.5 ms; p99 TPOT 46.7 ms (max 51.4) PASS
CONC=256 mean TTFT 728 ms; p99 TPOT 59.7 ms (max 65.6) PASS

MI300X lab — smoke

  • Outcome: 11 passed, 0 failed (~13 min)

Artifacts (attach to PR)

Unit tests

  • CI / local unit suite green on PR branch.

Submission Checklist

  • PR open: #238
  • Jira linked: AIMVT-236, AIMVT-244
  • Base branch is dev/dtni
  • Lab run used make install from this branch (not a stale site-packages install)
  • MI300X W1 perf + smoke HTML reports attached
  • enforce_thresholds: true only on MI300X W1 perf — confirmed in lab
  • MI355X variants left at enforce_thresholds: false (pending hardware)
  • Plan doc reviewed for milestone alignment
  • Reviewer aware M2 (gsm8k), M3 (W2/W3/W13/W17), and M5 (multi-node) are follow-ups on dev/dtni

@hnimra-amd hnimra-amd marked this pull request as draft June 23, 2026 23:28
@hnimra-amd hnimra-amd marked this pull request as ready for review June 24, 2026 16:28
hnimra-amd added a commit that referenced this pull request Jun 24, 2026
Stacked PR body covering motivation, technical details, test plan, lab results, and checklist targeting dev/dtni.
Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.
Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.
Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.
Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.
Wheels often omit vllm/benchmarks; resolve the driver via eval exports, run python -m vllm.entrypoints.cli.main bench serve when needed, and fail fast on missing-script log patterns in InferenceMax and base polling.
vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.
Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.
…st polling

Gate benchmark success on Failed requests only after the summary is present;
tail more client log lines for InferenceMax. Variant and benchmark_params accept
bench_max_failed_requests (default 0 remains strict for CI).
Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.
…ase 2)

Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.
…Phase 2)

Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.
…ion 1 (Phase 2)

Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.
Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.
Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.
…_args (Phase 3)

MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.
… (Phase 3)

Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.
…ase 3)

Document Python serve, client.* metrics, and expanded lifecycle test stages.
Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.
InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.
…(Phase 5)

Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.
Verify stock results artifact maps to client.* metrics via FakeOrch.
…ngle

Adopt InferenceX ATOM as the framework identity while the suite is still internal. Renames the driver, config loader, pytest suite, variant configs, and documentation to inferencex_atom_single.
Add Section 1.2 hardware policy, revise M1/Phase A exit criteria, and update milestone diagrams so MI355X confirmation is optional until nodes are available.
…log.

Align branch state and phases with ATOM driver reality, add accuracy test
catalog, metric tiers, and platform enhancements; update W1 README lab notes.
…and metrics.

Document perf variant modes, workload-specific accuracy tests, inferencex_atom_vllm/sglang parity suites, supplemental and MTP metrics, and CI compare keys.
Re-apply merge resolutions from PR #233 alignment (dual threshold layout, vllm_single imports) that were lost when replaying commits onto e47df5a.
Route inferencemax_repo and framework=inferencemax to inferencex_atom with a warning. InferenceMax host jobs remain a placeholder that points callers at inferencex_atom_single.
Add validate_thresholds_cover_sweep() for reuse by vllm_single and inferencex_atom loaders. Optional gated_metrics parameter allows framework-specific SLO sets.
Rename W1 and GPT-OSS variant stems to {gpu}_inferencex-atom-single_{model}_{precision}[_{mode}] and remove per-variant subfolders. Update README and docs with new copy-config paths and W1 threshold gates for per-GPU throughput and tail latencies.
Add inferencex_atom_parsing with IX-specific GATED_METRICS (per_gpu_throughput, output_tput_per_gpu) without changing vllm_single. Wire InferenceXAtomJob and the suite to atom parsing; default metric_percentiles to 95,99 for p95 TPOT and p99 TTFT.
Clarify that W1 GATED_METRICS and output_tput_per_gpu live in inferencex_atom_parsing so vllm_single stays untouched until parity.
Drop vllm_single comparisons and internal threshold tables; keep naming pattern, variant list, and copy/run commands.
Document per-variant ~/input subdirs to avoid ambiguous threshold discovery,
remote launcher vs GPU node prerequisites, and ~/cvs_results output paths.
Elevate scaling to P1 milestone M5 immediately after M4 parity when
hardware and suite recipes support nnodes>1; defer MTP+P2 widen to M6.
Gate p99_tpot_ms instead of absent p95_tpot_ms, skip missing tier metrics in actuals, and recalibrate MI300X perf thresholds from the 2026-06-25 lab run.
Replace per-node calibrated gates with conservative throughput floors and loose latency caps so healthy runs pass across lab nodes without recalibration.

@atnair-amd atnair-amd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review from five-pass analysis (structure · duplication · unit tests · code quality · live validation run).

Blockers (must fix before merge): false-green on missing config_file · parse_results error paths untested · _client_log_failures untested
Majors: driver default guard · config/threshold structure diverges from vllm · server reuse helpers untested · build_server_cmd suppression untested · _merged_serve_args untested · tier-explosion untested · reuse_server_across_sweep default · no ATOM early-failure detection · wheel not in shared venv
Minors/NITs: see inline comments

No changes are requested without reviewer approval — comments only.

class InferenceXAtomParams(_Forbid):
# ``atom`` uses ATOM openai_server + benchmark_serving; ``vllm`` keeps the
# interim uplift path (vllm serve + vllm bench serve).
driver: Literal["atom", "vllm"] = "vllm"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] driver="vllm" default has no guard — silent misrouting for new ATOM configs

The two gpt-oss-120b configs intentionally omit driver and rely on this default (their serve args are vllm-specific — that is correct). But there is no safeguard for a new config author who adds an ATOM config and forgets the field: they silently get a different binary, different log paths, and a different client — all with exit 0. Suggest adding a Pydantic validator or a conftest-level log warning that fires when framework == "inferencex_atom_single" and driver == "vllm", so the mismatch is visible at run start rather than discovered in logs.

@@ -0,0 +1,80 @@
{

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] Config files use ix_recipe_id indirection instead of the flat inline structure used by vllm — and threshold_json is absent

vllm configs are fully self-contained: model id, tensor parallelism, dtype, serve args, sweep, and threshold_json are all directly embedded. The ix-atom configs instead carry a single pointer (ix_recipe_id) and defer atom_args (e.g. -tp 8 --kv_cache_dtype fp8 --trust-remote-code), model id, and bench_extra_args to ix_recipes.json, resolved at load time by apply_ix_recipe. This means you cannot understand what a config will actually do without reading a second file.

Additionally, none of the ix-atom configs populate the threshold_json field (inherited from BaseVariantConfig). They rely on the proximity-glob fallback in substitute_config() which raises ValueError if more than one *threshold.json ever exists in the directory. The vllm pattern is explicit: each config carries "threshold_json": "<path>".

Suggested fix: (1) flatten atom_args and key recipe fields inline into the config (as vllm does with roles.server.serve_args), and (2) add "threshold_json": "<path>" to each config file pointing at its paired threshold.

def orch(cluster_dict, variant_config, lifecycle):
container_block = _deep_merge(
cluster_dict.get("container", {}),
orchestrator_container_from_variant(variant_config),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NIT] orchestrator_container_from_variant call site has no comment explaining its side-effect

The function now has a docstring (includes server env), but this call site has no annotation. A future author who copies the conftest and swaps back to variant_config.container.model_dump() silently drops server env vars. Suggest adding: # also injects roles.server.env — do not replace with .container.model_dump()

lifecycle,
request,
):
"""One pytest row per metric tier per sweep cell (W1 gate batches)."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MINOR] test_cell_metrics failure modes are undocumented in the docstring

There are now two pytest.fail paths (lines 185 and 192–195) that are stricter than vllm's silent-return behaviour: (1) no threshold specs exist for a gated tier, and (2) specs exist but none of the expected metrics are present in the actuals. Neither is documented in the docstring, making it hard for future authors to understand the contract. Suggest updating the docstring to describe both failure conditions.



@pytest.fixture(scope="module")
def server_session():

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MINOR] server_session fixture has no docstring

This is the only fixture in conftest.py with no vllm analogue and its purpose is non-obvious. Suggest adding: """Tracks the active server session key to allow reuse across sweep cells when reuse_server_across_sweep=true."""

self.orch.exec(f"mkdir -p {shlex.quote(self.out_dir)}")

@classmethod
def _merged_serve_args(cls, variant):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] _merged_serve_args gpu-memory-util promotion is untested

The promotion of CVS_GPU_MEMORY_UTIL from server_env into gpu-memory-utilization in the serve-args dict (lines 172–174) is a separate code path from orchestrator_container_from_variant and has no test. Please add a test asserting the promotion happens when the key is absent from merged, and is a no-op when already present.

with open(config_file) as fp:
raw = json.load(fp)
cases, ids = expand_sweep(raw.get("sweep", {}))
if "metric_tier" in metafunc.fixturenames:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] pytest_generate_tests tier-explosion path is untested

The "metric_tier" in metafunc.fixturenames branch (line 84) and the tier × sweep multiplication (lines 85–92) are a separate code path from expand_sweep (which is tested). The len(cases) * len(METRIC_TIER_ORDER) product is not pinned anywhere. Please add a test using a SimpleNamespace fake metafunc (with a recording .parametrize) asserting the correct number of parametrize calls and IDs for a known sweep config.



def pytest_generate_tests(metafunc):
config_file = metafunc.config.getoption("config_file")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BLOCKER] Missing config_file produces a false-green zero-test run

Lines 78–79: if not config_file or not os.path.isfile(config_file): return causes pytest to collect 0 tests and exit 0 — indistinguishable from a fully passing run. This is a silent failure mode: a typo in --config_file or a missing file will look like success in CI. Please replace the return with pytest.exit("--config_file not found or not specified", returncode=1) or raise pytest.UsageError.



def _reuse_server_flag(params) -> bool:
raw = str(getattr(params, "reuse_server_across_sweep", "true")).strip().lower()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] _reuse_server_flag defaults to "true" when reuse_server_across_sweep is absent from config

getattr(params, "reuse_server_across_sweep", "true") — a config that omits the field silently enables server reuse, which is a surprising default (especially for smoke runs). The same default is set at inferencex_atom_config_loader.py line 68. Suggest changing to "false" (safe: always restart) or making the field required in the schema.

log.info("waiting %ds for server log to materialise", self._precheck_wait)
time.sleep(self._precheck_wait)

if self.driver != "atom":

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] No early-failure detection for the ATOM driver during the wait_ready poll loop

Lines 294–298: the EARLY_FAILURE_RE log scan is gated on if self.driver != "atom" — the atom driver skips it entirely. During the smoke run the server was silent for 29 minutes with 28 consecutive health-poll NO responses; a crashed server would be indistinguishable from a loading one until timeout. start_server does check EARLY_FAILURE_RE on the immediate nohup shell output (line 257) but that only catches shell-level launch failures, not server-side startup failures written to the log file during the load window. Suggest adding EARLY_FAILURE_RE pattern matching against the ATOM server log inside the wait_ready poll loop, analogous to the vllm path.

'_cvs_run_bench() { case "$BENCH_KIND" in cli_module) '
'"$BENCH_PY" -m vllm.entrypoints.cli.main bench serve "$@" '
';; *) "$BENCH_PY" "$BENCH_SCRIPT" "$@" ;; esac; }; '
"export BENCH_PY BENCH_KIND BENCH_SCRIPT"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function returns a long shell string containing embedded Python and may create several problems.

RECOMMENDED

PYTHON_CANDIDATES = ["python3.13", "python3.12", "python3.11", "python3.10", "python3"]

def _build_interpreter_loop(py_q: str, candidates: list[str]) -> str:
"""Build the shell interpreter-discovery loop separately for testability."""
candidate_str = " ".join(candidates)
return (
f"for _cvs_py in {candidate_str}; do\n"
f" _cvs_eval=$($_cvs_py -c {py_q}) || continue\n"
...
)

_inserted = True
if not _inserted:
CLIENT_METRICS.append(_METRIC_INSERT)
CLIENT_METRIC_UNITS = dict(CLIENT_METRICS)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative:

idx = next((i for i, (n, _) in enumerate(_VLLM_CLIENT_METRICS) if n == "per_gpu_throughput"), None)
CLIENT_METRICS = list(_VLLM_CLIENT_METRICS)
CLIENT_METRICS.insert((idx + 1) if idx is not None else len(CLIENT_METRICS), _METRIC_INSERT)

saw_marker = True
continue
if text.isdigit():
total = max(total, int(text))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking max across hosts means the return value is the largest single-host measurement, not the total bytes across the distributed model cache. I think it should be
if text.isdigit():
total += int(text) # sum shards, not max shard

if not rows:
return
try:
import pytest_html

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move this to the top in the import section

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants