fix(rlix): gate MILES wake on per-process residual GPU memory (R02-01) by howard989 · Pull Request #5 · rlops/miles

howard989 · 2026-05-25T08:38:06Z

What

Replace the hardcoded free-memory gate in MilesPipeline._wait_for_overlap_engines_offloaded() with a configurable whole-GPU residual used-memory hard gate.

The threshold is controlled by:

MILES_MAX_RESIDUAL_GPU_MEM_GB

Default: 13.0 GiB

Sender side: rlops/miles PR #5.

Why

Per @taoluo review (R02-01): "free memory is gpu-model dependent e.g. 24gb vs 80gb gpu. it would be more robust to check the residual memory allocation."

The base branch compared target_free_gb = 20.0 against nvidia-smi --query-gpu=memory.free, a signal that depends on GPU capacity and is not portable across GPU models. The condition to check before wake_up is "the previous tenant released enough GPU memory", i.e. that residual used memory is low.
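
To make the portability problem concrete, here is a minimal sketch comparing the two gate styles on two card sizes. The helper names, the 24/80 GB capacities, and the 12 GiB residual are illustrative assumptions, not code or measurements from this PR.

```python
# Illustrative only: a free-memory gate depends on GPU capacity,
# while a residual used-memory gate does not.

def free_gate_passes(total_gb: float, used_gb: float, target_free_gb: float = 20.0) -> bool:
    """Old style: require at least target_free_gb of free memory."""
    return (total_gb - used_gb) >= target_free_gb


def residual_gate_passes(used_gb: float, max_residual_gb: float = 13.0) -> bool:
    """New style: require at most max_residual_gb of used memory."""
    return used_gb <= max_residual_gb


residual = 12.0  # hypothetical memory left behind by the previous tenant
for total in (24.0, 80.0):
    # Same residual, different verdicts from the free-memory gate.
    print(total, free_gate_passes(total, residual), residual_gate_passes(residual))
```

With the same 12 GiB residual, the free-memory gate fails on the 24 GB card and passes on the 80 GB card, while the residual gate gives the same answer on both. The free-memory gate would also pass an 80 GB card with 50 GiB still in use, which is the case the residual gate is meant to catch.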

The final gate intentionally uses whole-GPU memory.used over the overlap GPUs. This is broader than SGLang process-tree memory: it also catches non-SGLang co-tenants such as Megatron, Miles, vLLM, or orphan processes that can still occupy VRAM and block the next tenant.

The paired MILES PR logs SGLang-specific process-resident and /server_info diagnostics for attribution, but those diagnostics are not the hard gate.

What This PR Does

rlix/utils/env.py
- adds parse_env_positive_float
rlix/pipeline/miles_coordinator.py
- forwards MILES_MAX_RESIDUAL_GPU_MEM_GB into per-pipeline runtime env
- parses it with default 13.0
- passes it to shrink_engines(post_sleep_vram_threshold_gb=...)
rlix/pipeline/miles_pipeline.py
- removes the old target_free_gb = 20.0 free-memory hard gate
- keeps state == offloaded polling as the liveness gate
- probes whole-GPU nvidia-smi --query-gpu=memory.used over overlap GPUs
- raises if whole-GPU residual exceeds MILES_MAX_RESIDUAL_GPU_MEM_GB
- fails open if the nvidia-smi probe is unavailable
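
For readers without the diff open, a helper like parse_env_positive_float could look roughly like the sketch below. The signature, the optional env mapping, and the choice to raise on malformed values are assumptions for illustration; the actual helper in rlix/utils/env.py may differ.

```python
import math
import os
from typing import Mapping, Optional


def parse_env_positive_float(
    name: str,
    default: float,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Return env[name] as a finite float > 0, or default if unset or blank.

    Hypothetical signature; raises ValueError on malformed or non-positive input.
    """
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError as exc:
        raise ValueError(f"{name}={raw!r} is not a number") from exc
    # float() accepts "nan" and "inf", so reject them explicitly.
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"{name}={raw!r} must be a finite positive number")
    return value


threshold_gb = parse_env_positive_float(
    "MILES_MAX_RESIDUAL_GPU_MEM_GB", 13.0, env={"MILES_MAX_RESIDUAL_GPU_MEM_GB": "10.5"}
)
print(threshold_gb)
```

Passing an explicit env mapping keeps the sketch testable without mutating os.environ; when env is omitted it reads the process environment.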

Gate Semantics

The hard gate checks:

max(nvidia-smi memory.used over overlap GPUs) <= MILES_MAX_RESIDUAL_GPU_MEM_GB

This is a whole-GPU availability gate, not a SGLang-only attribution gate.

If it fails, the error message explicitly notes that the residual may come from non-SGLang co-tenants:

Megatron / Miles / vLLM / orphan processes

SGLang process-resident diagnostics are logged engine-side by the paired MILES PR to help determine whether SGLang itself is responsible.
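
The gate described above can be sketched as follows. This is not the code in rlix/pipeline/miles_pipeline.py: the function names, the exception type, the timeout, and the exact error text are assumptions. Only the nvidia-smi query field, the max-over-overlap-GPUs comparison, and the fail-open behavior come from this description.

```python
import subprocess
from typing import List, Optional, Sequence

MIB_PER_GIB = 1024.0


def parse_used_gib(smi_output: str) -> List[float]:
    """Parse `--format=csv,noheader,nounits` output (one MiB value per line)."""
    return [float(line) / MIB_PER_GIB for line in smi_output.splitlines() if line.strip()]


def probe_used_gib(gpu_ids: Sequence[int]) -> Optional[List[float]]:
    """Return per-GPU used memory in GiB, or None if the probe is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
        "-i", ",".join(str(i) for i in gpu_ids),
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=10)
        return parse_used_gib(out.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        # Missing binary, non-zero exit, timeout, or unparsable output.
        return None


def check_residual_gate(used_gib: Optional[Sequence[float]], max_residual_gb: float) -> None:
    """Raise if the worst overlap GPU exceeds the threshold; fail open on no data."""
    if not used_gib:
        return  # probe unavailable: fail open
    worst = max(used_gib)
    if worst > max_residual_gb:
        raise RuntimeError(
            f"whole-GPU residual {worst:.2f} GiB > {max_residual_gb:.2f} GiB; "
            "may come from non-SGLang co-tenants (Megatron / Miles / vLLM / orphan processes)"
        )


# Usage with canned probe output, so the example runs without a GPU.
check_residual_gate(parse_used_gib("6400\n12257\n"), 13.0)
```

Splitting parsing from the subprocess call lets the threshold logic be unit-tested on machines without nvidia-smi; on a GPU host the call would be check_residual_gate(probe_used_gib([0, 3]), threshold).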

Default 13.0 Rationale

13.0 GiB is a temporary smoke-safe whole-GPU threshold, not a model-derived final value.

Recent smokes showed:

RTX PRO 6000 96GB:
  whole-GPU residual peak: 11.95-11.97 GiB
  SGLang process-resident diagnostic: 2.516-2.535 GiB

Earlier RTX 5090 smoke:
  whole-GPU residual peak: ~12.5 GiB

A 10.0 GiB threshold would false-fail current smokes because the known train/co-tenant residual can exceed 10 GiB. 13.0 GiB keeps the whole-GPU gate enabled without failing on the current known Megatron train-offload residual.
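
The threshold choice can be checked with a few lines of arithmetic over the peaks quoted above. This sketch is illustrative only; the RTX 5090 figure is the approximate ~12.5 GiB value from the earlier smoke, and the helper name is hypothetical.

```python
# Peak whole-GPU residuals quoted in this PR description (GiB).
observed_peaks_gib = {
    "RTX PRO 6000 96GB": 11.97,
    "RTX 5090 (earlier smoke)": 12.5,  # approximate value
}


def headroom(threshold_gib: float) -> dict:
    """Margin between the threshold and each observed peak (negative = gate fails)."""
    return {gpu: round(threshold_gib - peak, 2) for gpu, peak in observed_peaks_gib.items()}


for threshold in (10.0, 13.0):
    margins = headroom(threshold)
    all_pass = all(margin >= 0 for margin in margins.values())
    print(threshold, margins, all_pass)
```

At 10.0 GiB both observed peaks exceed the threshold, so the gate would fail; at 13.0 GiB both pass, with roughly 0.5 GiB of margin over the larger peak.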

This is intentionally temporary. Megatron train-offload coverage is tracked as a separate follow-up. After that is fixed, we should re-measure clean whole-GPU residual and lower MILES_MAX_RESIDUAL_GPU_MEM_GB.

Diff Baseline Note

This is a clean branch off the latest zhenyu/miles-mvp-e2e. The closed #11 used intermediate thresholds while we investigated which signal to gate on. This PR's effective change is:

20.0 GiB free-memory gate
  -> 13.0 GiB whole-GPU residual used-memory gate

Tests

python -m pytest -q tests/test_env_utils.py tests/test_miles_residual_threshold_wiring.py

Result:

6 passed

E2E Verification

Vast dual smoke with paired MILES branch:

SGLang diagnostic:
process_resident_max=2.516-2.535 GiB
whole_gpu_threshold=13.000 GiB

Whole-GPU hard gate:
whole-GPU mem used max=6.25 / 6.27 GiB across overlap GPUs [0]/[3]
whole-GPU mem used max=11.95 / 11.97 GiB across overlap GPUs [0]/[3]
threshold=13.00 GiB

mp2 training loop complete
mp1 training loop complete
shutdown_hard complete for both pipelines
EXIT_CODE=0

Known SharedStorage actor-unavailable warnings and shutdown-time RolloutManager 500 / RemoteProtocolError teardown noise may appear in the logs. They did not affect the outcome: training completed, both pipelines reached shutdown_hard, and EXIT_CODE=0.

Scope

Gate signal + configurability only. No model-size-derived threshold. No Megatron train-offload fix in this PR.

Follow-up: fix Megatron train-offload coverage, re-measure clean whole-GPU residual, then lower MILES_MAX_RESIDUAL_GPU_MEM_GB from the temporary 13.0 GiB.

Refs: plans/m11-review.review-report/R02.md (R02-01, MEDIUM).

howard989 added 2 commits May 25, 2026 00:12

fix(miles): forward residual GPU threshold env

756e426

feat(miles): gate shrink on per-GPU resident process memory

64578cc

howard989 mentioned this pull request May 25, 2026

feat(miles): per-engine process-resident GPU residual gate + forward MILES_MAX_RESIDUAL_GPU_MEM_GB) rlops/rlix#17

Open

howard989 force-pushed the howard/m11-forward-residual-gpu-env-v2 branch from da068b3 to 64578cc Compare May 25, 2026 23:15

fix(miles): keep SGLang residual checks diagnostic-only

7efb290

howard989 wants to merge 3 commits into rlops:zhenyu/m11-mvp-test from howard989:howard/m11-forward-residual-gpu-env-v2
