fix(rlix): gate MILES wake on per-process residual GPU memory (R02-01) #5
Open
howard989 wants to merge 3 commits into
What
Replace the hardcoded free-memory gate in `MilesPipeline._wait_for_overlap_engines_offloaded()` with a configurable whole-GPU residual used-memory hard gate.
The threshold is controlled by: `MILES_MAX_RESIDUAL_GPU_MEM_GB`
Default: 13.0 GiB
Sender side: rlops/miles PR #5.
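As a rough illustration of how a configurable threshold like this can be read, here is a minimal sketch. The PR names a helper called `parse_env_positive_float` but does not show it, so the signature, the `env` parameter, and the error wording below are assumptions made for the example, not the actual implementation.

```python
import math
import os

# Default stated in this PR description.
DEFAULT_MAX_RESIDUAL_GPU_MEM_GB = 13.0


def parse_env_positive_float(name, default, env=None):
    """Return env[name] as a finite float > 0, or `default` when unset or blank.

    Hypothetical signature: the real helper in rlix/utils/env.py may differ.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{name}={raw!r} is not a number") from None
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"{name}={raw!r} must be a finite positive number")
    return value


threshold_gb = parse_env_positive_float(
    "MILES_MAX_RESIDUAL_GPU_MEM_GB", DEFAULT_MAX_RESIDUAL_GPU_MEM_GB
)
```

Rejecting zero, negative, and non-finite values up front means a typo in the environment surfaces as a configuration error rather than as a gate that can never pass or never fail.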
Why
Per @taoluo review (R02-01): "free memory is gpu-model dependent e.g. 24gb vs 80gb gpu. it would be more robust to check the residual memory allocation."
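To make the reviewer's point concrete, the short calculation below shows how much used memory a fixed 20 GiB free-memory requirement would tolerate on the two GPU sizes mentioned in the quote. It treats 24 and 80 as nominal total capacities in GiB, which is a simplification; the function name is made up for this example.

```python
TARGET_FREE_GB = 20.0  # the previous hardcoded free-memory requirement


def max_used_allowed_by_free_gate(total_gb, target_free_gb=TARGET_FREE_GB):
    """Used memory that still satisfies `free >= target_free_gb` on a GPU of this size."""
    return total_gb - target_free_gb


for total_gb in (24.0, 80.0):
    allowed = max_used_allowed_by_free_gate(total_gb)
    print(f"{total_gb:.0f} GiB GPU: free-memory gate tolerates up to {allowed:.0f} GiB used")
```

The same free-memory threshold tolerates only 4 GiB of residual on the smaller GPU but 60 GiB on the larger one, whereas a used-memory threshold means the same thing on both.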
The base used `target_free_gb = 20.0` against `nvidia-smi --query-gpu=memory.free`, which is GPU-capacity dependent and not portable. The condition before `wake_up` should be: "the previous tenant released enough GPU memory", i.e. residual used memory.
The final gate intentionally uses whole-GPU `memory.used` over the overlap GPUs. This is broader than SGLang process-tree memory: it also catches non-SGLang co-tenants such as Megatron, Miles, vLLM, or orphan processes that can still occupy VRAM and block the next tenant.
The paired MILES PR logs SGLang-specific process-resident and `/server_info` diagnostics for attribution, but those diagnostics are not the hard gate.
What This PR Does
`rlix/utils/env.py`: `parse_env_positive_float`
`rlix/pipeline/miles_coordinator.py`: `MILES_MAX_RESIDUAL_GPU_MEM_GB` into per-pipeline runtime env; `13.0`; `shrink_engines(post_sleep_vram_threshold_gb=...)`
`rlix/pipeline/miles_pipeline.py`: `target_free_gb = 20.0` free-memory hard gate; `state == offloaded` polling as the liveness gate; `nvidia-smi --query-gpu=memory.used` over overlap GPUs; `MILES_MAX_RESIDUAL_GPU_MEM_GB`; `nvidia-smi` probe is unavailable
Gate Semantics
The hard gate checks:
This is a whole-GPU availability gate, not a SGLang-only attribution gate.
If it fails, the error message explicitly notes that the residual may come from non-SGLang co-tenants:
SGLang process-resident diagnostics are logged engine-side by the paired MILES PR to help determine whether SGLang itself is responsible.
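The gate described in this section could look roughly like the sketch below. It is not the code from this PR: the function names, the per-GPU (rather than summed) comparison, and the error wording are assumptions for illustration. The `nvidia-smi` flags used are the standard `--query-gpu` and `--format=csv,noheader,nounits` options, with `memory.used` reported in MiB.

```python
import subprocess

MIB_PER_GIB = 1024.0


def parse_used_gib(csv_text):
    """Parse 'index, memory.used' rows (MiB, no header, no units) into {index: GiB}."""
    used = {}
    for line in csv_text.strip().splitlines():
        index, used_mib = (field.strip() for field in line.split(","))
        used[int(index)] = float(used_mib) / MIB_PER_GIB
    return used


def query_used_gib(gpu_ids):
    """Probe whole-GPU used memory; raises if nvidia-smi is missing or fails."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True, timeout=30,
    ).stdout
    all_used = parse_used_gib(out)
    missing = sorted(set(gpu_ids) - set(all_used))
    if missing:
        raise RuntimeError(f"nvidia-smi reported no data for GPUs {missing}")
    return {gpu: all_used[gpu] for gpu in gpu_ids}


def check_residual_gate(used_gib, max_residual_gb):
    """Raise if any overlap GPU still has more used memory than the threshold."""
    over = {gpu: round(gib, 2) for gpu, gib in used_gib.items() if gib > max_residual_gb}
    if over:
        # Illustrative wording only; the real message is not shown in this description.
        raise RuntimeError(
            f"residual GPU memory above {max_residual_gb} GiB on GPUs {over}; "
            "the residual may come from non-SGLang co-tenants"
        )


# Typical use on a machine with NVIDIA GPUs (overlap GPU ids are an example):
# check_residual_gate(query_used_gib([0, 1]), max_residual_gb=13.0)
```

Keeping the parsing and the comparison separate from the `nvidia-smi` call makes both testable on machines without a GPU, and letting the probe raise matches the intent that an unavailable probe should not silently pass the gate.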
Default 13.0 Rationale
`13.0 GiB` is a temporary smoke-safe whole-GPU threshold, not a model-derived final value.
Recent smokes showed:
A `10.0 GiB` threshold would false-fail current smokes because the known train/co-tenant residual can exceed 10 GiB. `13.0 GiB` keeps the whole-GPU gate enabled without failing on the current known Megatron train-offload residual.
This is intentionally temporary. Megatron train-offload coverage is tracked as a separate follow-up. After that is fixed, we should re-measure clean whole-GPU residual and lower `MILES_MAX_RESIDUAL_GPU_MEM_GB`.
Diff Baseline Note
This is a clean branch off latest `zhenyu/miles-mvp-e2e`. The closed #11 used intermediate thresholds while we investigated signal choice. This PR's effective change is:
Tests
Result:
E2E Verification
Vast dual smoke with paired MILES branch:
Known `SharedStorage actor unavailable` warnings and shutdown-time `RolloutManager` 500 / `RemoteProtocolError` teardown noise may appear. Training completed, both pipelines reached `shutdown_hard`, and `EXIT_CODE=0`.
Scope
Gate signal + configurability only. No model-size-derived threshold. No Megatron train-offload fix in this PR.
Follow-up: fix Megatron train-offload coverage, re-measure clean whole-GPU residual, then lower `MILES_MAX_RESIDUAL_GPU_MEM_GB` from the temporary `13.0 GiB`.
Refs: `plans/m11-review.review-report/R02.md` (R02-01, MEDIUM).