[autorevert] born-red: order-independent detection (pending head + interleaved gaps)#8179
[autorevert] born-red: order-independent detection (pending head + interleaved gaps)#8179izaitsevfb wants to merge 9 commits into
Conversation
…terleaved gaps) The strict `[FAIL...][EMPTY...]` `partition_born_red` (from #8107) never fired on real trunk signals: the newest commits are always in-progress (pending) and the test isn't observed on every commit (sharding / TD), so empty commits are interleaved among the failures. Both defeated the contiguous-shape walk — a month of production `misc.autorevert_state` data shows 0 born-red dispatches across 11 real born-red episodes (e.g. inductor `test_flip_zero_dim_dynamic_shapes_cuda`, 2026-05-21). Replace the shape walk with an order-independent set predicate: - no successes (caller-guaranteed), - >=1 failing commit; suspect = oldest failing commit, - >=1 *concluded baseline* commit (`job_group_concluded` and no events) older than the suspect — positive proof the test was absent there, as opposed to a commit whose jobs are merely still pending and carry no information, - introduction-boundary gate: defer (return None) if an unconcluded commit sits between the suspect and the nearest older baseline (it could itself be the real introducer; dispatching on a too-new suspect makes the advisor read the wrong diff). Add a per-commit `SignalCommit.job_group_concluded` flag, set in `_build_test_signals` (>=1 non-cancelled run for the test's (workflow, job_base) group, none pending). The >=2-distinct-failing-commits threshold and advisor dispatch/dedup are unchanged. Tests: rewrite TestBornRedTestSignal for the new predicate (fires despite leading pending and interleaved empties; defers on a pending introduction gap; requires a concluded baseline), add a regression built from the real flip_zero_dim shape, and a `_build_test_signals` flag test.
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
…tion-boundary defer
The introduction-boundary gate deferred (returned None) whenever an
unconcluded/pending commit sat between the suspect and the nearest older
concluded baseline, to avoid dispatching the advisor on a too-new suspect.
On the agreed semantics that is too conservative: for the common real shape
P P P F F P P P C P C (newest -> oldest; C = concluded, no test events)
the gate hit the first pending below the oldest failure and deferred, so the
advisor never fired even though there is a concluded baseline proving the test
was absent.
Drop the gate: dispatch on the oldest *observed* failure now. Pending commits
are ignored wherever they sit (head or introduction gap). If a pending gap
commit later concludes to a failure older than the current suspect, the
suspect moves and a second advisor runs on the corrected commit; the stale
verdict on the too-new suspect is keyed to that commit (dedup/caps in
SignalActionsLogger are per-(commit, signal_key)) and is never consulted for
the corrected suspect. On real autorevert_state data this costs <=1 superfluous
advisor run per episode, usually 0. Accepted precision/latency tradeoff: a
too-new suspect almost always returns not_related, bounded by the advisor
confidence gate.
Tests: flip test_partition_born_red_defers_on_pending_in_introduction_gap ->
fires (suspect = oldest observed failure); add the full P P P F F P P P C P C
regression.
…k coverage) When a ghstack of multiple commits lands and introduces an already-broken test, `push` only schedules jobs for the stack head, so the stack-middle commits are *unrun* (no jobs, no events) — never observable on their own. Born-red could see the head fail but had no way to bisect which commit in the stack actually introduced the test. Reuse the green->red gap-closure machinery for the born-red path: - partition_born_red now populates `unknown` with the introduction gap: commits strictly between the suspect (oldest observed failure) and the newest concluded baseline. Unrun commits there are bisection candidates; pending commits are covered separators. - _handle_no_successes calls cover_gap_unknown_commits(bisection_limit) (the same GapBisectionPlanner the main path uses) and returns RestartCommits to schedule the bisection/linear coverage. This fires independent of the >=2-failure advisor gate — the ghstack case starts with a single observed failure and needs the restarts to gather coverage. - Option A: the advisor dispatches on the oldest observed failure *in parallel* with the gap restarts (RestartCommits carries the DispatchAdvisor). A not_related/garbage verdict on the suspect doesn't suppress the signal while the gap still has uncovered commits — we keep covering it. Thresholds unchanged: >=1 concluded baseline older than the suspect, >=2 distinct failing commits before advisor dispatch. Tests: gap population (unrun + pending), process-level restart-with-advisor, the ghstack single-failure case, pending-only gap (no restart), bisection-limit throttling, mixed pending/unrun, and block-/revert-verdict + uncovered gap (the latter documents the accepted Option-A revert-side tradeoff). Full lambda suite green (208), lintrunner clean.
The motivation/docstring claimed empty commits are interleaved among the failures due to sharding / TD — that's wrong: sharding and TD do not affect which tests are recorded per commit. The real-world interleaving we observed is *pending* commits (jobs not finished yet) sitting between the failures and the older baseline. And the sources of an *unrun* gap commit (no job result) are: a job never scheduled (ghstack lands a stack, `push` only runs the stack head), the test did not run, or it crashed without reporting. Doc-only.
Describe the current intended behavior + rationale concisely; drop the narrative of how the detector evolved. Doc-only.
Scope the partition to the most-recent concluded baseline (inclusive) and everything newer. A concluded baseline (job group ran, test absent) NEWER than the failures means the test was removed / disabled again — recovered — so the stale failures below it must not trigger a born-red dispatch. `has_successes()` still checks the full window (any success anywhere disqualifies born-red). This is the born-red analog of detect_fixed()'s recovered-at-head guard, which can't catch this case because born-red has no success event to key off. Consequence: a concluded baseline BETWEEN two failures is now a recovery boundary, not a 'test not observed here' gap — the partition scopes to the newer episode. (Real-world interleaving between failures is *pending* commits, which are unaffected.) Tests: add recovered-at-head (None) and scoped-to-most-recent-baseline; update canonical / pending-head-and-gap to the single most-recent baseline; fix the real-flip-shape regression to use a pending interleaved commit (matches real data) instead of a concluded baseline. Full lambda suite green (209).
|
Can you verify the implementation is correct, point to any edge cases not being handled, find where the code could be improved. |
|
Claude finished @jeanschmidt's task in 11m 53s —— View job Review: born-red order-independent detectionI read ✅ What's correct
|
|
I am not sure if the case that claude discussed is a case we can actually experience on pytorch, but yeah, a test that runs on a conditional job might be an edge case we're not handling. Do you think this is something we should be handling? |
…luded A fully-skipped job group (is_skipped: an `if:` gate, or a required-check skip when an upstream dependency failed/cancelled) produces no event and JobMeta.status -> None — it is 'missing', same as cancelled. But the job_group_concluded computation only excluded is_cancelled, so a skipped group incremented group_runs without pending and became a concluded-empty commit = a false born-red baseline witness (positive 'proof' the test was absent when it simply was never run there). Mirror the existing cancelled handling: treat cancelled OR skipped as missing. Skipped test jobs are pervasive on trunk (conditionally-gated configs), so this otherwise fabricates baselines / mis-scopes the partition on many commits. Found by the PR's Claude review. Adds test_job_group_concluded_excludes_skipped_group. Full lambda suite green (210).
…ude workflow_dispatch)
A born-red baseline ('test was genuinely absent here') must be proven by a
NATURAL (push/schedule) run. Autorevert's own restarts are workflow_dispatch
runs that are job/test-filtered: the job concludes but runs only the filtered
test(s), so for any OTHER test sharing the (workflow, job_base) such a run
concludes with no event — which the v2 extractor was counting as a false
'test absent' baseline. (The old autorevert_checker already excluded
workflow_dispatch; the v2 datasource query did not.)
Thread workflow_event through JobRow -> JobMeta.is_workflow_dispatch, and in
_build_test_signals exclude workflow_dispatch runs from group_runs/group_pending
so they cannot establish job_group_concluded. Dispatch runs STILL emit their
outcome/pending events, so a gap-restart that the test fails still surfaces a
FAILURE and moves the suspect, and an in-flight restart still acts as a covered
separator (no duplicate restart).
Bisection still converges: failures move the suspect via emitted events; a
dispatch that confirms absence no longer shrinks the gap (it used to via the
false baseline), so that commit stays restart-eligible, but churn is hard-bounded
by the existing per-(workflow,commit) restart cap (5) + 20min pacing + 20-60min
backoff. Recovered-at-head is also no longer falsely triggered by a filtered
dispatch concluded-empty at the head.
Tests: test_workflow_dispatch_run_does_not_establish_baseline (dispatch-only
concluded + no event -> not a baseline -> not born-red) and
test_workflow_dispatch_failure_still_emits_event (dispatch failure still emits,
still doesn't conclude the group). Full lambda suite green (212).
Addresses the second half of the PR's Claude review + Ivan's dispatch-filter point.
…atch_run)
After excluding workflow_dispatch runs from establishing a baseline, a
probed-but-absent gap commit looked identical to never-run (no events,
job_group_concluded=False), so cover_gap re-selected it every tick and the
same commit got re-dispatched up to the soft action-layer cap (5). That churn
is not acceptable — a gap commit should be probed at most once.
Record the presence of a restart on the commit explicitly: new
SignalCommit.has_dispatch_run, set in _build_test_signals when a non-cancelled/
non-skipped workflow_dispatch run exists for the test's job group (pending or
concluded). cover_gap_unknown_commits now treats such a commit as a covered
separator (alongside pending), so it is never re-selected for restart:
covered = [bool(c.events) or c.has_dispatch_run for c in self.unknown]
A first probe that finds the test failing still emits a FAILURE event (the
commit leaves the gap and moves the suspect); a probe that finds it absent
emits no event but has_dispatch_run keeps it covered — so exactly one dispatch
per gap commit. Cancelled/skipped dispatch runs are ignored (they did not run),
so a genuinely-failed dispatch can still be retried. has_dispatch_run is
job-group level, so a sibling test's restart on the same commit also marks it
probed — at worst an under-probe (suspect stays too-new), never a false
baseline or wrong revert (no event/baseline is created).
Tests: cover_gap skips an already-dispatched gap commit and restarts only the
unprobed one; all-dispatched gap → no restart, advisor still dispatches;
extraction sets has_dispatch_run from a workflow_dispatch run (False for
natural-only); replace()/sentinels updated. Full lambda suite green (215).
Addresses Ivan's 'prevent multiple dispatches on the same commit'.
Summary
Make the born-red test-signal detector (#8107) fire on real trunk signals,
cover the ghstack-landing case where the introducing commit never ran jobs, and
gate the "recovered" case where the test was removed/disabled again.
A test introduced already-broken (added / enabled / un-skipped / moved-into-existence)
has no green to bisect, so the green→red path bails with
NO_SUCCESSESand theadvisor is never asked. The old detector required a strict contiguous
[FAIL…][EMPTY…]shape that never occurs in production — 0 dispatches over 11real born-red episodes in a month of
autorevert_state. Real signals carrypending commits (jobs not finished yet) at the head and interleaved
between the failures and the older baseline, which the contiguous walk rejected.
What it checks
Notation, newest→oldest:
Ffailing ·Cconcluded baseline (job ran, testabsent) ·
Ppending (job running) ·Uunrun (no job result — ghstackstack-middle
pushnever scheduled, test didn't run, or test crashed withoutreporting).
Test-track only; the caller guarantees no successful run anywhere in the window.
The partition is scoped to the most-recent concluded baseline
C(inclusive)and everything newer —
Cis positive proof the test was absent, so aCnewer than the failures means the test was removed/disabled again (recovered)
and the stale older failures are out of scope. Within that scope, born-red fires
when:
F(the introduction point).shards on one commit count as one observation).
has_successes()is still checked over the full window (any successanywhere disqualifies born-red).
Pcommits are ignored for the suspectdecision — we act on the oldest observed
Fnow instead of waiting.Ucommitsin the introduction gap (between the suspect and the baseline) are restarted —
bisection / linear coverage honoring
BISECTION_LIMIT, reusing the green→redgap-closure logic — to find the true introducer, with the advisor dispatched in
parallel.
Matches
F F CF, baselineCP P F F P CF(pending ignored; no restart)F F U CUand advisor on oldestF, in parallelF U CUonly (1 failure < 2 → no advisor yet)F F C F CF; the olderFis a prior, recovered episode (scoped out by the newerC)Does not act
C F F CF FF UC CF CHow
SignalCommit.job_group_concluded(set in_build_test_signals): the test's(workflow, job_base)group reached a terminal conclusion (≥1 non-cancelledrun, none pending) — distinguishes a real baseline from "no info yet".
partition_born_red: the scoped set predicate above.failed[-1]= suspect,successful[0]= the most-recent baseline,unknown= introduction gap._handle_no_successes: dispatches the advisor and/or returnsRestartCommits(reusing
cover_gap_unknown_commits+GapBisectionPlanner).Test plan
Full lambda suite green (209, 1 skipped); lintrunner clean.
TestBornRedTestSignalcovers each match / no-act row (incl. recovered-at-head and most-recent-baseline
scoping), gap population, restart + advisor parallelism, the ghstack
single-failure case,
bisection_limitthrottling, and block- / revert-verdictwith an uncovered gap.
Notes
Cross-model review (DP17, Codex gpt-5.5): SHIP / SHIP-WITH-NITS across the
predicate, gate removal, gap-restart, and recovered-at-head gate; nits
addressed. No advisor-prompt change.