[autorevert] born-red: order-independent detection (pending head + interleaved gaps) by izaitsevfb · Pull Request #8179 · pytorch/test-infra

izaitsevfb · 2026-06-16T00:07:04Z

Summary

Make the born-red test-signal detector (#8107) fire on real trunk signals,
cover the ghstack-landing case where the introducing commit never ran jobs, and
gate the "recovered" case where the test was removed/disabled again.

A test introduced already-broken (added / enabled / un-skipped / moved-into-existence)
has no green to bisect, so the green→red path bails with NO_SUCCESSES and the
advisor is never asked. The old detector required a strict contiguous
[FAIL…][EMPTY…] shape that never occurs in production — 0 dispatches over 11
real born-red episodes in a month of autorevert_state. Real signals carry
pending commits (jobs not finished yet) at the head and interleaved
between the failures and the older baseline, which the contiguous walk rejected.

What it checks

Notation, newest→oldest: F failing · C concluded baseline (job ran, test
absent) · P pending (job running) · U unrun (no job result — ghstack
stack-middle push never scheduled, test didn't run, or test crashed without
reporting).

Test-track only; the caller guarantees no successful run anywhere in the window.
The partition is scoped to the most-recent concluded baseline C (inclusive)
and everything newer — C is positive proof the test was absent, so a C
newer than the failures means the test was removed/disabled again (recovered)
and the stale older failures are out of scope. Within that scope, born-red fires
when:

≥1 failing commit. Suspect = the oldest in-scope F (the introduction point).
≥2 distinct failing commits before the advisor is dispatched (retries /
shards on one commit count as one observation).

has_successes() is still checked over the full window (any success
anywhere disqualifies born-red). P commits are ignored for the suspect
decision — we act on the oldest observed F now instead of waiting. U commits
in the introduction gap (between the suspect and the baseline) are restarted —
bisection / linear coverage honoring BISECTION_LIMIT, reusing the green→red
gap-closure logic — to find the true introducer, with the advisor dispatched in
parallel.

Matches

signal	action
`F F C`	advisor on oldest `F`, baseline `C`
`P P F F P C`	advisor on oldest `F` (pending ignored; no restart)
`F F U C`	restart `U` and advisor on oldest `F`, in parallel
`F U C`	restart `U` only (1 failure < 2 → no advisor yet)
`F F C F C`	advisor on the oldest in-scope `F`; the older `F` is a prior, recovered episode (scoped out by the newer `C`)

Does not act

signal	why
any success in window	not born-red — the green→red path handles it
`C F F C`	recovered: a baseline newer than the failures (test removed/disabled) → out of scope
`F F`	no concluded baseline → can't prove the test was ever absent (chronic)
`F U`	only an unrun commit below → no proof of absence
`C C`	no failures
`F C`	partition forms but advisor held — needs ≥2 distinct failures

How

SignalCommit.job_group_concluded (set in _build_test_signals): the test's
(workflow, job_base) group reached a terminal conclusion (≥1 non-cancelled
run, none pending) — distinguishes a real baseline from "no info yet".
partition_born_red: the scoped set predicate above. failed[-1] = suspect,
successful[0] = the most-recent baseline, unknown = introduction gap.
_handle_no_successes: dispatches the advisor and/or returns RestartCommits
(reusing cover_gap_unknown_commits + GapBisectionPlanner).

Test plan

Full lambda suite green (209, 1 skipped); lintrunner clean. TestBornRedTestSignal
covers each match / no-act row (incl. recovered-at-head and most-recent-baseline
scoping), gap population, restart + advisor parallelism, the ghstack
single-failure case, bisection_limit throttling, and block- / revert-verdict
with an uncovered gap.

Notes

Cross-model review (DP17, Codex gpt-5.5): SHIP / SHIP-WITH-NITS across the
predicate, gate removal, gap-restart, and recovered-at-head gate; nits
addressed. No advisor-prompt change.

…terleaved gaps) The strict `[FAIL...][EMPTY...]` `partition_born_red` (from #8107) never fired on real trunk signals: the newest commits are always in-progress (pending) and the test isn't observed on every commit (sharding / TD), so empty commits are interleaved among the failures. Both defeated the contiguous-shape walk — a month of production `misc.autorevert_state` data shows 0 born-red dispatches across 11 real born-red episodes (e.g. inductor `test_flip_zero_dim_dynamic_shapes_cuda`, 2026-05-21). Replace the shape walk with an order-independent set predicate: - no successes (caller-guaranteed), - >=1 failing commit; suspect = oldest failing commit, - >=1 *concluded baseline* commit (`job_group_concluded` and no events) older than the suspect — positive proof the test was absent there, as opposed to a commit whose jobs are merely still pending and carry no information, - introduction-boundary gate: defer (return None) if an unconcluded commit sits between the suspect and the nearest older baseline (it could itself be the real introducer; dispatching on a too-new suspect makes the advisor read the wrong diff). Add a per-commit `SignalCommit.job_group_concluded` flag, set in `_build_test_signals` (>=1 non-cancelled run for the test's (workflow, job_base) group, none pending). The >=2-distinct-failing-commits threshold and advisor dispatch/dedup are unchanged. Tests: rewrite TestBornRedTestSignal for the new predicate (fires despite leading pending and interleaved empties; defers on a pending introduction gap; requires a concluded baseline), add a regression built from the real flip_zero_dim shape, and a `_build_test_signals` flag test.

vercel · 2026-06-16T00:07:10Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
torchci	Ignored	Preview	Jun 18, 2026 6:48pm

…tion-boundary defer The introduction-boundary gate deferred (returned None) whenever an unconcluded/pending commit sat between the suspect and the nearest older concluded baseline, to avoid dispatching the advisor on a too-new suspect. On the agreed semantics that is too conservative: for the common real shape P P P F F P P P C P C (newest -> oldest; C = concluded, no test events) the gate hit the first pending below the oldest failure and deferred, so the advisor never fired even though there is a concluded baseline proving the test was absent. Drop the gate: dispatch on the oldest *observed* failure now. Pending commits are ignored wherever they sit (head or introduction gap). If a pending gap commit later concludes to a failure older than the current suspect, the suspect moves and a second advisor runs on the corrected commit; the stale verdict on the too-new suspect is keyed to that commit (dedup/caps in SignalActionsLogger are per-(commit, signal_key)) and is never consulted for the corrected suspect. On real autorevert_state data this costs <=1 superfluous advisor run per episode, usually 0. Accepted precision/latency tradeoff: a too-new suspect almost always returns not_related, bounded by the advisor confidence gate. Tests: flip test_partition_born_red_defers_on_pending_in_introduction_gap -> fires (suspect = oldest observed failure); add the full P P P F F P P P C P C regression.

…k coverage) When a ghstack of multiple commits lands and introduces an already-broken test, `push` only schedules jobs for the stack head, so the stack-middle commits are *unrun* (no jobs, no events) — never observable on their own. Born-red could see the head fail but had no way to bisect which commit in the stack actually introduced the test. Reuse the green->red gap-closure machinery for the born-red path: - partition_born_red now populates `unknown` with the introduction gap: commits strictly between the suspect (oldest observed failure) and the newest concluded baseline. Unrun commits there are bisection candidates; pending commits are covered separators. - _handle_no_successes calls cover_gap_unknown_commits(bisection_limit) (the same GapBisectionPlanner the main path uses) and returns RestartCommits to schedule the bisection/linear coverage. This fires independent of the >=2-failure advisor gate — the ghstack case starts with a single observed failure and needs the restarts to gather coverage. - Option A: the advisor dispatches on the oldest observed failure *in parallel* with the gap restarts (RestartCommits carries the DispatchAdvisor). A not_related/garbage verdict on the suspect doesn't suppress the signal while the gap still has uncovered commits — we keep covering it. Thresholds unchanged: >=1 concluded baseline older than the suspect, >=2 distinct failing commits before advisor dispatch. Tests: gap population (unrun + pending), process-level restart-with-advisor, the ghstack single-failure case, pending-only gap (no restart), bisection-limit throttling, mixed pending/unrun, and block-/revert-verdict + uncovered gap (the latter documents the accepted Option-A revert-side tradeoff). Full lambda suite green (208), lintrunner clean.

The motivation/docstring claimed empty commits are interleaved among the failures due to sharding / TD — that's wrong: sharding and TD do not affect which tests are recorded per commit. The real-world interleaving we observed is *pending* commits (jobs not finished yet) sitting between the failures and the older baseline. And the sources of an *unrun* gap commit (no job result) are: a job never scheduled (ghstack lands a stack, `push` only runs the stack head), the test did not run, or it crashed without reporting. Doc-only.

Describe the current intended behavior + rationale concisely; drop the narrative of how the detector evolved. Doc-only.

Scope the partition to the most-recent concluded baseline (inclusive) and everything newer. A concluded baseline (job group ran, test absent) NEWER than the failures means the test was removed / disabled again — recovered — so the stale failures below it must not trigger a born-red dispatch. `has_successes()` still checks the full window (any success anywhere disqualifies born-red). This is the born-red analog of detect_fixed()'s recovered-at-head guard, which can't catch this case because born-red has no success event to key off. Consequence: a concluded baseline BETWEEN two failures is now a recovery boundary, not a 'test not observed here' gap — the partition scopes to the newer episode. (Real-world interleaving between failures is *pending* commits, which are unaffected.) Tests: add recovered-at-head (None) and scoped-to-most-recent-baseline; update canonical / pending-head-and-gap to the single most-recent baseline; fix the real-flip-shape regression to use a pending interleaved commit (matches real data) instead of a concluded baseline. Full lambda suite green (209).

jeanschmidt · 2026-06-17T20:23:30Z

@claude

Can you verify the implementation is correct, point to any edge cases not being handled, find where the code could be improved.

claude · 2026-06-17T20:24:01Z

Claude finished @jeanschmidt's task in 11m 53s —— View job

Review: born-red order-independent detection

I read signal.py, signal_extraction.py, job_agg_index.py, and both test files. Overall this is a careful, well-documented change with strong test coverage — the predicate, scoping, gap-restart, recovered-at-head gate, and ghstack single-failure path are all exercised. The core partition_born_red logic is sound. I found one correctness edge case worth fixing and a few smaller notes.

✅ What's correct

partition_born_red scoping. baseline_idx = first (newest) job_group_concluded and not events commit; window = commits[:baseline_idx+1]; suspect_idx = max(i for failing in window). Because baseline_idx is the newest concluded-empty commit and the suspect is the oldest failure newer than it, the gap window[suspect_idx+1:-1] provably contains only pending/unrun commits (no other failures, no other concluded baseline). The layout matches the docstring exactly.
Recovered-at-head gate. A baseline newer than all failures makes window hold no failure → None. Consistent with detect_fixed() (which skips the events-less head and returns False on the first failure, so born-red still owns the decision). Good.
job_group_concluded ⟺ no pending events holds: group_pending and the PENDING-event append are both driven by meta.is_pending, so a concluded commit never carries a pending event and the not c.events baseline test is exact.
cover_gap_unknown_commits correctly treats pending gap commits (which carry a PENDING event) as covered separators and unrun ones (no events) as restart candidates.

⚠️ Edge case: skipped job groups create a false born-red baseline

signal_extraction.py:427 only treats cancelled attempts as missing when computing job_group_concluded:

if meta.is_cancelled:
    continue
group_runs += 1
if meta.is_pending:
    group_pending = True

A fully-skipped job group has is_cancelled=False, is_pending=False, is_skipped=True, and produces no event (JobMeta.status → None). It therefore increments group_runs without setting group_pending, so the commit ends up job_group_concluded=True with empty events — i.e. it is misread as a born-red baseline witness (positive proof the test was absent).

But a skipped job did not run the test, so it is not proof of absence — job_agg_index.py:43-48 and the test_skipped_attempt_yields_no_event regression both establish the intended semantics: skipped is treated the same as cancelled / "missing". That invariant is broken here for the new flag.

Concrete impact: a commit where the test's required job was skipped (an if: gate, or a required-check skip when an upstream dep failed/cancelled) becomes a baseline. That can (a) manufacture a born-red detection where the only "baseline" is a skipped commit, and (b) wrongly move the scope boundary, scoping out real older failures. Low frequency, but it dispatches the advisor / a revert on a false premise.

Suggested fix — mirror the existing cancelled handling:

if meta.is_cancelled or meta.is_skipped:
    # cancelled/skipped attempts are missing — neither counts as the group
    # "concluding" (a skipped job never ran the test, so it is not proof the
    # test was absent).
    continue

A targeted extraction test (skipped group must yield job_group_concluded=False) would lock this in. Fix this →

Minor notes (non-blocking)

Option-A trade-off (acting on the oldest observed failure even if an unrun gap commit below could be a truer introducer) is well-reasoned and documented at signal.py:714-722. The bound (advisor only returns revert when it sees the introduction in that diff + the confidence gate) is reasonable, and the per-(commit, signal) verdict keying means a later-concluding older commit just relocates the suspect. Calling out explicitly that the stale verdict on the too-new commit is never reused is good — no change needed.
Block-verdict fall-through (signal.py:751-757): when the suspect already has a not_related/garbage verdict but the gap still has uncovered commits, the method drops the block and returns RestartCommits(advisor=None) to keep bisecting. Intentional per the docstring, but it does mean a garbage 2h-suppression is silently discarded for that tick. Acceptable since no new advisor is dispatched; worth a one-line inline comment noting the suppression is deliberately not surfaced while the gap is open.
Readability: partition_born_red returns successful=[self.commits[baseline_idx]] while the green→red path uses a multi-commit successful list. The single-element list works with successful[0], but a short comment that born-red intentionally carries exactly one baseline (already implied) would help the next reader reconcile the two partition producers.

Test plan

I could not execute the suite here (python/pytest are outside the permitted command set in this environment), so I verified by reading. The PR states the full lambda suite is green (209, 1 skipped) and lintrunner is clean; the born-red and extraction tests I read line-by-line are consistent with the implementation. If you apply the skipped-group fix, please re-run test_signal_extraction.py and add the suggested skipped-baseline case.

Summary: logic is correct for the cancelled/pending/unrun cases it models; the one real gap is that skipped job groups aren't excluded from job_group_concluded, which can fabricate a baseline. Recommend the small signal_extraction.py:427 fix + a regression test before merge; the rest are optional polish.
· branch iz/autorevert-born-red-set-predicate

jeanschmidt · 2026-06-17T20:47:02Z

I am not sure if the case that claude discussed is a case we can actually experience on pytorch, but yeah, a test that runs on a conditional job might be an edge case we're not handling. Do you think this is something we should be handling?

…luded A fully-skipped job group (is_skipped: an `if:` gate, or a required-check skip when an upstream dependency failed/cancelled) produces no event and JobMeta.status -> None — it is 'missing', same as cancelled. But the job_group_concluded computation only excluded is_cancelled, so a skipped group incremented group_runs without pending and became a concluded-empty commit = a false born-red baseline witness (positive 'proof' the test was absent when it simply was never run there). Mirror the existing cancelled handling: treat cancelled OR skipped as missing. Skipped test jobs are pervasive on trunk (conditionally-gated configs), so this otherwise fabricates baselines / mis-scopes the partition on many commits. Found by the PR's Claude review. Adds test_job_group_concluded_excludes_skipped_group. Full lambda suite green (210).

…ude workflow_dispatch) A born-red baseline ('test was genuinely absent here') must be proven by a NATURAL (push/schedule) run. Autorevert's own restarts are workflow_dispatch runs that are job/test-filtered: the job concludes but runs only the filtered test(s), so for any OTHER test sharing the (workflow, job_base) such a run concludes with no event — which the v2 extractor was counting as a false 'test absent' baseline. (The old autorevert_checker already excluded workflow_dispatch; the v2 datasource query did not.) Thread workflow_event through JobRow -> JobMeta.is_workflow_dispatch, and in _build_test_signals exclude workflow_dispatch runs from group_runs/group_pending so they cannot establish job_group_concluded. Dispatch runs STILL emit their outcome/pending events, so a gap-restart that the test fails still surfaces a FAILURE and moves the suspect, and an in-flight restart still acts as a covered separator (no duplicate restart). Bisection still converges: failures move the suspect via emitted events; a dispatch that confirms absence no longer shrinks the gap (it used to via the false baseline), so that commit stays restart-eligible, but churn is hard-bounded by the existing per-(workflow,commit) restart cap (5) + 20min pacing + 20-60min backoff. Recovered-at-head is also no longer falsely triggered by a filtered dispatch concluded-empty at the head. Tests: test_workflow_dispatch_run_does_not_establish_baseline (dispatch-only concluded + no event -> not a baseline -> not born-red) and test_workflow_dispatch_failure_still_emits_event (dispatch failure still emits, still doesn't conclude the group). Full lambda suite green (212). Addresses the second half of the PR's Claude review + Ivan's dispatch-filter point.

…atch_run) After excluding workflow_dispatch runs from establishing a baseline, a probed-but-absent gap commit looked identical to never-run (no events, job_group_concluded=False), so cover_gap re-selected it every tick and the same commit got re-dispatched up to the soft action-layer cap (5). That churn is not acceptable — a gap commit should be probed at most once. Record the presence of a restart on the commit explicitly: new SignalCommit.has_dispatch_run, set in _build_test_signals when a non-cancelled/ non-skipped workflow_dispatch run exists for the test's job group (pending or concluded). cover_gap_unknown_commits now treats such a commit as a covered separator (alongside pending), so it is never re-selected for restart: covered = [bool(c.events) or c.has_dispatch_run for c in self.unknown] A first probe that finds the test failing still emits a FAILURE event (the commit leaves the gap and moves the suspect); a probe that finds it absent emits no event but has_dispatch_run keeps it covered — so exactly one dispatch per gap commit. Cancelled/skipped dispatch runs are ignored (they did not run), so a genuinely-failed dispatch can still be retried. has_dispatch_run is job-group level, so a sibling test's restart on the same commit also marks it probed — at worst an under-probe (suspect stays too-new), never a false baseline or wrong revert (no event/baseline is created). Tests: cover_gap skips an already-dispatched gap commit and restarts only the unprobed one; all-dispatched gap → no restart, advisor still dispatches; extraction sets has_dispatch_run from a workflow_dispatch run (False for natural-only); replace()/sentinels updated. Full lambda suite green (215). Addresses Ivan's 'prevent multiple dispatches on the same commit'.

pytorch-bot Bot added the ci-no-td label Jun 16, 2026

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 16, 2026

Ivan added 5 commits June 16, 2026 13:50

[autorevert] born-red: tighten partition_born_red docstring

7f4f6cb

Describe the current intended behavior + rationale concisely; drop the narrative of how the detector evolved. Doc-only.

Ivan added 3 commits June 17, 2026 15:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[autorevert] born-red: order-independent detection (pending head + interleaved gaps)#8179

[autorevert] born-red: order-independent detection (pending head + interleaved gaps)#8179
izaitsevfb wants to merge 9 commits into
mainfrom
iz/autorevert-born-red-set-predicate

izaitsevfb commented Jun 16, 2026 •

edited

Loading

Uh oh!

vercel Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

jeanschmidt commented Jun 17, 2026

Uh oh!

claude Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

jeanschmidt commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

izaitsevfb commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What it checks

Matches

Does not act

How

Test plan

Notes

Uh oh!

vercel Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeanschmidt commented Jun 17, 2026

Uh oh!

claude Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: born-red order-independent detection

✅ What's correct

⚠️ Edge case: skipped job groups create a false born-red baseline

Minor notes (non-blocking)

Test plan

Uh oh!

jeanschmidt commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

izaitsevfb commented Jun 16, 2026 •

edited

Loading

vercel Bot commented Jun 16, 2026 •

edited

Loading

claude Bot commented Jun 17, 2026 •

edited

Loading