Skip to content

[autorevert] born-red: order-independent detection (pending head + interleaved gaps)#8179

Open
izaitsevfb wants to merge 9 commits into
mainfrom
iz/autorevert-born-red-set-predicate
Open

[autorevert] born-red: order-independent detection (pending head + interleaved gaps)#8179
izaitsevfb wants to merge 9 commits into
mainfrom
iz/autorevert-born-red-set-predicate

Conversation

@izaitsevfb

@izaitsevfb izaitsevfb commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Make the born-red test-signal detector (#8107) fire on real trunk signals,
cover the ghstack-landing case where the introducing commit never ran jobs, and
gate the "recovered" case where the test was removed/disabled again.

A test introduced already-broken (added / enabled / un-skipped / moved-into-existence)
has no green to bisect, so the green→red path bails with NO_SUCCESSES and the
advisor is never asked. The old detector required a strict contiguous
[FAIL…][EMPTY…] shape that never occurs in production — 0 dispatches over 11
real born-red episodes
in a month of autorevert_state. Real signals carry
pending commits (jobs not finished yet) at the head and interleaved
between the failures and the older baseline, which the contiguous walk rejected.

What it checks

Notation, newest→oldest: F failing · C concluded baseline (job ran, test
absent) · P pending (job running) · U unrun (no job result — ghstack
stack-middle push never scheduled, test didn't run, or test crashed without
reporting).

Test-track only; the caller guarantees no successful run anywhere in the window.
The partition is scoped to the most-recent concluded baseline C (inclusive)
and everything newer
C is positive proof the test was absent, so a C
newer than the failures means the test was removed/disabled again (recovered)
and the stale older failures are out of scope. Within that scope, born-red fires
when:

  1. ≥1 failing commit. Suspect = the oldest in-scope F (the introduction point).
  2. ≥2 distinct failing commits before the advisor is dispatched (retries /
    shards on one commit count as one observation).

has_successes() is still checked over the full window (any success
anywhere disqualifies born-red). P commits are ignored for the suspect
decision — we act on the oldest observed F now instead of waiting. U commits
in the introduction gap (between the suspect and the baseline) are restarted —
bisection / linear coverage honoring BISECTION_LIMIT, reusing the green→red
gap-closure logic — to find the true introducer, with the advisor dispatched in
parallel.

Matches

signal action
F F C advisor on oldest F, baseline C
P P F F P C advisor on oldest F (pending ignored; no restart)
F F U C restart U and advisor on oldest F, in parallel
F U C restart U only (1 failure < 2 → no advisor yet)
F F C F C advisor on the oldest in-scope F; the older F is a prior, recovered episode (scoped out by the newer C)

Does not act

signal why
any success in window not born-red — the green→red path handles it
C F F C recovered: a baseline newer than the failures (test removed/disabled) → out of scope
F F no concluded baseline → can't prove the test was ever absent (chronic)
F U only an unrun commit below → no proof of absence
C C no failures
F C partition forms but advisor held — needs ≥2 distinct failures

How

  • SignalCommit.job_group_concluded (set in _build_test_signals): the test's
    (workflow, job_base) group reached a terminal conclusion (≥1 non-cancelled
    run, none pending) — distinguishes a real baseline from "no info yet".
  • partition_born_red: the scoped set predicate above. failed[-1] = suspect,
    successful[0] = the most-recent baseline, unknown = introduction gap.
  • _handle_no_successes: dispatches the advisor and/or returns RestartCommits
    (reusing cover_gap_unknown_commits + GapBisectionPlanner).

Test plan

Full lambda suite green (209, 1 skipped); lintrunner clean. TestBornRedTestSignal
covers each match / no-act row (incl. recovered-at-head and most-recent-baseline
scoping), gap population, restart + advisor parallelism, the ghstack
single-failure case, bisection_limit throttling, and block- / revert-verdict
with an uncovered gap.

Notes

Cross-model review (DP17, Codex gpt-5.5): SHIP / SHIP-WITH-NITS across the
predicate, gate removal, gap-restart, and recovered-at-head gate; nits
addressed. No advisor-prompt change.

…terleaved gaps)

The strict `[FAIL...][EMPTY...]` `partition_born_red` (from #8107) never fired on
real trunk signals: the newest commits are always in-progress (pending) and the
test isn't observed on every commit (sharding / TD), so empty commits are
interleaved among the failures. Both defeated the contiguous-shape walk — a
month of production `misc.autorevert_state` data shows 0 born-red dispatches
across 11 real born-red episodes (e.g. inductor `test_flip_zero_dim_dynamic_shapes_cuda`,
2026-05-21).

Replace the shape walk with an order-independent set predicate:
- no successes (caller-guaranteed),
- >=1 failing commit; suspect = oldest failing commit,
- >=1 *concluded baseline* commit (`job_group_concluded` and no events) older
  than the suspect — positive proof the test was absent there, as opposed to a
  commit whose jobs are merely still pending and carry no information,
- introduction-boundary gate: defer (return None) if an unconcluded commit sits
  between the suspect and the nearest older baseline (it could itself be the
  real introducer; dispatching on a too-new suspect makes the advisor read the
  wrong diff).

Add a per-commit `SignalCommit.job_group_concluded` flag, set in
`_build_test_signals` (>=1 non-cancelled run for the test's (workflow,
job_base) group, none pending). The >=2-distinct-failing-commits threshold and
advisor dispatch/dedup are unchanged.

Tests: rewrite TestBornRedTestSignal for the new predicate (fires despite
leading pending and interleaved empties; defers on a pending introduction gap;
requires a concluded baseline), add a regression built from the real
flip_zero_dim shape, and a `_build_test_signals` flag test.
@pytorch-bot pytorch-bot Bot added the ci-no-td label Jun 16, 2026
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 16, 2026
@vercel

vercel Bot commented Jun 16, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
torchci Ignored Ignored Preview Jun 18, 2026 6:48pm

Request Review

Ivan added 5 commits June 16, 2026 13:50
…tion-boundary defer

The introduction-boundary gate deferred (returned None) whenever an
unconcluded/pending commit sat between the suspect and the nearest older
concluded baseline, to avoid dispatching the advisor on a too-new suspect.
On the agreed semantics that is too conservative: for the common real shape

    P P P F F P P P C P C   (newest -> oldest; C = concluded, no test events)

the gate hit the first pending below the oldest failure and deferred, so the
advisor never fired even though there is a concluded baseline proving the test
was absent.

Drop the gate: dispatch on the oldest *observed* failure now. Pending commits
are ignored wherever they sit (head or introduction gap). If a pending gap
commit later concludes to a failure older than the current suspect, the
suspect moves and a second advisor runs on the corrected commit; the stale
verdict on the too-new suspect is keyed to that commit (dedup/caps in
SignalActionsLogger are per-(commit, signal_key)) and is never consulted for
the corrected suspect. On real autorevert_state data this costs <=1 superfluous
advisor run per episode, usually 0. Accepted precision/latency tradeoff: a
too-new suspect almost always returns not_related, bounded by the advisor
confidence gate.

Tests: flip test_partition_born_red_defers_on_pending_in_introduction_gap ->
fires (suspect = oldest observed failure); add the full P P P F F P P P C P C
regression.
…k coverage)

When a ghstack of multiple commits lands and introduces an already-broken
test, `push` only schedules jobs for the stack head, so the stack-middle
commits are *unrun* (no jobs, no events) — never observable on their own.
Born-red could see the head fail but had no way to bisect which commit in the
stack actually introduced the test.

Reuse the green->red gap-closure machinery for the born-red path:

- partition_born_red now populates `unknown` with the introduction gap:
  commits strictly between the suspect (oldest observed failure) and the newest
  concluded baseline. Unrun commits there are bisection candidates; pending
  commits are covered separators.
- _handle_no_successes calls cover_gap_unknown_commits(bisection_limit) (the
  same GapBisectionPlanner the main path uses) and returns RestartCommits to
  schedule the bisection/linear coverage. This fires independent of the
  >=2-failure advisor gate — the ghstack case starts with a single observed
  failure and needs the restarts to gather coverage.
- Option A: the advisor dispatches on the oldest observed failure *in parallel*
  with the gap restarts (RestartCommits carries the DispatchAdvisor). A
  not_related/garbage verdict on the suspect doesn't suppress the signal while
  the gap still has uncovered commits — we keep covering it.

Thresholds unchanged: >=1 concluded baseline older than the suspect, >=2
distinct failing commits before advisor dispatch.

Tests: gap population (unrun + pending), process-level restart-with-advisor,
the ghstack single-failure case, pending-only gap (no restart), bisection-limit
throttling, mixed pending/unrun, and block-/revert-verdict + uncovered gap
(the latter documents the accepted Option-A revert-side tradeoff). Full lambda
suite green (208), lintrunner clean.
The motivation/docstring claimed empty commits are interleaved among the
failures due to sharding / TD — that's wrong: sharding and TD do not affect
which tests are recorded per commit. The real-world interleaving we observed is
*pending* commits (jobs not finished yet) sitting between the failures and the
older baseline. And the sources of an *unrun* gap commit (no job result) are: a
job never scheduled (ghstack lands a stack, `push` only runs the stack head),
the test did not run, or it crashed without reporting. Doc-only.
Describe the current intended behavior + rationale concisely; drop the
narrative of how the detector evolved. Doc-only.
Scope the partition to the most-recent concluded baseline (inclusive) and
everything newer. A concluded baseline (job group ran, test absent) NEWER than
the failures means the test was removed / disabled again — recovered — so the
stale failures below it must not trigger a born-red dispatch. `has_successes()`
still checks the full window (any success anywhere disqualifies born-red).

This is the born-red analog of detect_fixed()'s recovered-at-head guard, which
can't catch this case because born-red has no success event to key off.

Consequence: a concluded baseline BETWEEN two failures is now a recovery
boundary, not a 'test not observed here' gap — the partition scopes to the
newer episode. (Real-world interleaving between failures is *pending* commits,
which are unaffected.)

Tests: add recovered-at-head (None) and scoped-to-most-recent-baseline; update
canonical / pending-head-and-gap to the single most-recent baseline; fix the
real-flip-shape regression to use a pending interleaved commit (matches real
data) instead of a concluded baseline. Full lambda suite green (209).
@jeanschmidt

Copy link
Copy Markdown
Contributor

@claude

Can you verify the implementation is correct, point to any edge cases not being handled, find where the code could be improved.

@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown

Claude finished @jeanschmidt's task in 11m 53s —— View job


Review: born-red order-independent detection

I read signal.py, signal_extraction.py, job_agg_index.py, and both test files. Overall this is a careful, well-documented change with strong test coverage — the predicate, scoping, gap-restart, recovered-at-head gate, and ghstack single-failure path are all exercised. The core partition_born_red logic is sound. I found one correctness edge case worth fixing and a few smaller notes.

✅ What's correct

  • partition_born_red scoping. baseline_idx = first (newest) job_group_concluded and not events commit; window = commits[:baseline_idx+1]; suspect_idx = max(i for failing in window). Because baseline_idx is the newest concluded-empty commit and the suspect is the oldest failure newer than it, the gap window[suspect_idx+1:-1] provably contains only pending/unrun commits (no other failures, no other concluded baseline). The layout matches the docstring exactly.
  • Recovered-at-head gate. A baseline newer than all failures makes window hold no failure → None. Consistent with detect_fixed() (which skips the events-less head and returns False on the first failure, so born-red still owns the decision). Good.
  • job_group_concluded ⟺ no pending events holds: group_pending and the PENDING-event append are both driven by meta.is_pending, so a concluded commit never carries a pending event and the not c.events baseline test is exact.
  • cover_gap_unknown_commits correctly treats pending gap commits (which carry a PENDING event) as covered separators and unrun ones (no events) as restart candidates.

⚠️ Edge case: skipped job groups create a false born-red baseline

signal_extraction.py:427 only treats cancelled attempts as missing when computing job_group_concluded:

if meta.is_cancelled:
    continue
group_runs += 1
if meta.is_pending:
    group_pending = True

A fully-skipped job group has is_cancelled=False, is_pending=False, is_skipped=True, and produces no event (JobMeta.statusNone). It therefore increments group_runs without setting group_pending, so the commit ends up job_group_concluded=True with empty events — i.e. it is misread as a born-red baseline witness (positive proof the test was absent).

But a skipped job did not run the test, so it is not proof of absence — job_agg_index.py:43-48 and the test_skipped_attempt_yields_no_event regression both establish the intended semantics: skipped is treated the same as cancelled / "missing". That invariant is broken here for the new flag.

Concrete impact: a commit where the test's required job was skipped (an if: gate, or a required-check skip when an upstream dep failed/cancelled) becomes a baseline. That can (a) manufacture a born-red detection where the only "baseline" is a skipped commit, and (b) wrongly move the scope boundary, scoping out real older failures. Low frequency, but it dispatches the advisor / a revert on a false premise.

Suggested fix — mirror the existing cancelled handling:

if meta.is_cancelled or meta.is_skipped:
    # cancelled/skipped attempts are missing — neither counts as the group
    # "concluding" (a skipped job never ran the test, so it is not proof the
    # test was absent).
    continue

A targeted extraction test (skipped group must yield job_group_concluded=False) would lock this in. Fix this →

Minor notes (non-blocking)

  1. Option-A trade-off (acting on the oldest observed failure even if an unrun gap commit below could be a truer introducer) is well-reasoned and documented at signal.py:714-722. The bound (advisor only returns revert when it sees the introduction in that diff + the confidence gate) is reasonable, and the per-(commit, signal) verdict keying means a later-concluding older commit just relocates the suspect. Calling out explicitly that the stale verdict on the too-new commit is never reused is good — no change needed.

  2. Block-verdict fall-through (signal.py:751-757): when the suspect already has a not_related/garbage verdict but the gap still has uncovered commits, the method drops the block and returns RestartCommits(advisor=None) to keep bisecting. Intentional per the docstring, but it does mean a garbage 2h-suppression is silently discarded for that tick. Acceptable since no new advisor is dispatched; worth a one-line inline comment noting the suppression is deliberately not surfaced while the gap is open.

  3. Readability: partition_born_red returns successful=[self.commits[baseline_idx]] while the green→red path uses a multi-commit successful list. The single-element list works with successful[0], but a short comment that born-red intentionally carries exactly one baseline (already implied) would help the next reader reconcile the two partition producers.

Test plan

I could not execute the suite here (python/pytest are outside the permitted command set in this environment), so I verified by reading. The PR states the full lambda suite is green (209, 1 skipped) and lintrunner is clean; the born-red and extraction tests I read line-by-line are consistent with the implementation. If you apply the skipped-group fix, please re-run test_signal_extraction.py and add the suggested skipped-baseline case.


Summary: logic is correct for the cancelled/pending/unrun cases it models; the one real gap is that skipped job groups aren't excluded from job_group_concluded, which can fabricate a baseline. Recommend the small signal_extraction.py:427 fix + a regression test before merge; the rest are optional polish.
· branch iz/autorevert-born-red-set-predicate

@jeanschmidt

Copy link
Copy Markdown
Contributor

I am not sure if the case that claude discussed is a case we can actually experience on pytorch, but yeah, a test that runs on a conditional job might be an edge case we're not handling. Do you think this is something we should be handling?

Ivan added 3 commits June 17, 2026 15:35
…luded

A fully-skipped job group (is_skipped: an `if:` gate, or a required-check skip
when an upstream dependency failed/cancelled) produces no event and
JobMeta.status -> None — it is 'missing', same as cancelled. But the
job_group_concluded computation only excluded is_cancelled, so a skipped group
incremented group_runs without pending and became a concluded-empty commit =
a false born-red baseline witness (positive 'proof' the test was absent when it
simply was never run there).

Mirror the existing cancelled handling: treat cancelled OR skipped as missing.
Skipped test jobs are pervasive on trunk (conditionally-gated configs), so this
otherwise fabricates baselines / mis-scopes the partition on many commits.

Found by the PR's Claude review. Adds test_job_group_concluded_excludes_skipped_group.
Full lambda suite green (210).
…ude workflow_dispatch)

A born-red baseline ('test was genuinely absent here') must be proven by a
NATURAL (push/schedule) run. Autorevert's own restarts are workflow_dispatch
runs that are job/test-filtered: the job concludes but runs only the filtered
test(s), so for any OTHER test sharing the (workflow, job_base) such a run
concludes with no event — which the v2 extractor was counting as a false
'test absent' baseline. (The old autorevert_checker already excluded
workflow_dispatch; the v2 datasource query did not.)

Thread workflow_event through JobRow -> JobMeta.is_workflow_dispatch, and in
_build_test_signals exclude workflow_dispatch runs from group_runs/group_pending
so they cannot establish job_group_concluded. Dispatch runs STILL emit their
outcome/pending events, so a gap-restart that the test fails still surfaces a
FAILURE and moves the suspect, and an in-flight restart still acts as a covered
separator (no duplicate restart).

Bisection still converges: failures move the suspect via emitted events; a
dispatch that confirms absence no longer shrinks the gap (it used to via the
false baseline), so that commit stays restart-eligible, but churn is hard-bounded
by the existing per-(workflow,commit) restart cap (5) + 20min pacing + 20-60min
backoff. Recovered-at-head is also no longer falsely triggered by a filtered
dispatch concluded-empty at the head.

Tests: test_workflow_dispatch_run_does_not_establish_baseline (dispatch-only
concluded + no event -> not a baseline -> not born-red) and
test_workflow_dispatch_failure_still_emits_event (dispatch failure still emits,
still doesn't conclude the group). Full lambda suite green (212).

Addresses the second half of the PR's Claude review + Ivan's dispatch-filter point.
…atch_run)

After excluding workflow_dispatch runs from establishing a baseline, a
probed-but-absent gap commit looked identical to never-run (no events,
job_group_concluded=False), so cover_gap re-selected it every tick and the
same commit got re-dispatched up to the soft action-layer cap (5). That churn
is not acceptable — a gap commit should be probed at most once.

Record the presence of a restart on the commit explicitly: new
SignalCommit.has_dispatch_run, set in _build_test_signals when a non-cancelled/
non-skipped workflow_dispatch run exists for the test's job group (pending or
concluded). cover_gap_unknown_commits now treats such a commit as a covered
separator (alongside pending), so it is never re-selected for restart:

    covered = [bool(c.events) or c.has_dispatch_run for c in self.unknown]

A first probe that finds the test failing still emits a FAILURE event (the
commit leaves the gap and moves the suspect); a probe that finds it absent
emits no event but has_dispatch_run keeps it covered — so exactly one dispatch
per gap commit. Cancelled/skipped dispatch runs are ignored (they did not run),
so a genuinely-failed dispatch can still be retried. has_dispatch_run is
job-group level, so a sibling test's restart on the same commit also marks it
probed — at worst an under-probe (suspect stays too-new), never a false
baseline or wrong revert (no event/baseline is created).

Tests: cover_gap skips an already-dispatched gap commit and restarts only the
unprobed one; all-dispatched gap → no restart, advisor still dispatches;
extraction sets has_dispatch_run from a workflow_dispatch run (False for
natural-only); replace()/sentinels updated. Full lambda suite green (215).

Addresses Ivan's 'prevent multiple dispatches on the same commit'.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-no-td CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants