fix(auto-fix): refuse full workflow restart when expensive work completed #99
Conversation
… never fires)

The NODE_OPTIONS register script monkeypatches `child_process.spawn` so that agent-relay-broker children have their stdout drained — preventing the broker from blocking on write() once the OS pipe buffer fills.

The previous wiring attached a 'data' listener inside a callback that only ran when child.stdout emitted 'pause'. Node Readable streams never emit 'pause' on internal buffer fill — that event only fires when something explicitly calls .pause() (which nothing in this code path ever does) — so the stream stayed in paused mode, libuv stopped draining the kernel pipe at the high-water mark, and a chatty broker would block in write() once ~64KB of stdout queued up.

Symptom: overnight proactive-runtime runs (Ricky-driven, M1 fans out to 9 PTY workers) froze within seconds of fanout, with every worker log stuck at the same mtime, the broker process parked in write() or _pthread_cond_wait, and M1's step.run awaiting a never-arriving drain signal. Reproduced twice (~14h apart) with diagnostic bundles capturing the same shape.

Changes
- Attach a `data` listener and call `resume()` synchronously at spawn time for both `init` and `pty` broker invocations. This matches what SDK 6.0.15's `drainBrokerStdoutAfterStartup` does for direct SDK consumers.
- Expand the argv guard from `argv[0] === "init"` to also include `"pty"`, so per-worker PTY brokers (M1's lead + impl-* fanout) are protected, not just the channel-multiplexer init broker.
- Update the explanatory comment block above `registerSource` to capture the new semantics and the prior bug.

Verification
- npm run typecheck — clean
- npm test — 1077/1077 pass, including the existing "drains broker stdout after SDK startup so event floods cannot wedge the workflow node" regression at entrypoint.test.ts:3122.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
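For readers who want the mechanics, here is a minimal sketch of the shape of that fix. It is illustrative only, not the actual register script: the wrapper structure and the assumption that broker children are identified purely by `argv[0]` are mine.

```ts
// Hypothetical sketch — not the real NODE_OPTIONS register script.
import child_process from 'node:child_process';

const originalSpawn = child_process.spawn;
(child_process as any).spawn = function (command: string, args?: readonly string[], options?: object) {
  const child = originalSpawn.call(this, command, args as string[], options as any);
  // The guard now covers both broker modes ("init" and "pty"), per the commit.
  if (Array.isArray(args) && (args[0] === 'init' || args[0] === 'pty')) {
    // Attached synchronously at spawn time: the 'data' listener flips the
    // stream into flowing mode, and resume() un-pauses it if anything pauses
    // it later — so libuv keeps draining the kernel pipe and the broker
    // never blocks in write() at the ~64KB high-water mark.
    child.stdout?.on('data', () => {});
    child.stdout?.resume();
  }
  return child;
};
```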
…ve work completes

When Ricky's auto-fix loop could not parse a concrete `failedStep` from the workflow's evidence, the retry path silently omitted `--start-from`. The agent-relay SDK runner treats `--previous-run-id` without `--start-from` as "no resume" (run.ts:91) and starts the workflow over from step 0.

For the proactive-runtime M2/M3 chain runs this meant: after `signoff-r1` emitted "SIGNOFF: COMPLETE", the downstream gate or fixer step would occasionally fail in a way that left no per-step failure record, and Ricky would respond by re-spawning every `impl-*` track from scratch — burning ~1.5–2 hours per cycle on duplicated work that the prior signoff had already certified.

The new policy: if the prior attempt completed at least one real (non-synthetic) step and we have no step-level resume anchor, escalate to the operator instead of restarting. Synthetic-stage failures (runtime-launch / local-runtime / runtime-precheck) still retry as before, because no expensive work happened. Operators who want the old behavior can set `RICKY_AUTO_FIX_ALLOW_FULL_RESTART=1`.

The refusal is wired into all three retry paths that previously allowed an undefined `failedStep`:
- workflow-repair retry (post-persona / post-deterministic repair)
- workflow-repair-provider-failure retry
- direct-repair retry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
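For context, the resume semantics the commit attributes to the SDK runner (run.ts:91) amount to something like the following. This is a paraphrase of the described behavior, not the runner's actual code; the function and field names are invented.

```ts
// Paraphrase of the described run.ts behavior — names are hypothetical.
interface RetryArgs {
  previousRunId?: string; // --previous-run-id
  startFrom?: string;     // --start-from
}

function resolveResumePlan(args: RetryArgs) {
  if (args.previousRunId && args.startFrom) {
    // Both flags present: resume the prior run at the anchored step.
    return { kind: 'resume' as const, runId: args.previousRunId, step: args.startFrom };
  }
  // --previous-run-id alone is treated as "no resume": the workflow starts
  // over from step 0 — exactly the expensive restart this PR's gate refuses.
  return { kind: 'fresh' as const };
}
```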
⚠️ Review failed: the pull request is closed.
📝 Walkthrough

This PR adds a safety gate preventing automatic full-restart retries when prior workflow attempts have already completed real steps but lack a step-level resume anchor, enriches evidence with parsed per-step data and a `completedStepCount`, integrates the gate into multiple `runWithAutoFix` retry decision points, and adds tests and fixtures validating escalation and the env override.

Changes: auto-fix retry safety and evidence parsing
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 3 failed (3 warnings)
Resolves conflict in src/local/entrypoint.ts (broker stdout drain): took main's superset implementation that handles both the immediate flow-mode kick AND the SDK readline-close pause path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration: Organization UI | Review profile: CHILL | Plan: Pro Plus | Run ID: cd4cddbd-6605-4642-8fe0-7c11463f500f
📒 Files selected for processing (3):
- src/local/auto-fix-loop.test.ts
- src/local/auto-fix-loop.ts
- src/local/entrypoint.ts
```ts
function workflowDidExpensiveWork(evidence: WorkflowRunEvidence): boolean {
  return evidence.steps.some(
    (step) => step.status === 'passed' && !isSyntheticStageId(step.stepId),
  );
}
```
This expensive-work detector still has a long-run false negative.
`workflowDidExpensiveWork()` only looks at `evidence.steps`, but later in this file those steps are reconstructed from `execution.evidence.logs.tail`. On chatty workflows, the tail can omit every earlier `✓ … — completed` line while still failing to expose a concrete `failedStep`, which makes this return false and allows the full restart anyway. To make the refusal reliable, carry an authoritative "real steps completed" signal/count into the response evidence, or rehydrate from the persisted full logs before allowing a no-`startFromStep` retry.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/local/auto-fix-loop.ts` around lines 137 - 140, workflowDidExpensiveWork
currently inspects only evidence.steps (function workflowDidExpensiveWork, type
WorkflowRunEvidence) which can be truncated when execution.evidence.logs.tail is
chatty and causes false negatives; update the check to use an authoritative
source before allowing a retry: either require and read a persisted full log or
an explicit completed-steps signal carried on the response evidence (e.g.,
add/read evidence.completedStepCount or evidence.fullLogs and use that to
determine if any real non-synthetic step completed), and ensure the retry path
that looks at failedStep / startFromStep consults that authoritative field (or
rehydrate full logs from the log store) instead of relying solely on
execution.evidence.logs.tail and evidence.steps.
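The walkthrough above says the PR enriches evidence with a `completedStepCount`; a sketch of how the detector could prefer that authoritative field over tail-derived steps might look like the following. The optional-field shape and the fallback ordering are assumptions, not code from the diff.

```ts
// Sketch only — assumes WorkflowRunEvidence gains an optional
// completedStepCount populated from the persisted full log.
function workflowDidExpensiveWork(evidence: WorkflowRunEvidence): boolean {
  // Prefer the authoritative count when present; it survives even when
  // logs.tail has scrolled past the earlier "completed" lines.
  if (typeof evidence.completedStepCount === 'number') {
    return evidence.completedStepCount > 0;
  }
  // Fallback: the original tail-derived check.
  return evidence.steps.some(
    (step) => step.status === 'passed' && !isSyntheticStageId(step.stepId),
  );
}
```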
```ts
function safeToRetryWithoutStartFromStep(
  failedStep: string | undefined,
  evidence: WorkflowRunEvidence,
): boolean {
  if (failedStep) return true;
  if (fullRestartExplicitlyAllowed()) return true;
  return !workflowDidExpensiveWork(evidence);
}
```
Apply this gate to the code-drift retry path too.
`safeToRetryWithoutStartFromStep()` is only consulted in the workforce/direct/provider-failure flows. The earlier code-drift branch still rebuilds a root restart request when `failedStep` is missing, so a drift-report workflow can still re-run every completed real step — the same expensive-restart case this PR is trying to block.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/local/auto-fix-loop.ts` around lines 149 - 155, The code-drift
(drift-report) retry path currently rebuilds a root restart request when
failedStep is missing without consulting safeToRetryWithoutStartFromStep; update
the drift-report/code-drift retry branch to call
safeToRetryWithoutStartFromStep(failedStep, evidence) and only perform the
expensive root-restart rebuild if that returns true (otherwise skip the
rebuild/abort the retry or return a non-restart outcome); use the same gating
logic as the workforce/direct/provider-failure flows so that drift retries are
blocked when workflowDidExpensiveWork(evidence) or
fullRestartExplicitlyAllowed() indicate it’s unsafe.
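A sketch of the gating this prompt asks for, reusing the gate from the diff. The surrounding branch structure and the helpers `escalateToOperator` / `buildRootRestartRequest` are placeholders, not code from this PR:

```ts
// Hypothetical drift-report branch — only safeToRetryWithoutStartFromStep
// and FULL_RESTART_REFUSAL_REASON come from the actual diff.
if (isDriftReportRetry && !failedStep) {
  if (!safeToRetryWithoutStartFromStep(failedStep, evidence)) {
    warnings.push(FULL_RESTART_REFUSAL_REASON);
    return escalateToOperator(); // placeholder for a non-restart outcome
  }
  // Safe: no expensive real step completed, so a root restart is cheap.
  currentRequest = buildRootRestartRequest(evidence); // placeholder rebuild
}
```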
```ts
if (!safeToRetryWithoutStartFromStep(failedStep, evidence)) {
  attemptSummary.fix_error = FULL_RESTART_REFUSAL_REASON;
  warnings.push(FULL_RESTART_REFUSAL_REASON);
  const escalated = withAutoFix(response, maxAttempts, attempts, attemptSummary.status, warnings, trackingRunId);
  escalated.nextActions = [
    ...escalated.nextActions,
    FULL_RESTART_REFUSAL_REASON,
    debuggerResult.summary,
    ...debuggerResult.recommendation.steps.map((step) => step.description),
  ];
  attachEscalationOptions(escalated, {
    request: currentRequest,
    response,
    debuggerResult,
    reason: FULL_RESTART_REFUSAL_REASON,
    trackingRunId,
    artifactPath: repairedArtifactPath,
  });
  return escalated;
}
```
Avoid emitting a “retrying…” warning on the escalation path.
If `runId` is missing, the block just above this one already appends "Auto-fix retry could not resolve a previous run id; retrying without step-level resume." When this guard then refuses the full restart, the response contains contradictory guidance: Ricky says it is retrying and refusing to retry in the same result. Move the missing-run-id warning behind this check, or only emit it when `safeToRetryWithoutStartFromStep()` is true.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/local/auto-fix-loop.ts` around lines 424 - 443, The code emits a
"retrying..." missing-run-id warning even when you immediately refuse a full
restart; update the logic so the missing-run-id warning is only added when
safeToRetryWithoutStartFromStep(failedStep, evidence) returns true (or move the
missing-run-id warning emission to after that guard). Specifically, adjust where
you push the "Auto-fix retry could not resolve a previous run id; retrying
without step-level resume." into warnings (or set attemptSummary) so it is
conditional on safeToRetryWithoutStartFromStep(failedStep, evidence) before
calling withAutoFix(...) and attachEscalationOptions(...); ensure
FULL_RESTART_REFUSAL_REASON, attemptSummary.fix_error, withAutoFix,
attachEscalationOptions, and debuggerResult.summary/recommendation.steps remain
unchanged otherwise.
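One way to apply the reordering the comment asks for: evaluate the gate once, then emit the missing-run-id warning only on the branch that actually retries. The variable names around the gate are assumed from the snippet above, not copied from the diff:

```ts
// Sketch: gate first, warn only when a retry will actually happen.
const canRetryWithoutAnchor = safeToRetryWithoutStartFromStep(failedStep, evidence);
if (!canRetryWithoutAnchor) {
  attemptSummary.fix_error = FULL_RESTART_REFUSAL_REASON;
  warnings.push(FULL_RESTART_REFUSAL_REASON);
  // ...escalation path from the snippet above, unchanged...
} else if (!runId) {
  warnings.push(
    'Auto-fix retry could not resolve a previous run id; retrying without step-level resume.',
  );
}
```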
Ricky Eval Review

Run: Passed: 8 | Needs human: 43 | Reviewable: 43 | Missing output: 0 | Failed: 0 | Skipped: 0

Human Review Cases

These cases passed deterministic checks and include captured Ricky output for a human verdict against their REVIEW
Summary
- When the auto-fix retry could not parse a concrete `failedStep` from evidence, it omitted `--start-from`; the SDK runner treats `--previous-run-id` without `--start-from` as a brand-new run and re-spawned every `impl-*` track
- Synthetic-stage failures (`runtime-launch`, `local-runtime`, `runtime-precheck`), where no expensive work has happened, still retry as before
- `RICKY_AUTO_FIX_ALLOW_FULL_RESTART=1` opts back into the old behavior for small workflows where a full restart is genuinely cheap

Why this happened
Both the M2 and M3 chain runs (proactive-runtime, 7 repos, ~$1k+ of Claude/Codex compute per cycle) hit this.
`signoff-r1` would emit `SIGNOFF: COMPLETE` with a full multi-page audit, the downstream `fix-r2` / `signoff-final` / `commit-push-*` step would fail in a way that left `evidence.failed_step` empty, and Ricky's three retry paths in `auto-fix-loop.ts` (lines 357, 382, 470) all silently dropped `startFromStep`. The next attempt got a fresh `runId` and ran the entire DAG from the top. After the third loop, M2 was killed manually after ~5 hours.

What changed
- `src/local/auto-fix-loop.ts` — added `workflowDidExpensiveWork`, `safeToRetryWithoutStartFromStep`, `fullRestartExplicitlyAllowed`, `FULL_RESTART_REFUSAL_REASON`; wired into all three retry sites
- `src/local/auto-fix-loop.test.ts` — added an `expensiveWorkBlockerResponse` helper and two regression tests covering both the refusal path and the env-var bypass

Test plan
- `npx vitest run src/local/auto-fix-loop.test.ts` — 32/32 pass (was 30/30); a sketch of the new tests follows this list
- `npx tsc --noEmit` — clean
- `npx vitest run` — 1077/1079 (the same 2 pre-existing flaky integration tests fail on main)
- A synthetic-stage failure (`runtime-launch` etc.) still restarts correctly — this is preserved on purpose
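A sketch of what the two regression tests could look like. The `expensiveWorkBlockerResponse` helper exists per the diff summary, and `runWithAutoFix` / `FULL_RESTART_REFUSAL_REASON` appear elsewhere in this PR, but their signatures, the import path, and the exact assertions here are assumptions:

```ts
import { afterEach, expect, it, vi } from 'vitest';
// Hypothetical imports — the real module exports may differ.
import { runWithAutoFix, FULL_RESTART_REFUSAL_REASON } from './auto-fix-loop';
import { expensiveWorkBlockerResponse } from './auto-fix-loop.test-helpers';

afterEach(() => vi.unstubAllEnvs());

it('escalates instead of restarting when expensive work completed', async () => {
  // Helper added by this PR: evidence with passed real steps, no failedStep.
  const response = expensiveWorkBlockerResponse();
  const result = await runWithAutoFix(response);
  expect(result.nextActions).toContain(FULL_RESTART_REFUSAL_REASON);
});

it('allows the full restart when the operator opts in via env var', async () => {
  vi.stubEnv('RICKY_AUTO_FIX_ALLOW_FULL_RESTART', '1');
  const response = expensiveWorkBlockerResponse();
  const result = await runWithAutoFix(response);
  expect(result.nextActions).not.toContain(FULL_RESTART_REFUSAL_REASON);
});
```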
Operational note

The next time a long workflow hits this failure mode with this branch installed, Ricky surfaces a clear refusal message pointing the operator at the debugger summary and the recommended next steps.
Cached step outputs live at `.agent-relay/step-outputs/<runId>/`, so manual resume-from-any-step works.

🤖 Generated with Claude Code