v0.42.54.0 feat(facts): durable extraction — recovery phase + intra-page resume + enforced deadline#2502

Open
danwiggins wants to merge 2 commits into garrytan:master from danwiggins:durable-fact-extraction
@danwiggins

Problem

Fact extraction could be silently dropped on a real-world install. Real-time extraction (runFactsBackstop queue mode) runs fire-and-forget on the in-memory FactsQueue; when the writing process exits before that work settles, background-work shutdown aborts the in-flight chat call. The page persists with zero facts and nothing retries it. The durable catch-up that should rescue it (conversation_facts_backfill) only handles chat-shaped pages, is opt-in (off by default), and, even when on, wiped the whole page at the start of every run, so a page too large for one cycle's budget could never finish.

Observed on a live brain: meeting/Slack pages landing with an aborted facts:absorb row in ingest_log and never recovering; the autopilot cycle reporting facts_consolidated: 0.

Fix

Three changes:

  1. realtime_absorb_recovery cycle phase (default-on, bounded). Treats unresolved facts:absorb failure rows in ingest_log as a durable backlog (no new table), re-runs the pipeline inline (which carries the existing 0.95 cosine dedup, so it's idempotent), and appends a facts:absorb-recovered tombstone on success. Bounded by page cap (25), cost cap ($0.25), and a wall-clock deadline (240s); kill switch cycle.realtime_absorb_recovery.enabled=false. Recovers narrative pages the chat-shaped backfill can't parse.

  2. Cursor-only-on-confirmed-write + intra-page resume in extract-conversation-facts.ts. The per-page checkpoint now carries a row_num watermark and advances per segment on confirmed commit; deleteOrphanFactsForPage is scoped to the uncommitted tail (row_num >= watermark). A page that can't finish in one cycle's budget makes monotonic forward progress instead of wipe→re-extract→re-exhaust forever. A swallowed insert/extract failure no longer advances the cursor past unwritten facts. Legacy checkpoints (no watermark) force one safe full re-extract on upgrade.

  3. Enforced per-source wall-clock deadline + budget alignment. The per-source cap was read from config but never passed to the worker, and the brain-wide cap is only checked between sources, so a single-source brain had no wall-clock ceiling and a long drain could blow the autopilot job timeout. The cap is now passed and enforced; defaults are lowered (4 min/source, 6 min total) to sit under the ~600s job window. Also fixed a fullyProcessed off-by-one that skipped the terminal row (livelock) when a page had exactly segmentLimit segments.
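
To make the resume and deadline behavior in changes 2 and 3 concrete, here is a minimal in-memory sketch. It is not the code in extract-conversation-facts.ts: `runPage`, `Fact`, `Checkpoint`, and `ExtractFn` are hypothetical names, and it assumes one fact per segment so that `row_num` equals the segment index. It shows the tail-scoped delete (`row_num >= watermark`), the watermark advancing only on a confirmed write, and a wall-clock check before each segment.

```typescript
// Hypothetical shapes for illustration only; the real types live in the PR.
interface Fact { rowNum: number; text: string }
interface Checkpoint { watermark: number } // first row_num not yet committed
interface RunResult { checkpoint: Checkpoint; fullyProcessed: boolean }

// Returns null to model a swallowed extract/insert failure.
type ExtractFn = (segment: string) => string | null;

function runPage(
  segments: string[],
  store: Fact[],
  checkpoint: Checkpoint,
  extract: ExtractFn,
  deadlineMs: number,
  now: () => number = Date.now,
): RunResult {
  let watermark = checkpoint.watermark;

  // Scoped orphan delete: drop only the uncommitted tail and keep the
  // committed prefix, instead of wiping the whole page at the start of a run.
  const kept = store.filter((f) => f.rowNum < watermark);
  store.length = 0;
  store.push(...kept);

  for (const segment of segments.slice(watermark)) {
    if (now() >= deadlineMs) break; // enforced wall-clock ceiling
    const text = extract(segment);
    if (text === null) break; // failure: cursor must not pass unwritten facts
    store.push({ rowNum: watermark, text }); // confirmed write
    watermark += 1; // advance only after the write succeeded
  }

  // Complete only when every segment, including the terminal one, is committed.
  return { checkpoint: { watermark }, fullyProcessed: watermark >= segments.length };
}
```

Calling `runPage` again with the returned checkpoint continues from the watermark, so a page that exceeds one cycle's budget finishes over several cycles rather than restarting from zero each time.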

Review + verification

  • Dual independent review (Claude + Codex) on the plan caught 4 wrong assumptions before build; a Codex crash-safety review of the resume rework caught 2 real bugs (a legacy-checkpoint data-loss path and the segmentLimit livelock) — both fixed with regression tests.
  • New tests: insert-failure regression, intra-page resume (partial → resume, no re-wipe, row_num continuity), checkpoint encoding (legacy vs watermarked), segmentLimit-exact completion, and 6 recovery-phase tests (recover + tombstone, idempotency, per-page-failure leaves un-tombstoned, page-gone, kill switch, default-on). Phase-count fixtures updated.
  • Verified live on a single-source brain: the recovery phase cleared a real backlog (one meeting page + several conversation pages, ~30 facts) and tombstoned a dead probe page, with the autopilot stable across restart.
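
For reviewers who want a picture of the legacy-versus-watermarked checkpoint cases the tests cover, the sketch below shows one way to decode them. The JSON encoding and the names `decodeCheckpoint` and `encodeCheckpoint` are assumptions made for illustration; the PR's actual checkpoint format is not shown here. The idea it illustrates is that a checkpoint without a valid watermark decodes to watermark 0, so a delete scoped to `row_num >= watermark` covers the whole page and the next run is one full re-extract.

```typescript
// Hypothetical decoded form; the real checkpoint may carry more fields.
interface DecodedCheckpoint { watermark: number; legacy: boolean }

function decodeCheckpoint(raw: string): DecodedCheckpoint {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    parsed = null; // not JSON at all: treat as a legacy checkpoint
  }
  if (typeof parsed === "object" && parsed !== null) {
    const wm = (parsed as { watermark?: unknown }).watermark;
    if (typeof wm === "number" && Number.isInteger(wm) && wm >= 0) {
      return { watermark: wm, legacy: false };
    }
  }
  // No usable watermark: start from row 0, which forces a full re-extract.
  return { watermark: 0, legacy: true };
}

function encodeCheckpoint(watermark: number): string {
  return JSON.stringify({ watermark });
}
```

Falling back to 0 rather than guessing a position is the conservative choice: a legacy checkpoint cannot say which rows were committed, so nothing is assumed to be safe to keep.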

Engine-parity: no new engine methods; the backlog query is portable executeRaw.
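
As a rough model of how the recovery phase treats ingest_log as its backlog, here is an in-memory sketch. It does not use the real schema or executeRaw: the `LogRow` fields, `unresolvedBacklog`, `runRecovery`, and the synchronous `reExtract` callback are all hypothetical, and the cap values are the defaults quoted in this description. It also assumes the page cap counts attempted pages and that any tombstone for a page resolves all of that page's failure rows.

```typescript
// Hypothetical row shape; the real ingest_log columns are not shown in this PR text.
interface LogRow { pageId: string; kind: string; failed: boolean }
interface Caps { maxPages: number; maxCostUsd: number; wallClockMs: number }

const DEFAULT_CAPS: Caps = { maxPages: 25, maxCostUsd: 0.25, wallClockMs: 240000 };

// Backlog = pages with a failed facts:absorb row and no recovered tombstone.
function unresolvedBacklog(log: LogRow[]): string[] {
  const tombstoned = new Set(
    log.filter((r) => r.kind === "facts:absorb-recovered").map((r) => r.pageId),
  );
  const seen = new Set<string>();
  const pages: string[] = [];
  for (const r of log) {
    if (r.kind !== "facts:absorb" || !r.failed) continue;
    if (tombstoned.has(r.pageId) || seen.has(r.pageId)) continue;
    seen.add(r.pageId);
    pages.push(r.pageId);
  }
  return pages;
}

function runRecovery(
  log: LogRow[],
  reExtract: (pageId: string) => { ok: boolean; costUsd: number },
  caps: Caps = DEFAULT_CAPS,
  enabled: boolean = true, // stands in for cycle.realtime_absorb_recovery.enabled
  now: () => number = Date.now,
): string[] {
  if (!enabled) return []; // kill switch
  const start = now();
  let spentUsd = 0;
  const recovered: string[] = [];
  for (const pageId of unresolvedBacklog(log).slice(0, caps.maxPages)) {
    if (spentUsd >= caps.maxCostUsd || now() - start >= caps.wallClockMs) break;
    const result = reExtract(pageId);
    spentUsd += result.costUsd;
    if (!result.ok) continue; // leave un-tombstoned so a later cycle retries
    log.push({ pageId, kind: "facts:absorb-recovered", failed: false });
    recovered.push(pageId);
  }
  return recovered;
}
```

Because the tombstone is just another appended row, running the phase twice over the same log does no extra work for pages that already recovered, which is the property that lets the backlog live in ingest_log without a new table.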

danwiggins and others added 2 commits June 30, 2026 03:25
…+ enforced deadline

- realtime_absorb_recovery cycle phase (default on, bounded): recovers pages
  whose real-time fact extraction was dropped on process exit, by re-running
  the pipeline inline (idempotent via existing dedup) and tombstoning the
  ingest_log failure record. No new table.
- cursor-only-on-confirmed-write: a swallowed insert/extract failure no longer
  advances the resume cursor past unwritten facts.
- intra-page resume: per-page checkpoint carries a row_num watermark, advances
  per segment on confirmed commit; scoped delete-orphans preserves the
  committed prefix so a large page makes monotonic forward progress.
- enforced per-source wall-clock deadline + budgets aligned to the autopilot
  job window.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>