riverqueue
diff --git a/internal/leadership/doc.go b/internal/leadership/doc.go
@@ -0,0 +1,188 @@
+// Package leadership implements leader election for River clients sharing a
+// database schema. The database records at most one current leadership term at
+// a time; the elected client runs distributed maintenance operations (queue
+// management, job scheduling, reindexing) that should not be duplicated across
+// clients.
+//
+// # Design Principles
+//
+// The database is the source of truth for leadership acquisition and renewal.
+// A single row in the river_leader table represents the current leadership
+// term. Local time is used only to bound how long a process trusts its last
+// successful DB confirmation. Whenever a client is unsure whether it is still
+// the rightful leader, it gives up leadership proactively rather than risk
+// acting as a stale leader after the DB record has stopped reflecting its
+// status.
+//
+// # State Machine
+//
+// The elector alternates between two states in a loop:
+//
+//	                Start()
+//	                  │
+//	                  ▼
+//	┌───────────────────────────────────┐
+//	│          FOLLOWER STATE           │
+//	│        runFollowerState()         │
+//	│                                   │
+//	│  Attempts election on each tick.  │
+//	│  DELETE expired + INSERT ON       │
+//	│  CONFLICT DO NOTHING (in a txn).  │
+//	│  Sleeps between attempts, or      │
+//	│  wakes early on DB notification.  │
+//	└───────────────┬───────────────────┘
+//	                │ won election
+//	                │ (returns leadershipTerm)
+//	                │ publishLeadershipState(true)
+//	                │ signal GainedLeadership
+//	                ▼
+//	┌───────────────────────────────────┐
+//	│           LEADER STATE            │
+//	│         runLeaderState()          │
+//	│                                   │
+//	│  Periodically reelects via        │
+//	│  UPDATE WHERE elected_at matches. │
+//	│  Steps down if trust expires,     │
+//	│  term is replaced, or forced      │
+//	│  resign is received.              │
+//	└───────────────┬───────────────────┘
+//	                │ lost / resigned / error
+//	                │ publishLeadershipState(false)
+//	                │ attemptResignLoop() (best effort)
+//	                │
+//	                └───► back to FOLLOWER
+//
+// # DB-Issued Terms
+//
+// Each leadership term is uniquely identified by the elected_at timestamp
+// assigned by the database when the leader row is inserted. This timestamp is
+// the term token for operations on the river_leader row, analogous to a term
+// number in other leader-election systems.
+//
+// The leader-row DB operations are scoped to the exact term:
+//
+//   - Reelect: UPDATE ... WHERE elected_at = @elected_at AND leader_id = @leader_id
+//   - Resign: DELETE ... WHERE elected_at = @elected_at AND leader_id = @leader_id
+//
+// If another client takes over (producing a different elected_at), all
+// operations for the old term become no-ops. A client can never accidentally
+// reelect or resign a different term than the one it believes it holds.
+//
+// The leadershipTerm struct captures this:
+//
+//	leadershipTerm {
+//	    clientID     ← who holds this term
+//	    electedAt    ← DB-issued timestamp (fencing token)
+//	    trustedUntil ← local deadline after which leader must step down
+//	}
+//
+// # Trust Window
+//
+// After winning or reelecting, the client computes a local trust deadline:
+//
+//	trustedUntil = attemptStarted + TTL - safetyMargin
+//
+// Where:
+//   - attemptStarted is the local wall-clock time when the elect/reelect
+//     call was initiated (before the DB round-trip)
+//   - TTL is electInterval + electIntervalTTLPaddingDefault (5s + 10s = 15s
+//     by default)
+//   - safetyMargin is leaderLocalDeadlineSafetyMargin (default 1s)
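+//
+// As a minimal sketch of that arithmetic, assuming attemptStarted is a
+// time.Time and the other three values are time.Duration (the real
+// declarations may differ):
+//
+//	ttl := electInterval + electIntervalTTLPaddingDefault // 5s + 10s = 15s
+//	trustedUntil := attemptStarted.Add(ttl - leaderLocalDeadlineSafetyMargin) // +14s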
+//
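+// Timeline for one trust window with default timing (5s electInterval, 15s
+// TTL), measured from a single attemptStarted. Each successful reelect starts
+// a new window from its own attemptStarted: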
+//
+//	attemptStarted      reelect 1        reelect 2      trustedUntil   DB expires_at
+//	(local clock)        (+5s)            (+10s)          (+14s)         (+15s)
+//	     │                  │                 │               │             │
+//	     │◄ electInterval ─►│◄ electInterval ►│               │             │
+//	     │                  │                 │◄─ 4s buffer ─►│             │
+//	     │◄──────────── TTL - 1s margin (14s) ───────────────►│             │
+//	     │◄──────────────────── TTL (15s) ────────────────────┼──── 1s ────►│
+//
+// With a longer electInterval, fewer scheduled reelect ticks may fit before
+// trustedUntil, but the trust window still ends safetyMargin before the DB row
+// expires.
+//
+// Key properties:
+//   - trustedUntil is anchored to local time and is never extended beyond
+//     what the DB confirmed via a successful elect or reelect response.
+//   - A successful reelect renews the current term, but trustedUntil is
+//     computed from the moment the reelect was initiated (attemptStarted),
+//     not from when the response arrived. A slow DB response cannot extend
+//     the trust window.
+//   - The 1s safety margin absorbs network latency (time between local
+//     Now() and the DB executing now()).
+//   - The 10s TTL padding absorbs expected clock skew between client and DB.
+//
+// If the client cannot successfully reelect before trustedUntil, it
+// voluntarily steps down, even if the DB might still show it as leader.
+//
+// # Proactive Step-Down
+//
+// The client gives up leadership in the following scenarios:
+//
+//   - Local trust window expires (reelectAttemptTimeout returns 0).
+//   - Reelect returns ErrNotFound, meaning the term was replaced externally
+//     (e.g., expired and won by another client).
+//   - Reelect errors accumulate and the remaining trust window is exhausted.
+//   - A forced resignation is received via DB notification (request_resign).
+//   - The elector's context is cancelled (shutdown).
+//
+// On local step-down, a best-effort resign is usually attempted using
+// context.WithoutCancel to ensure it runs even during shutdown. The exception
+// is when reelect returns ErrNotFound; in that case the DB has already
+// authoritatively said the term is gone, so no resign is attempted. If a
+// best-effort resign fails (e.g., DB unreachable), the TTL on the DB row acts
+// as a safety net: the row expires naturally, allowing a new leader to be
+// elected.
+//
+// # Notification Flow
+//
+// When a DB notifier is available, the elector listens for two event types:
+//
+//   - resigned: Another client resigned leadership. Followers wake up to
+//     attempt election immediately (with small jitter to avoid thundering
+//     herd).
+//   - request_resign: An external request (e.g., from QueueMaintainerLeader
+//     after start failures) asking the current leader to step down. Only
+//     honored if the elector is currently the leader.
+//
+// Both event types send to a single buffered wakeupChan using non-blocking
+// semantics (trySendWakeup). This guarantees the shared notifier goroutine
+// (which serves all notification topics, not just leadership) can never be
+// blocked by the elector. Multiple rapid notifications coalesce into a single
+// wakeup. Timer-based polling at electInterval provides a fallback for any
+// missed notifications or when running without a notifier (poll-only mode).
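+//
+// The non-blocking send follows the standard Go select-with-default idiom. A
+// minimal sketch, assuming wakeupChan is a chan struct{} with capacity 1 (the
+// real element type and capacity may differ):
+//
+//	select {
+//	case wakeupChan <- struct{}{}:
+//	default:
+//		// Buffer full: a wakeup is already pending, so drop this one.
+//	}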
+//
+// # Subscription Relay
+//
+// Consumers subscribe to leadership changes via Listen(), which returns a
+// Subscription. Internally, each subscription uses a subscriptionRelay with
+// a dedicated goroutine that drains an unbounded pending queue into a
+// buffered channel. This design ensures:
+//
+//   - The elector never blocks on slow subscribers when publishing state
+//     changes.
+//   - Every transition is preserved in order (true, false, true, ...).
+//     The primary consumer (QueueMaintainerLeader) needs every false
+//     transition to properly stop maintenance services; dropping transitions
+//     would leave stale services running.
+//
+// Subscribers must call Unlisten() when done to stop the relay goroutine.
+//
+// # Failure Scenarios
+//
+//   - DB temporarily unavailable: Reelect fails, errors accumulate,
+//     trust window expires, client steps down. Followers retry with
+//     exponential backoff.
+//   - Network partition: Same as DB unavailable. Client steps down
+//     proactively.
+//   - Long GC pause or process stall: On resume, the trust window has
+//     already expired, and the client steps down immediately without
+//     attempting another reelect.
+//   - DB failover to a server with a different clock: Local trust is the
+//     binding constraint. The client steps down conservatively.
+//   - Resign fails on shutdown: The DB row's TTL expires naturally,
+//     allowing a new leader to be elected.
+//   - Rapid resign notifications: Coalesced to one wakeup. Timer-based
+//     polling provides the backstop.
+package leadership