| 1 | +// Package leadership implements leader election for River clients sharing a |
| 2 | +// database schema. The database records at most one current leadership term at |
| 3 | +// a time; the elected client runs distributed maintenance operations (queue |
| 4 | +// management, job scheduling, reindexing) that should not be duplicated across |
| 5 | +// clients. |
| 6 | +// |
| 7 | +// # Design Principles |
| 8 | +// |
| 9 | +// The database is the source of truth for leadership acquisition and renewal. |
| 10 | +// A single row in the river_leader table represents the current leadership |
| 11 | +// term. Local time is used only to bound how long a process trusts its last |
| 12 | +// successful DB confirmation. When there is any uncertainty about whether the |
| 13 | +// client is still the rightful leader, it errs on the side of giving up |
| 14 | +// leadership proactively rather than risk operating as a stale leader while |
| 15 | +// the DB record no longer reflects its status. |
| 16 | +// |
| 17 | +// # State Machine |
| 18 | +// |
| 19 | +// The elector alternates between two states in a loop: |
| 20 | +// |
| 21 | +//              Start() |
| 22 | +//                 │ |
| 23 | +//                 ▼ |
| 24 | +// ┌───────────────────────────────────┐ |
| 25 | +// │          FOLLOWER STATE           │ |
| 26 | +// │        runFollowerState()         │ |
| 27 | +// │                                   │ |
| 28 | +// │  Attempts election on each tick.  │ |
| 29 | +// │  DELETE expired + INSERT ON       │ |
| 30 | +// │  CONFLICT DO NOTHING (in a txn).  │ |
| 31 | +// │  Sleeps between attempts, or      │ |
| 32 | +// │  wakes early on DB notification.  │ |
| 33 | +// └───────────────┬───────────────────┘ |
| 34 | +//                 │ won election |
| 35 | +//                 │ (returns leadershipTerm) |
| 36 | +//                 │ publishLeadershipState(true) |
| 37 | +//                 │ signal GainedLeadership |
| 38 | +//                 ▼ |
| 39 | +// ┌───────────────────────────────────┐ |
| 40 | +// │           LEADER STATE            │ |
| 41 | +// │         runLeaderState()          │ |
| 42 | +// │                                   │ |
| 43 | +// │  Periodically reelects via        │ |
| 44 | +// │  UPDATE WHERE elected_at matches. │ |
| 45 | +// │  Steps down if trust expires,     │ |
| 46 | +// │  term is replaced, or forced      │ |
| 47 | +// │  resign is received.              │ |
| 48 | +// └───────────────┬───────────────────┘ |
| 49 | +//                 │ lost / resigned / error |
| 50 | +//                 │ publishLeadershipState(false) |
| 51 | +//                 │ attemptResignLoop() (best effort) |
| 52 | +//                 │ |
| 53 | +//                 └───► back to FOLLOWER |
| 54 | +// |
| 55 | +// # DB-Issued Terms |
| 56 | +// |
| 57 | +// Each leadership term is uniquely identified by the elected_at timestamp |
| 58 | +// assigned by the database when the leader row is inserted. This timestamp is |
| 59 | +// the term token for operations on the river_leader row, analogous to a term |
| 60 | +// number in other leader-election systems. |
| 61 | +// |
| 62 | +// The leader-row DB operations are scoped to the exact term: |
| 63 | +// |
| 64 | +// - Reelect: UPDATE ... WHERE elected_at = @elected_at AND leader_id = @leader_id |
| 65 | +// - Resign: DELETE ... WHERE elected_at = @elected_at AND leader_id = @leader_id |
| 66 | +// |
| 67 | +// If another client takes over (producing a different elected_at), all |
| 68 | +// operations for the old term become no-ops. A client can never accidentally |
| 69 | +// reelect or resign a different term than the one it believes it holds. |
| 70 | +// |
| 71 | +// The leadershipTerm struct captures this: |
| 72 | +// |
| 73 | +// leadershipTerm { |
| 74 | +// clientID ← who holds this term |
| 75 | +// electedAt ← DB-issued timestamp (fencing token) |
| 76 | +// trustedUntil ← local deadline after which leader must step down |
| 77 | +// } |
| 78 | +// |
| 79 | +// # Trust Window |
| 80 | +// |
| 81 | +// After winning or reelecting, the client computes a local trust deadline: |
| 82 | +// |
| 83 | +// trustedUntil = attemptStarted + TTL - safetyMargin |
| 84 | +// |
| 85 | +// Where: |
| 86 | +// - attemptStarted is the local wall-clock time when the elect/reelect |
| 87 | +// call was initiated (before the DB round-trip) |
| 88 | +// - TTL is electInterval + electIntervalTTLPaddingDefault (5s + 10s = 15s by default) |
| 89 | +// - safetyMargin is leaderLocalDeadlineSafetyMargin (default 1s) |
| 90 | +// |
| 91 | +// Timeline for a single term with default timing (5s electInterval, 15s TTL): |
| 92 | +// |
| 93 | +// attemptStarted     reelect 1         reelect 2       trustedUntil  DB expires_at |
| 94 | +// (local clock)      (+5s)             (+10s)          (+14s)        (+15s) |
| 95 | +// │                  │                 │               │             │ |
| 96 | +// │◄ electInterval ─►│◄ electInterval ►│               │             │ |
| 97 | +// │                  │                 │◄─ 4s buffer ─►│             │ |
| 98 | +// │◄──────────── TTL - 1s margin (14s) ───────────────►│             │ |
| 99 | +// │◄──────────────────── TTL (15s) ────────────────────┼──── 1s ────►│ |
| 100 | +// |
| 101 | +// With a longer electInterval, fewer scheduled reelect ticks may fit before |
| 102 | +// trustedUntil, but the trust window still ends safetyMargin before the DB row |
| 103 | +// expires. |
| 104 | +// |
| 105 | +// Key properties: |
| 106 | +// - trustedUntil is anchored to LOCAL time, never extended beyond what |
| 107 | +// the DB confirmed via a successful reelect response. |
| 108 | +// - A successful reelect renews the current term, but trustedUntil is |
| 109 | +// computed from the moment the reelect was initiated (attemptStarted), |
| 110 | +// not from when the response arrived. A slow DB response cannot extend |
| 111 | +// the trust window. |
| 112 | +// - The 1s safety margin absorbs network latency (time between local |
| 113 | +// Now() and the DB executing now()). |
| 114 | +// - The 10s TTL padding absorbs expected clock skew between client and DB. |
| 115 | +// |
| 116 | +// If the client cannot successfully reelect before trustedUntil, it |
| 117 | +// voluntarily steps down, even if the DB might still show it as leader. |
| 118 | +// |
| 119 | +// # Proactive Step-Down |
| 120 | +// |
| 121 | +// The client gives up leadership in the following scenarios: |
| 122 | +// |
| 123 | +// - Local trust window expires (reelectAttemptTimeout returns 0). |
| 124 | +// - Reelect returns ErrNotFound, meaning the term was replaced externally |
| 125 | +// (e.g., expired and won by another client). |
| 126 | +// - Reelect errors accumulate and the remaining trust window is exhausted. |
| 127 | +// - A forced resignation is received via DB notification (request_resign). |
| 128 | +// - The elector's context is cancelled (shutdown). |
| 129 | +// |
| 130 | +// On local step-down, a best-effort resign is usually attempted using |
| 131 | +// context.WithoutCancel to ensure it runs even during shutdown. The exception |
| 132 | +// is when reelect returns ErrNotFound; in that case the DB has already |
| 133 | +// authoritatively said the term is gone, so no resign is attempted. If a |
| 134 | +// best-effort resign fails (e.g., DB unreachable), the TTL on the DB row acts |
| 135 | +// as a safety net: the row expires naturally, allowing a new leader to be |
| 136 | +// elected. |
| 137 | +// |
| 138 | +// # Notification Flow |
| 139 | +// |
| 140 | +// When a DB notifier is available, the elector listens for two event types: |
| 141 | +// |
| 142 | +// - resigned: Another client resigned leadership. Followers wake up to |
| 143 | +// attempt election immediately (with small jitter to avoid a thundering |
| 144 | +// herd). |
| 145 | +// - request_resign: An external request (e.g., from QueueMaintainerLeader |
| 146 | +// after start failures) asking the current leader to step down. Only |
| 147 | +// honored if the elector is currently the leader. |
| 148 | +// |
| 149 | +// Both event types send to a single buffered wakeupChan using non-blocking |
| 150 | +// semantics (trySendWakeup). This guarantees the shared notifier goroutine |
| 151 | +// (which serves all notification topics, not just leadership) can never be |
| 152 | +// blocked by the elector. Multiple rapid notifications coalesce into a single |
| 153 | +// wakeup. Timer-based polling at electInterval provides a fallback for any |
| 154 | +// missed notifications or when running without a notifier (poll-only mode). |
| 155 | +// |
| 156 | +// # Subscription Relay |
| 157 | +// |
| 158 | +// Consumers subscribe to leadership changes via Listen(), which returns a |
| 159 | +// Subscription. Internally, each subscription uses a subscriptionRelay with |
| 160 | +// a dedicated goroutine that drains an unbounded pending queue into a |
| 161 | +// buffered channel. This design ensures: |
| 162 | +// |
| 163 | +// - The elector never blocks on slow subscribers when publishing state |
| 164 | +// changes. |
| 165 | +// - Every transition is preserved in order (true, false, true, ...). |
| 166 | +// The primary consumer (QueueMaintainerLeader) needs every false |
| 167 | +// transition to properly stop maintenance services; dropping transitions |
| 168 | +// would leave stale services running. |
| 169 | +// |
| 170 | +// Subscribers must call Unlisten() when done to stop the relay goroutine. |
| 171 | +// |
| 172 | +// # Failure Scenarios |
| 173 | +// |
| 174 | +// - DB temporarily unavailable: Reelect fails, errors accumulate, |
| 175 | +// trust window expires, client steps down. Followers retry with |
| 176 | +// exponential backoff. |
| 177 | +// - Network partition: Same as DB unavailable. Client steps down |
| 178 | +// proactively. |
| 179 | +// - Long GC pause or process stall: If the stall outlasts the trust |
| 180 | +// window, the client steps down immediately on resume without |
| 181 | +// attempting any DB operations. |
| 182 | +// - DB failover to a server with a different clock: Local trust is the |
| 183 | +// binding constraint. The client steps down conservatively. |
| 184 | +// - Resign fails on shutdown: The DB row's TTL expires naturally, |
| 185 | +// allowing a new leader to be elected. |
| 186 | +// - Rapid resign notifications: Coalesced to one wakeup. Timer-based |
| 187 | +// polling provides the backstop. |
| 188 | +package leadership |
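The coalescing behaviour described under "Notification Flow" is the usual Go non-blocking send on a buffered channel. The sketch below shows that pattern in isolation. The name `trySendWakeup` is taken from the doc comment, but its signature, the boolean return value, and the buffer size of 1 are assumptions made for illustration.

```go
package main

import "fmt"

// trySendWakeup attempts a non-blocking send. It reports true if the signal
// was queued and false if a wakeup was already pending, in which case the new
// signal is dropped (coalesced) instead of blocking the caller.
// Assumed signature; the real helper may differ.
func trySendWakeup(ch chan<- struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	wakeup := make(chan struct{}, 1) // assumed capacity of 1

	queued := 0
	for i := 0; i < 5; i++ { // five rapid notifications
		if trySendWakeup(wakeup) {
			queued++
		}
	}
	fmt.Println(queued, len(wakeup)) // 1 1

	<-wakeup // the elector loop consumes the single pending wakeup
	fmt.Println(len(wakeup)) // 0
}
```

Because the sender never blocks, a notifier goroutine shared with other topics cannot be stalled by a busy elector. The cost is that a dropped signal carries no count, which is why a timer-based fallback is still needed.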