Skip to content

Commit d41a45e

Browse files
committed
refactor leader election around DB-issued terms
The old elector treated leadership as a lease held by one client ID and renewed over time. That was simple, but it left too much implicit. One leadership term was not clearly separated from the next, so reelection and resignation were not scoped to a specific term. That made same- client reacquisition harder to reason about, made it easy for stale work or cleanup to target the wrong lease, and left the elector carrying more responsibility for edge cases than it should have. This change makes the database issue explicit leadership terms using the columns we already have. `leader_id` remains the stable client ID, while `(leader_id, elected_at)` identifies one specific term. Elect, reelect, and resign now all operate on that exact term and return the leader row from the database. The elector keeps a bounded local trust window for its last successful confirmation, but that window is anchored to the attempt that produced it, not to when the response happened to arrive. That keeps slow successful reelections from stretching leadership past its real lease budget while still avoiding direct app-vs-database clock comparisons in the state machine. The notification and test story is also clearer after the rewrite. Slow subscribers now receive each leadership transition in order without blocking the elector, resignation wakeups are coalesced safely, and the poll-only coverage uses isolated fixtures so it can exercise real handoff behavior without shared-schema flakiness. The shared driver suite now covers term-scoped elect, reelect, and resign behavior across PostgreSQL and both SQLite backends, including same-client term replacement and stale-term rejection. The elector tests focus on the observable behaviors that matter: gaining leadership, handing it off, responding to resign requests, and stepping down cleanly when its trust window expires. This also rolls up the branch's earlier flake investigation and keeps the original CI reference for the shared-schema failures that led to the redesign: https://github.com/riverqueue/river/actions/runs/24406465152
1 parent b32562c commit d41a45e

13 files changed

Lines changed: 1513 additions & 735 deletions

File tree

internal/leadership/doc.go

Lines changed: 188 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,188 @@
1+
// Package leadership implements leader election for River clients sharing a
2+
// database schema. The database records at most one current leadership term at
3+
// a time; the elected client runs distributed maintenance operations (queue
4+
// management, job scheduling, reindexing) that should not be duplicated across
5+
// clients.
6+
//
7+
// # Design Principles
8+
//
9+
// The database is the source of truth for leadership acquisition and renewal.
10+
// A single row in the river_leader table represents the current leadership
11+
// term. Local time is used only to bound how long a process trusts its last
12+
// successful DB confirmation. When there is any uncertainty about whether the
13+
// client is still the rightful leader, it errs on the side of giving up
14+
// leadership proactively rather than risk operating as a stale leader while
15+
// the DB record no longer reflects its status.
16+
//
17+
// # State Machine
18+
//
19+
// The elector alternates between two states in a loop:
20+
//
21+
// Start()
22+
// │
23+
// ▼
24+
// ┌───────────────────────────────────┐
25+
// │ FOLLOWER STATE │
26+
// │ runFollowerState() │
27+
// │ │
28+
// │ Attempts election on each tick. │
29+
// │ DELETE expired + INSERT ON │
30+
// │ CONFLICT DO NOTHING (in a txn). │
31+
// │ Sleeps between attempts, or │
32+
// │ wakes early on DB notification. │
33+
// └───────────────┬───────────────────┘
34+
// │ won election
35+
// │ (returns leadershipTerm)
36+
// │ publishLeadershipState(true)
37+
// │ signal GainedLeadership
38+
// ▼
39+
// ┌───────────────────────────────────┐
40+
// │ LEADER STATE │
41+
// │ runLeaderState() │
42+
// │ │
43+
// │ Periodically reelects via │
44+
// │ UPDATE WHERE elected_at matches. │
45+
// │ Steps down if trust expires, │
46+
// │ term is replaced, or forced │
47+
// │ resign is received. │
48+
// └───────────────┬───────────────────┘
49+
// │ lost / resigned / error
50+
// │ publishLeadershipState(false)
51+
// │ attemptResignLoop() (best effort)
52+
// │
53+
// └───► back to FOLLOWER
54+
//
55+
// # DB-Issued Terms
56+
//
57+
// Each leadership term is uniquely identified by the elected_at timestamp
58+
// assigned by the database when the leader row is inserted. This timestamp is
59+
// the term token for operations on the river_leader row, analogous to a term
60+
// number in other leader-election systems.
61+
//
62+
// The leader-row DB operations are scoped to the exact term:
63+
//
64+
// - Reelect: UPDATE ... WHERE elected_at = @elected_at AND leader_id = @leader_id
65+
// - Resign: DELETE ... WHERE elected_at = @elected_at AND leader_id = @leader_id
66+
//
67+
// If another client takes over (producing a different elected_at), all
68+
// operations for the old term become no-ops. A client can never accidentally
69+
// reelect or resign a different term than the one it believes it holds.
70+
//
71+
// The leadershipTerm struct captures this:
72+
//
73+
// leadershipTerm {
74+
// clientID ← who holds this term
75+
// electedAt ← DB-issued timestamp (fencing token)
76+
// trustedUntil ← local deadline after which leader must step down
77+
// }
78+
//
79+
// # Trust Window
80+
//
81+
// After winning or reelecting, the client computes a local trust deadline:
82+
//
83+
// trustedUntil = attemptStarted + TTL - safetyMargin
84+
//
85+
// Where:
86+
// - attemptStarted is the local wall-clock time when the elect/reelect
87+
// call was initiated (before the DB round-trip)
88+
// - TTL is electInterval + electIntervalTTLPaddingDefault (default 15s)
89+
// - safetyMargin is leaderLocalDeadlineSafetyMargin (default 1s)
90+
//
91+
// Timeline for a single term with default timing (5s electInterval, 15s TTL):
92+
//
93+
// attemptStarted reelect 1 reelect 2 trustedUntil DB expires_at
94+
// (local clock) (+5s) (+10s) (+14s) (+15s)
95+
// │ │ │ │ │
96+
// │◄ electInterval ─►│◄ electInterval ►│ │ │
97+
// │ │ │◄─ 4s buffer ─►│ │
98+
// │◄──────────── TTL - 1s margin (14s) ───────────────►│ │
99+
// │◄──────────────────── TTL (15s) ────────────────────┼──── 1s ────►│
100+
//
101+
// With a longer electInterval, fewer scheduled reelect ticks may fit before
102+
// trustedUntil, but the trust window still ends safetyMargin before the DB row
103+
// expires.
104+
//
105+
// Key properties:
106+
// - trustedUntil is anchored to LOCAL time, never extended beyond what
107+
// the DB confirmed via a successful reelect response.
108+
// - A successful reelect renews the current term, but trustedUntil is
109+
// computed from the moment the reelect was initiated (attemptStarted),
110+
// not from when the response arrived. A slow DB response cannot extend
111+
// the trust window.
112+
// - The 1s safety margin absorbs network latency (time between local
113+
// Now() and the DB executing now()).
114+
// - The 10s TTL padding absorbs expected clock skew between client and DB.
115+
//
116+
// If the client cannot successfully reelect before trustedUntil, it
117+
// voluntarily steps down, even if the DB might still show it as leader.
118+
//
119+
// # Proactive Step-Down
120+
//
121+
// The client gives up leadership in the following scenarios:
122+
//
123+
// - Local trust window expires (reelectAttemptTimeout returns 0).
124+
// - Reelect returns ErrNotFound, meaning the term was replaced externally
125+
// (e.g., expired and won by another client).
126+
// - Reelect errors accumulate and the remaining trust window is exhausted.
127+
// - A forced resignation is received via DB notification (request_resign).
128+
// - The elector's context is cancelled (shutdown).
129+
//
130+
// On local step-down, a best-effort resign is usually attempted using
131+
// context.WithoutCancel to ensure it runs even during shutdown. The exception
132+
// is when reelect returns ErrNotFound; in that case the DB has already
133+
// authoritatively said the term is gone, so no resign is attempted. If a
134+
// best-effort resign fails (e.g., DB unreachable), the TTL on the DB row acts
135+
// as a safety net: the row expires naturally, allowing a new leader to be
136+
// elected.
137+
//
138+
// # Notification Flow
139+
//
140+
// When a DB notifier is available, the elector listens for two event types:
141+
//
142+
// - resigned: Another client resigned leadership. Followers wake up to
143+
// attempt election immediately (with small jitter to avoid thundering
144+
// herd).
145+
// - request_resign: An external request (e.g., from QueueMaintainerLeader
146+
// after start failures) asking the current leader to step down. Only
147+
// honored if the elector is currently the leader.
148+
//
149+
// Both event types send to a single buffered wakeupChan using non-blocking
150+
// semantics (trySendWakeup). This guarantees the shared notifier goroutine
151+
// (which serves all notification topics, not just leadership) can never be
152+
// blocked by the elector. Multiple rapid notifications coalesce into a single
153+
// wakeup. Timer-based polling at electInterval provides a fallback for any
154+
// missed notifications or when running without a notifier (poll-only mode).
155+
//
156+
// # Subscription Relay
157+
//
158+
// Consumers subscribe to leadership changes via Listen(), which returns a
159+
// Subscription. Internally, each subscription uses a subscriptionRelay with
160+
// a dedicated goroutine that drains an unbounded pending queue into a
161+
// buffered channel. This design ensures:
162+
//
163+
// - The elector never blocks on slow subscribers when publishing state
164+
// changes.
165+
// - Every transition is preserved in order (true, false, true, ...).
166+
// The primary consumer (QueueMaintainerLeader) needs every false
167+
// transition to properly stop maintenance services; dropping transitions
168+
// would leave stale services running.
169+
//
170+
// Subscribers must call Unlisten() when done to stop the relay goroutine.
171+
//
172+
// # Failure Scenarios
173+
//
174+
// - DB temporarily unavailable: Reelect fails, errors accumulate,
175+
// trust window expires, client steps down. Followers retry with
176+
// exponential backoff.
177+
// - Network partition: Same as DB unavailable. Client steps down
178+
// proactively.
179+
// - Long GC pause or process stall: On resume, the trust window has
180+
// already expired, and the client steps down immediately without
181+
// attempting any DB operations.
182+
// - DB failover to a server with different clock: Local trust is the
183+
// binding constraint. The client steps down conservatively.
184+
// - Resign fails on shutdown: The DB row's TTL expires naturally,
185+
// allowing a new leader to be elected.
186+
// - Rapid resign notifications: Coalesced to one wakeup. Timer-based
187+
// polling provides the backstop.
188+
package leadership

0 commit comments

Comments
 (0)