fix: address AOF unbounded growth and slow leader memory leak (#685, #769)#802

Open
Mukund2900 wants to merge 1 commit into tidwall:master from Mukund2900:fix/memory-leak-and-aof-growth

Conversation


@Mukund2900 Mukund2900 commented Mar 16, 2026

Summary

This PR addresses two related issues affecting tile38 at scale with many Kafka hooks: unbounded AOF growth (#685) and a slow leader memory leak (#769).

Both issues share the same root architecture: tile38's hook processing at high throughput (thousands of SETs/sec with 70K+ hooks) creates unbounded AOF growth and massive allocation pressure that fragments the Go heap over time.

What's wrong today

1. No automatic AOF compaction

Every write command (SET, SETHOOK, DEL, DELHOOK) is appended to the AOF. With thousands of vehicle SETs per second and 70K+ geofence hooks, the AOF grows to several GB in minutes. There is no built-in mechanism to automatically compact it — users must run AOFSHRINK manually or via external cron jobs.

2. Redundant AOF writes on followers

When objects/hooks expire via TTL, backgroundExpireObjects() and backgroundExpireHooks() write DEL/DELHOOK commands to the AOF on both leader and followers. Followers don't need these writes — they can deterministically compute the same expirations from the original commands (which contain the TTL). These redundant writes cause the follower's AOF to bloat just as fast as the leader's.

3. Hook proc() scans the entire qdb index (BUG — primary root cause of #769)

Each hook's proc() function (the webhook queue processor) calls `tx.AscendGreaterOrEqual("hooks", h.query, callback)` with a callback that always returns true. The buntdb "hooks" index is sorted by hook name, so the entries for each hook are contiguous. But because the callback never returns false, every proc() call scans all remaining entries in the entire index past this hook's position.

With 70K hooks and thousands of pending events:

  • Each proc() call performs an O(N) scan instead of O(entries-for-this-hook)
  • It parses JSON (gjson.Get) for every unrelated entry
  • It creates temporary string copies for every scanned value

The 70K manager goroutines running concurrently create enormous allocation pressure. Over weeks, Go's non-compacting garbage collector fragments the heap — pages with mixed live/dead objects can't be returned to the OS, so process RSS grows while HeapAlloc stays stable. This is why the leader's free memory slowly decreases over 1-1.5 months until OOM.

This only affects the leader because queueHooks() (which populates qdb) only runs on the leader. Followers' hook goroutines are idle — they never call proc().

4. Kafka connections are never recycled under sustained load

KafkaConn expires after 30 seconds of inactivity. With 70K hooks constantly sending events, the connection is never idle — the same sarama SyncProducer runs for the entire lifetime of the process (potentially months).

Note: This is a defensive/precautionary fix. There is no confirmed memory leak in the sarama library for long-lived producers — sarama's metadata refresh allocations are GC-eligible, its batching buffers are bounded, and its metrics registry is stable for a single producer instance. However, periodically recycling long-lived connections is reasonable production hygiene.

Changes

Change 1: Auto-AOFSHRINK (aofshrink.go, config.go, server.go) — fixes #685

New `aofshrink-min-size` config property. When set (e.g. `CONFIG SET aofshrink-min-size 256mb`), a background goroutine checks AOF size every 10 seconds and triggers AOFSHRINK automatically when the threshold is exceeded, with a 1-minute cooldown. Disabled by default (value 0).

Change 2: Skip AOF writes for TTL expirations on followers (expire.go) — fixes #685 on followers

Skip writeAOF() in backgroundExpireObjects() and backgroundExpireHooks() when running as a follower. In-memory state is still updated by cmdDEL()/cmdDelHook() — only the redundant disk write is skipped. On restart, the follower replays the leader's AOF which contains the original SET/SETHOOK commands with TTLs, and the items re-expire naturally.
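The leader/follower split can be illustrated with a small sketch. `Server`, `followHost`, and the in-memory fields here are stand-ins for tile38's real types, not the PR's exact code.

```go
package main

// Server is a minimal stand-in for the real server state.
type Server struct {
	followHost string   // non-empty when this node is a follower
	aof        []string // stands in for the append-only file
	objects    map[string]bool
}

// expireObject removes an expired object from memory on both roles,
// but only the leader appends a DEL to the AOF. A follower skips the
// write: on restart it replays the leader's AOF, which re-creates the
// item with its TTL, so the item expires again deterministically.
func (s *Server) expireObject(id string) {
	delete(s.objects, id) // in-memory expiry happens on leader and follower
	if s.followHost != "" {
		return // follower: skip the redundant disk write
	}
	s.aof = append(s.aof, "DEL "+id)
}
```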

Change 3: Fix proc() full-scan bug (hooks.go) — primary fix for #769

Changed the AscendGreaterOrEqual callback to return false when it encounters an entry for a different hook name. This changes each hook's scan from O(total qdb entries) to O(entries for this specific hook). This is the primary fix for the slow leader memory leak — it eliminates the massive allocation churn that causes Go heap fragmentation over time.
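The early-exit behavior can be sketched as follows. The key format (hook name plus a separator) is illustrative rather than tile38's exact qdb layout, and the slice stands in for `tx.AscendGreaterOrEqual` starting at this hook's position in the sorted index.

```go
package main

import "strings"

// scanHookEntries collects only this hook's contiguous entries from a
// name-sorted index, stopping at the first entry for a different hook.
func scanHookEntries(sortedKeys []string, hookName string) []string {
	prefix := hookName + ":"
	var entries []string
	for _, key := range sortedKeys {
		if !strings.HasPrefix(key, prefix) {
			// The fix: return false (here, break) at the first entry
			// belonging to a different hook. The old callback always
			// returned true, so every proc() call kept iterating over
			// all remaining entries in the index.
			break
		}
		entries = append(entries, key)
	}
	return entries
}
```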

Change 4: Kafka connection max lifetime (kafka.go) — precautionary

Added kafkaMaxLifetime = 30 minutes. When a producer connection exceeds this age, it's closed and a fresh one is created on the next send.

This is a defensive measure, not a fix for a confirmed sarama bug. Long-lived connections work fine in sarama, but periodic recycling is reasonable production hygiene for services that run for months.

Impact

| Metric | Before | After |
| --- | --- | --- |
| AOF growth | Unbounded, multi-GB in minutes | Auto-compacted at configurable threshold |
| Follower AOF writes for expired items | Redundant DEL/DELHOOK written | Skipped (deterministic re-expiry on restart) |
| proc() scan per hook | O(total qdb entries) | O(entries for this hook) |
| Kafka producer lifetime | Unbounded (months) | Max 30 minutes, then recycled |
| Manual cron for AOFSHRINK/GC | Required | Eliminated with aofshrink-min-size |
| Leader memory leak over months | OOM after 1-1.5 months | Fixed: root cause (proc scan bug) eliminated |

How to enable auto-AOFSHRINK

```
CONFIG SET aofshrink-min-size 256mb
CONFIG REWRITE
```

The value accepts the same format as maxmemory (kb, mb, gb suffixes or raw bytes). Set to 0 to disable (default).
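For illustration, a parser for this value format might look like the sketch below. This is an assumed implementation of the kb/mb/gb convention, not the PR's actual parsing code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemSize converts a maxmemory-style size string ("256mb",
// "1gb", or raw bytes) to a byte count. A value of 0 disables the
// feature.
func parseMemSize(s string) (int64, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "kb"):
		mult, s = 1024, strings.TrimSuffix(s, "kb")
	case strings.HasSuffix(s, "mb"):
		mult, s = 1024*1024, strings.TrimSuffix(s, "mb")
	case strings.HasSuffix(s, "gb"):
		mult, s = 1024*1024*1024, strings.TrimSuffix(s, "gb")
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid size %q", s)
	}
	return n * mult, nil
}
```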

Test plan

  • Verify auto-AOFSHRINK triggers when AOF exceeds threshold
  • Verify auto-AOFSHRINK does not trigger when disabled (default 0)
  • Verify auto-AOFSHRINK respects 1-minute cooldown between triggers
  • Verify follower correctly expires objects/hooks without writing to AOF
  • Verify follower restarts correctly (TTL items re-expire from replayed AOF)
  • Verify hook proc() only processes entries for its own hook name
  • Verify Kafka connections are recycled after 30 minutes under sustained load
  • Load test with 70K+ hooks to confirm memory stability over extended run

This commit addresses two related issues affecting tile38 at scale
with many Kafka hooks (tidwall#685, tidwall#769):

1. Auto-AOFSHRINK: Add configurable automatic AOF compaction via the
   new `aofshrink-min-size` config property. When set (e.g. "256mb"),
   a background goroutine monitors AOF size and triggers AOFSHRINK
   automatically, eliminating the need for external cron jobs.

2. Skip redundant AOF writes on followers: Followers can deterministically
   expire items via TTL without writing DEL/DELHOOK commands to their AOF.
   This prevents unnecessary AOF growth on follower nodes.

3. Fix hook proc() full-scan bug: The webhook queue processor was scanning
   the entire buntdb index instead of stopping after processing entries
   for the current hook. With 70K+ hooks, each proc() call performed an
   O(N) scan of all remaining entries, causing massive allocation pressure
   and heap fragmentation over time — the root cause of the slow leader
   memory leak.

4. Kafka connection recycling: Add a 30-minute maximum lifetime to Kafka
   producer connections. Previously, connections were only recycled after
   30 seconds of inactivity, which never occurred under sustained load.
   This is a precautionary measure for long-running services.

Made-with: Cursor
@Mukund2900 Mukund2900 force-pushed the fix/memory-leak-and-aof-growth branch from 6cfa2b6 to f551bfd Compare March 17, 2026 07:39


Development

Successfully merging this pull request may close these issues.

Heap size not coming down after objects are removed
