fix: address AOF unbounded growth and slow leader memory leak (#685, #769) by Mukund2900 · Pull Request #802 · tidwall/tile38

Mukund2900 · 2026-03-16T10:40:10Z

Summary

This PR addresses two related issues affecting tile38 at scale with many Kafka hooks:

#685 (Heap size not coming down after objects are removed): tile38_aof_size_bytes grows to multiple GB in minutes, requiring manual AOFSHRINK + GC via an external cron job
#769 (Free memory of leader instance comes down to 10% every 1-1.5 months): free memory on the leader slowly decreases over 1-1.5 months until the instance goes OOM, requiring a restart

Both issues share the same root cause in tile38's architecture: hook processing at high throughput (thousands of SETs/sec with 70K+ hooks) creates unbounded AOF growth and massive allocation pressure that fragments the Go heap over time.

What's wrong today

1. No automatic AOF compaction

Every write command (SET, SETHOOK, DEL, DELHOOK) is appended to the AOF. With thousands of vehicle SETs per second and 70K+ geofence hooks, the AOF grows to several GB in minutes. There is no built-in mechanism to automatically compact it — users must run AOFSHRINK manually or via external cron jobs.

2. Redundant AOF writes on followers

When objects/hooks expire via TTL, backgroundExpireObjects() and backgroundExpireHooks() write DEL/DELHOOK commands to the AOF on both leader and followers. Followers don't need these writes — they can deterministically compute the same expirations from the original commands (which contain the TTL). These redundant writes cause the follower's AOF to bloat just as fast as the leader's.

3. Hook proc() scans the entire qdb index (BUG — primary root cause of #769)

Each hook's proc() function (webhook queue processor) calls `tx.AscendGreaterOrEqual("hooks", h.query, callback)` where the callback always returns true. The buntdb "hooks" index is sorted by hook name, so entries for each hook are contiguous. But because the callback never returns false, every proc() call scans all remaining entries in the entire index past this hook's position.

With 70K hooks and thousands of pending events:

Each proc() call performs an O(N) scan instead of O(entries-for-this-hook)
It parses JSON (gjson.Get) for every unrelated entry
It creates temporary string copies for every scanned value

The 70K manager goroutines running concurrently create enormous allocation pressure. Over weeks, Go's non-compacting garbage collector fragments the heap — pages with mixed live/dead objects can't be returned to the OS, so process RSS grows while HeapAlloc stays stable. This is why the leader's free memory slowly decreases over 1-1.5 months until OOM.

This only affects the leader because queueHooks() (which populates qdb) only runs on the leader. Followers' hook goroutines are idle — they never call proc().

4. Kafka connections are never recycled under sustained load

KafkaConn expires after 30 seconds of inactivity. With 70K hooks constantly sending events, the connection is never idle — the same sarama SyncProducer runs for the entire lifetime of the process (potentially months).

Note: This is a defensive/precautionary fix. There is no confirmed memory leak in the sarama library for long-lived producers — sarama's metadata refresh allocations are GC-eligible, its batching buffers are bounded, and its metrics registry is stable for a single producer instance. However, periodically recycling long-lived connections is reasonable production hygiene.

Changes

Change 1: Auto-AOFSHRINK (aofshrink.go, config.go, server.go) — fixes #685

New `aofshrink-min-size` config property. When set (e.g. `CONFIG SET aofshrink-min-size 256mb`), a background goroutine checks AOF size every 10 seconds and triggers AOFSHRINK automatically when the threshold is exceeded, with a 1-minute cooldown. Disabled by default (value 0).
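The polling behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `shouldShrink`, `autoShrinkLoop`, and the `aofSize`, `minSize`, and `shrink` callbacks are hypothetical names, and treating "exceeded" as a strict greater-than comparison is an assumption.

```go
package main

import "time"

// Values taken from the PR description.
const (
	checkInterval  = 10 * time.Second
	shrinkCooldown = time.Minute
)

// shouldShrink reports whether an automatic shrink should start.
// minSize <= 0 means the feature is disabled (the default).
// sinceLast is the time elapsed since the previous automatic shrink.
func shouldShrink(aofSize, minSize int64, sinceLast time.Duration) bool {
	if minSize <= 0 {
		return false
	}
	if aofSize <= minSize { // assumption: threshold must be strictly exceeded
		return false
	}
	return sinceLast >= shrinkCooldown
}

// autoShrinkLoop polls until stop is closed. The three callbacks are
// hypothetical stand-ins for the server's AOF size, config value, and
// AOFSHRINK entry point.
func autoShrinkLoop(stop <-chan struct{}, aofSize, minSize func() int64, shrink func()) {
	ticker := time.NewTicker(checkInterval)
	defer ticker.Stop()
	var last time.Time // zero value: no automatic shrink has run yet
	for {
		select {
		case <-stop:
			return
		case now := <-ticker.C:
			since := shrinkCooldown // the first shrink is never blocked by the cooldown
			if !last.IsZero() {
				since = now.Sub(last)
			}
			if shouldShrink(aofSize(), minSize(), since) {
				shrink()
				last = now
			}
		}
	}
}
```

Keeping the decision in a pure function makes the threshold and cooldown rules easy to unit test without waiting on a real ticker.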

Change 2: Skip AOF writes for TTL expirations on followers (expire.go) — fixes #685 on followers

Skip writeAOF() in backgroundExpireObjects() and backgroundExpireHooks() when running as a follower. In-memory state is still updated by cmdDEL()/cmdDelHook() — only the redundant disk write is skipped. On restart, the follower replays the leader's AOF which contains the original SET/SETHOOK commands with TTLs, and the items re-expire naturally.
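As a rough sketch of the idea, assuming a heavily simplified server: the `node` and `aofLog` types and the `expireObject` method below are hypothetical and do not mirror tile38's actual structures. The point is only that the in-memory delete always happens, while the log write is leader-only.

```go
package main

// aofLog is a hypothetical in-memory stand-in for the on-disk AOF.
type aofLog struct{ cmds []string }

func (l *aofLog) write(cmd string) { l.cmds = append(l.cmds, cmd) }

// node is a hypothetical, heavily simplified server.
type node struct {
	following bool            // true when this node replicates from a leader
	objects   map[string]bool // live object ids
	aof       *aofLog
}

// expireObject removes an expired object from memory. Only a leader
// records the resulting DEL in its AOF; a follower skips the write
// because it can recompute the expiration from the original TTL.
func (n *node) expireObject(id string) {
	delete(n.objects, id)
	if n.following {
		return
	}
	n.aof.write("DEL " + id)
}
```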

Change 3: Fix proc() full-scan bug (hooks.go) — primary fix for #769

Changed the AscendGreaterOrEqual callback to return false when it encounters an entry for a different hook name. This changes each hook's scan from O(total qdb entries) to O(entries for this specific hook). This is the primary fix for the slow leader memory leak — it eliminates the massive allocation churn that causes Go heap fragmentation over time.
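The effect of the early exit can be illustrated with a small stand-in for an ordered index. This sketch does not use buntdb; `entry`, `ascendGreaterOrEqual`, and `collect` are hypothetical names, and a sorted slice plays the role of the "hooks" index. The `stopEarly` flag switches between the old callback (always continue) and the fixed one (stop at the first entry for another hook).

```go
package main

import "sort"

// entry is a hypothetical queued event, ordered by hook name.
type entry struct{ hook, payload string }

// ascendGreaterOrEqual mimics an ordered-index iterator: it visits
// entries whose hook name is >= pivot, in order, until fn returns false.
func ascendGreaterOrEqual(sorted []entry, pivot string, fn func(e entry) bool) {
	start := sort.Search(len(sorted), func(i int) bool {
		return sorted[i].hook >= pivot
	})
	for _, e := range sorted[start:] {
		if !fn(e) {
			return
		}
	}
}

// collect gathers the payloads queued for one hook and counts how many
// index entries were visited along the way.
func collect(sorted []entry, hook string, stopEarly bool) (payloads []string, visited int) {
	ascendGreaterOrEqual(sorted, hook, func(e entry) bool {
		visited++
		if e.hook != hook {
			// Old behavior: keep iterating (true). Fixed behavior: stop (false).
			return !stopEarly
		}
		payloads = append(payloads, e.payload)
		return true
	})
	return payloads, visited
}
```

Because entries for one hook are contiguous, both variants return the same payloads; only the number of visited entries differs.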

Change 4: Kafka connection max lifetime (kafka.go) — precautionary

Added kafkaMaxLifetime = 30 minutes. When a producer connection exceeds this age, it's closed and a fresh one is created on the next send.
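A minimal sketch of the combined expiry rule, assuming the idle timeout and the new maximum lifetime are checked together. The names `expired` and `producerConn` are hypothetical, the inclusive comparisons are an assumption, and no sarama types are involved.

```go
package main

import "time"

// Values taken from the PR description.
const (
	idleTimeout = 30 * time.Second
	maxLifetime = 30 * time.Minute
)

// expired reports whether a connection should be closed, given how long
// it has been idle and how long ago it was created.
func expired(idle, age time.Duration) bool {
	return idle >= idleTimeout || age >= maxLifetime
}

// producerConn is a hypothetical wrapper that tracks the two timestamps.
type producerConn struct {
	created  time.Time
	lastUsed time.Time
}

// Expired applies the rule at a given instant.
func (c *producerConn) Expired(now time.Time) bool {
	return expired(now.Sub(c.lastUsed), now.Sub(c.created))
}
```

Under sustained load the idle term never fires, so the age term is what eventually forces a fresh producer.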

This is a defensive measure, not a fix for a confirmed sarama bug. Long-lived connections work fine in sarama, but periodic recycling is reasonable production hygiene for services that run for months.

Impact

| Metric | Before | After |
| --- | --- | --- |
| AOF growth | Unbounded, multi-GB in minutes | Auto-compacted at configurable threshold |
| Follower AOF writes for expired items | Redundant DEL/DELHOOK written | Skipped (deterministic re-expiry on restart) |
| proc() scan per hook | O(total qdb entries) | O(entries for this hook) |
| Kafka producer lifetime | Unbounded (months) | Max 30 minutes, then recycled |
| Manual cron for AOFSHRINK/GC | Required | Eliminated with aofshrink-min-size |
| Leader memory leak over months | OOM after 1-1.5 months | Fixed: root cause (proc scan bug) eliminated |

How to enable auto-AOFSHRINK

```
CONFIG SET aofshrink-min-size 256mb
CONFIG REWRITE
```

The value accepts the same format as maxmemory (kb, mb, gb suffixes or raw bytes). Set to 0 to disable (default).
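For illustration, a size string in this format could be parsed along the following lines. This is a sketch, not tile38's parser: `parseSize` is a hypothetical name, and treating the suffixes as case-insensitive powers of 1024 is an assumption. Overflow of very large values is not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize converts strings such as "256mb", "4096", or "2gb" to bytes.
// Assumption: kb/mb/gb are multiples of 1024 and case-insensitive.
func parseSize(s string) (int64, error) {
	num := strings.ToLower(strings.TrimSpace(s))
	mult := int64(1)
	switch {
	case strings.HasSuffix(num, "kb"):
		mult, num = 1<<10, strings.TrimSuffix(num, "kb")
	case strings.HasSuffix(num, "mb"):
		mult, num = 1<<20, strings.TrimSuffix(num, "mb")
	case strings.HasSuffix(num, "gb"):
		mult, num = 1<<30, strings.TrimSuffix(num, "gb")
	}
	n, err := strconv.ParseInt(num, 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid size %q", s)
	}
	return n * mult, nil
}
```

A result of 0 corresponds to the disabled default.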

Test plan

Verify auto-AOFSHRINK triggers when AOF exceeds threshold
Verify auto-AOFSHRINK does not trigger when disabled (default 0)
Verify auto-AOFSHRINK respects 1-minute cooldown between triggers
Verify follower correctly expires objects/hooks without writing to AOF
Verify follower restarts correctly (TTL items re-expire from replayed AOF)
Verify hook proc() only processes entries for its own hook name
Verify Kafka connections are recycled after 30 minutes under sustained load
Load test with 70K+ hooks to confirm memory stability over extended run

This commit addresses two related issues affecting tile38 at scale with many Kafka hooks (tidwall#685, tidwall#769):

1. Auto-AOFSHRINK: Add configurable automatic AOF compaction via the new `aofshrink-min-size` config property. When set (e.g. "256mb"), a background goroutine monitors AOF size and triggers AOFSHRINK automatically, eliminating the need for external cron jobs.

2. Skip redundant AOF writes on followers: Followers can deterministically expire items via TTL without writing DEL/DELHOOK commands to their AOF. This prevents unnecessary AOF growth on follower nodes.

3. Fix hook proc() full-scan bug: The webhook queue processor was scanning the entire buntdb index instead of stopping after processing entries for the current hook. With 70K+ hooks, each proc() call performed an O(N) scan of all remaining entries, causing massive allocation pressure and heap fragmentation over time. This is the root cause of the slow leader memory leak.

4. Kafka connection recycling: Add a 30-minute maximum lifetime to Kafka producer connections. Previously, connections were only recycled after 30 seconds of inactivity, which never occurred under sustained load. This is a precautionary measure for long-running services.

Made-with: Cursor

Mukund2900 force-pushed the fix/memory-leak-and-aof-growth branch from 6cfa2b6 to f551bfd Compare March 17, 2026 07:39

Mukund2900 wants to merge 1 commit into tidwall:master from Mukund2900:fix/memory-leak-and-aof-growth
