
Add Log::RateLimitBackend to suppress repeated log entries #1764

Draft
baelter wants to merge 1 commit into main from
feature/auth-failure-rate-limiting

Conversation

@baelter
Member

@baelter baelter commented Feb 28, 2026

Summary

  • Add Log::RateLimitBackend, a Log::Backend wrapper that suppresses repeated log entries within a configurable window
  • Wired up globally in Config#setup_logger wrapping the existing broadcast backend
  • No changes to business logic — connection factories call log.info { "Authentication failure..." } as normal

How it works

  • Keys on {source, message} (null-byte delimited)
  • First occurrence: passes through immediately
  • Subsequent occurrences within the window (default 5s): suppressed and counted
  • When window expires and the next entry arrives: emits a summary ("… suppressed N times") then the new entry
  • Capped at 10,000 tracked keys; entries beyond that fail-open (pass through unconditionally)
  • Periodic cleanup fiber removes stale keys after 60s inactivity
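The behavior described above can be sketched in Python (illustrative only; this is not the Crystal implementation, and names such as `RateLimitSketch` and the `emit` callback are assumptions):

```python
import time

WINDOW = 5.0          # seconds; repeats inside this window are suppressed
MAX_ENTRIES = 10_000  # beyond this, new keys fail open (pass through)

class RateLimitSketch:
    def __init__(self, emit, now=time.monotonic):
        self.emit = emit   # stands in for the wrapped downstream backend
        self.now = now     # injectable monotonic clock (like Time.instant)
        self.states = {}   # (source, message) -> (window_start, suppressed_count)

    def write(self, source, message):
        key = (source, message)
        t = self.now()
        state = self.states.get(key)
        if state is None:
            if len(self.states) >= MAX_ENTRIES:
                self.emit(message)        # at capacity: fail open
                return
            self.states[key] = (t, 0)     # first occurrence passes through
            self.emit(message)
            return
        start, count = state
        if t - start < WINDOW:
            self.states[key] = (start, count + 1)  # suppress and count
            return
        if count > 0:                     # window expired: summary first
            self.emit(f"suppressed {count} times: {message}")
        self.states[key] = (t, 0)         # then the new entry
        self.emit(message)
```

The periodic cleanup fiber is omitted here; it would walk `states`, drop keys idle for 60s, and flush any pending counts as summaries.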

Motivation

Under connection churn, a client with wrong credentials can generate hundreds of auth failure log lines per second. This wastes CPU on log formatting and I/O. Rather than tracking this in each connection factory, the suppression is handled transparently by the logging layer and applies to any log source.

Test plan

  • make test SPEC=spec/log/rate_limit_backend_spec.cr (9 specs)
  • Manual test: connect with invalid credentials repeatedly, verify log suppression

@claude

This comment was marked as resolved.

@claude

claude bot commented Feb 28, 2026

PR Review: Auth failure rate-limiting

No issues found.

The implementation is correct and well-structured:

  • Entry is a struct, and the code properly writes copies back to the hash after mutation.
  • Time.instant (monotonic clock) is the right choice for measuring elapsed time.
  • The Key record provides correct equality and hashing semantics for use as a hash key.
  • The suppression logic correctly tracks and reports suppressed counts.
  • String allocations from Key.new are acceptable since auth failures are not a hot path.

@carlhoerberg
Member

We could take it further and make it more generic: hash each log message in Log, compare it against a size-limited array/hash/set, and don't print repeated log messages, much like syslog does.
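The syslog-style idea above could be sketched like this in Python (illustrative only; `RecentDedup`, the blake2s hash, and the capacity are assumptions, not anything in the codebase):

```python
from collections import OrderedDict
import hashlib

class RecentDedup:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.seen = OrderedDict()   # insertion-ordered; oldest evicted first

    def should_print(self, message):
        h = hashlib.blake2s(message.encode()).hexdigest()
        if h in self.seen:
            self.seen.move_to_end(h)       # refresh recency
            return False                   # recently seen: don't print
        self.seen[h] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # evict oldest hash
        return True
```

Unlike the windowed approach, this suppresses a repeat forever while its hash stays in the bounded set, and it emits no "suppressed N times" summary.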

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from ed46190 to d846f0c Compare March 2, 2026 08:52
@baelter
Member Author

baelter commented Mar 2, 2026

Made it generic but only configured it for auth logs since it adds a performance cost on every log call.

@claude

This comment was marked as resolved.

@claude

This comment was marked as resolved.

@claude

claude bot commented Mar 2, 2026

No issues found.

@baelter
Member Author

baelter commented Mar 2, 2026

Maybe push this into a log backend/dispatcher instead?

The motivation looks faulty; followers don't start AMQP listeners.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from feeea82 to 1c0271c Compare March 3, 2026 08:48
@baelter baelter changed the title Rate-limit auth failure logging Add Log::RateLimitBackend to suppress repeated log entries Mar 3, 2026
@claude

This comment was marked as resolved.

@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from 1c0271c to f69694a Compare March 3, 2026 09:00
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from f69694a to 18a536b Compare March 3, 2026 09:04
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from 18a536b to af12b9c Compare March 3, 2026 09:12
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from af12b9c to a3868c0 Compare March 3, 2026 13:40
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from a3868c0 to ca3ad00 Compare March 3, 2026 14:56
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from ca3ad00 to b3407ea Compare March 3, 2026 15:31
@claude

This comment was marked as resolved.

@claude

This comment was marked as resolved.

@claude

claude bot commented Mar 4, 2026

PR Review: Log::RateLimitBackend

1. Phantom source "auth.handler" — no log entries will ever match

File: src/lavinmq/config.cr:289

The configured source "auth.handler" does not exist anywhere in the codebase. The actual auth-related log sources are:

  • auth.authenticator (in src/lavinmq/auth/authenticator.cr)
  • auth.chain (in src/lavinmq/auth/chain.cr)

Since RateLimitBackend uses exact string matching (@sources.includes?(entry.source)), the "auth.handler" filter is dead code and no auth backend log entries will be rate-limited.

2. Rate limiting keyed on full message defeats brute-force protection

File: src/lavinmq/log/rate_limit_backend.cr:23

The rate limiter uses {source, message} as the dedup key. Auth failure messages embed the username (e.g., "Authentication failure for user \"alice\""), so each distinct username creates a separate bucket. Consequences:

  • A dictionary/brute-force attack trying different usernames is never rate-limited — every attempt has a unique key and passes through.
  • With MAX_ENTRIES = 10_000, an attacker sending 10,000+ unique usernames fills the hash. Once full, all subsequent entries (even repeated ones) pass through un-rate-limited due to the fail-open behavior at capacity.

If the goal is to suppress log spam from auth failures (as the branch name suggests), consider keying on a normalized prefix or just the source, rather than the full message.

3. Race between cleanup and close drops suppression summaries

File: src/lavinmq/log/rate_limit_backend.cr:100-112

In cleanup, stale entries with pending suppressed counts are removed from @states and collected into to_flush inside the lock. After releasing the lock, flush_summaries(to_flush) is guarded by unless @closed. If close runs between the lock release and that check (setting @closed = true and flushing only the entries still in @states), the summaries in to_flush are silently lost — cleanup skips flushing them, and close never saw them because cleanup already removed them.

This is a narrow window that only matters during shutdown or log rotation, so it is minor, but it could be fixed by moving the closed check inside the lock or flushing to_flush unconditionally.
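The second suggested fix can be sketched in Python (`threading.Lock` stands in for the Crystal lock; `CleanupSketch` and its fields are hypothetical names, not the real code). Once stale entries are removed from the shared state under the lock, their pending counts exist nowhere else, so flushing them unconditionally is safe even if close runs concurrently:

```python
import threading, time

class CleanupSketch:
    def __init__(self, emit, stale_after=60.0):
        self.emit = emit                 # downstream backend write
        self.stale_after = stale_after
        self.lock = threading.Lock()
        self.states = {}                 # (source, message) -> (last_seen, count)

    def cleanup(self, now=None):
        now = time.monotonic() if now is None else now
        to_flush = []
        with self.lock:
            for key, (last_seen, count) in list(self.states.items()):
                if now - last_seen >= self.stale_after:
                    del self.states[key]
                    if count > 0:
                        to_flush.append((key, count))
        # Flush outside the lock but unconditionally: these counts were
        # already removed from self.states, so no one else can report them.
        for (source, message), count in to_flush:
            self.emit(f"suppressed {count} times: {message}")
```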

@claude

claude bot commented Mar 4, 2026

PR Review: Log Rate Limit Backend

1. Phantom source "auth.handler" (Bug)

In src/lavinmq/config.cr:289, the rate-limited sources list includes "auth.handler":

sources: ["amqp.connection_factory", "mqtt.connection_factory", "auth.handler"]

No logger in the codebase uses the source "auth.handler". The actual auth-related sources are:

  • "auth.authenticator" (src/lavinmq/auth/authenticator.cr:6)
  • "auth.chain" (src/lavinmq/auth/chain.cr:11)
  • "oauth2" (src/lavinmq/auth/authenticators/oauth.cr:9)

Since RateLimitBackend uses exact string matching (@sources.includes?(entry.source)), the "auth.handler" entry is dead config that will never match any log entry. This is not currently a functional problem because the main auth failure messages are logged from the connection factory sources (which are covered), but it is misleading and would silently fail to rate-limit if auth failure logging is ever moved or added under an auth.* source.

Fix: replace "auth.handler" with the intended source name(s), or remove it if auth failure logs are only expected from the connection factories.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from 0b9f9b6 to ddac481 Compare March 4, 2026 09:48
Adds a Log::Backend wrapper that suppresses repeated log entries within
a configurable window. The first occurrence of each {source, message}
pair passes through immediately. Subsequent occurrences within the window
are counted but not forwarded. When the window expires and a new entry
arrives, a summary ("… suppressed N times: <message>") is emitted first,
followed by the new entry. Pending counts are also flushed on close and
during periodic cleanup of stale keys.

Wired up in Config#setup_logger scoped to amqp.connection_factory,
mqtt.connection_factory, and auth.handler sources only, so unrelated
log traffic has no overhead.

Motivation: under connection churn, a client with wrong credentials can
generate hundreds of auth failure log lines per second, wasting CPU on
formatting and I/O. The suppression is handled transparently by the
logging layer and applies to any log source in the configured list.
@claude

claude bot commented Mar 4, 2026

Code Review

1. Critical: Unintentional revert of two recent bug fixes

The PR inadvertently reverts changes from two recently merged PRs. The branch was likely created before these were merged and was rebased/merged incorrectly.

a) Revert of #1772: pos = 4u32 reset on segment boundary crossing

In src/lavinmq/amqp/stream/stream_message_store.cr, the removal of pos = 4u32 at line ~107 means that when find_offset_in_segments crosses a segment boundary, pos retains its old value (the previous segment's size). The next iteration then reads BytesMessage.from_bytes(rfile.to_slice + pos) at a wrong offset in the new segment, causing corrupted reads or out-of-bounds access.

b) Revert of #1760: segment_last_ts tracking

All segment_last_ts code is removed: the field writes in write_metadata_file, the reads in restore_metadata_file, the scan fallback, and the initialization in new_segment. This re-introduces the bug where messages can be lost on restart when x-max-age is set. The corresponding tests in spec/stream_queue_spec.cr are also deleted.

Neither of these changes is mentioned in the PR description ("No changes to business logic"), confirming they are unintentional.

Fix: Rebase onto current main without dropping the hunks from #1760 and #1772.

2. Race between write and close in RateLimitBackend

In write, the @closed check and @backend.write calls both happen outside the lock:

def write(entry : Log::Entry) : Nil
  return if @closed                    # (1) check without lock
  # ...
  summary, pass_through = @lock.synchronize { update_state(...) }
  @backend.write(summary_entry)        # (3) write without lock
  @backend.write(entry) if pass_through
end

If close is called from another fiber between (1) and (3) — possible if @backend.write in the closing fiber yields during IO — then @backend.close completes before the write fiber's @backend.write, writing to a closed backend.

Fix: Either move the @closed check inside the lock, or guard the @backend.write calls in write with a begin/rescue for the closed-backend case.
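The first option can be sketched in Python (threads stand in for fibers; `ClosableWriter` and its callbacks are hypothetical names). The closed flag is only read and written under the same lock that close takes, so a write can never interleave with a completed close:

```python
import threading

class ClosableWriter:
    def __init__(self, backend_write, backend_close):
        self.backend_write = backend_write
        self.backend_close = backend_close
        self.lock = threading.Lock()
        self.closed = False

    def write(self, entry):
        with self.lock:
            if self.closed:
                return          # dropped: backend is already closed
            # state update would happen here, under the same lock
            self.backend_write(entry)

    def close(self):
        with self.lock:
            if self.closed:
                return
            self.closed = True
            self.backend_close()
```

The trade-off of this variant is that holding the lock across the downstream write serializes all log IO, which is why guarding the write with a rescue may be preferable on a hot path.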

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from ddac481 to d9b52bc Compare March 4, 2026 11:02
@claude

claude bot commented Mar 4, 2026

PR Review: Add Log::RateLimitBackend

1. Branch needs rebase — unintended reversions (Critical)

The diff includes changes to src/lavinmq/amqp/stream/stream_message_store.cr, spec/stream_queue_spec.cr, and static/js/chart.js that are unrelated to the rate-limit feature. Specifically, it reverts:

  • the pos = 4u32 reset on segment boundary crossing from #1772
  • the segment_last_ts tracking from #1760

Both associated specs are also deleted. The branch needs to be rebased on main to incorporate these fixes.

2. Race condition in write vs close

In rate_limit_backend.cr, write checks @closed without holding @lock:

def write(entry : Log::Entry) : Nil
  return if @closed          # <-- unsynchronized read
  # ...
  @backend.write(entry)      # <-- may execute after close finishes
end

Meanwhile, close sets @closed = true inside @lock.synchronize and then calls @backend.close. A fiber could pass the @closed check, yield (e.g., during @lock.synchronize in update_state), close runs to completion (closing @backend), and then the fiber calls @backend.write(entry) on the closed inner backend.

Fix: either check @closed inside the same @lock.synchronize block that calls update_state, or move @backend.close out of close (let the owner manage the inner backend's lifecycle).

3. Rate limiting is trivially bypassed by varying usernames

The rate-limit key is {source, message} where the message includes the interpolated username (e.g., "Authentication failure for user \"alice\""). Each unique username creates a separate bucket, so an attacker rotating usernames generates no duplicates and nothing gets suppressed. Additionally, each unique username consumes a slot in the @states hash (up to MAX_ENTRIES = 10_000). Once full, the backend fails open and all subsequent messages pass through unthrottled.

Consider either:

  • Keying on source alone (or source + a fixed prefix like "Authentication failure")
  • Using a pattern/regex to normalize the message before keying
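One possible normalization, sketched in Python (the regex and the placeholder are assumptions about the message shape, not taken from the codebase): replace any quoted value with a fixed placeholder before building the dedup key, so messages differing only in the quoted username collapse into one bucket.

```python
import re

QUOTED = re.compile(r'"[^"]*"')

def dedup_key(source, message):
    # 'Authentication failure for user "alice"' and '... "bob"'
    # both map to '... "<redacted>"'.
    return (source, QUOTED.sub('"<redacted>"', message))
```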
