
Add Log::RateLimitBackend to suppress repeated log entries #1764

Draft
baelter wants to merge 1 commit into main from
feature/auth-failure-rate-limiting

Conversation

@baelter
Member

@baelter baelter commented Feb 28, 2026

Summary

  • Add Log::RateLimitBackend, a Log::Backend wrapper that suppresses repeated log entries within a configurable window
  • Wired up globally in Config#setup_logger wrapping the existing broadcast backend
  • No changes to business logic — connection factories call log.info { "Authentication failure..." } as normal

How it works

  • Keys on {source, message} (null-byte delimited)
  • First occurrence: passes through immediately
  • Subsequent occurrences within the window (default 5s): suppressed and counted
  • When window expires and the next entry arrives: emits a summary ("… suppressed N times") then the new entry
  • Capped at 10,000 tracked keys; entries beyond that fail-open (pass through unconditionally)
  • Periodic cleanup fiber removes stale keys after 60s inactivity
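The behavior described above can be sketched in Python (illustrative only; this is not the Crystal implementation, and names such as `RateLimitSketch` and the `emit` callback are assumptions):

```python
import time

WINDOW = 5.0          # seconds; repeats inside this window are suppressed
MAX_ENTRIES = 10_000  # beyond this, new keys fail open (pass through)

class RateLimitSketch:
    def __init__(self, emit, now=time.monotonic):
        self.emit = emit   # stands in for the wrapped downstream backend
        self.now = now     # injectable monotonic clock (like Time.instant)
        self.states = {}   # (source, message) -> (window_start, suppressed_count)

    def write(self, source, message):
        key = (source, message)
        t = self.now()
        state = self.states.get(key)
        if state is None:
            if len(self.states) >= MAX_ENTRIES:
                self.emit(message)        # at capacity: fail open
                return
            self.states[key] = (t, 0)     # first occurrence passes through
            self.emit(message)
            return
        start, count = state
        if t - start < WINDOW:
            self.states[key] = (start, count + 1)  # suppress and count
            return
        if count > 0:                     # window expired: summary first
            self.emit(f"suppressed {count} times: {message}")
        self.states[key] = (t, 0)         # then the new entry
        self.emit(message)
```

The periodic cleanup fiber is omitted here; it would walk `states`, drop keys idle for 60s, and flush any pending counts as summaries.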

Motivation

Under connection churn, a client with wrong credentials can generate hundreds of auth failure log lines per second. This wastes CPU on log formatting and I/O. Rather than tracking this in each connection factory, the suppression is handled transparently by the logging layer and applies to any log source.

Test plan

  • make test SPEC=spec/log/rate_limit_backend_spec.cr (9 specs)
  • Manual test: connect with invalid credentials repeatedly, verify log suppression

@claude

This comment was marked as resolved.

@claude

claude bot commented Feb 28, 2026

PR Review: Auth failure rate-limiting

No issues found.

The implementation is correct and well-structured:

  • Entry is a struct, and the code properly writes copies back to the hash after mutation.
  • Time.instant (monotonic clock) is the right choice for measuring elapsed time.
  • The Key record provides correct equality and hashing semantics for use as a hash key.
  • The suppression logic correctly tracks and reports suppressed counts.
  • String allocations from Key.new are acceptable since auth failures are not a hot path.

@carlhoerberg
Member

We could take it further and make it more generic: hash each log message in Log, compare it against a size-limited array/hash/set, and don't print repeated log messages, much like syslog does.
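The syslog-style idea above could be sketched like this in Python (illustrative only; `RecentDedup`, the blake2s hash, and the capacity are assumptions, not anything in the codebase):

```python
from collections import OrderedDict
import hashlib

class RecentDedup:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.seen = OrderedDict()   # insertion-ordered; oldest evicted first

    def should_print(self, message):
        h = hashlib.blake2s(message.encode()).hexdigest()
        if h in self.seen:
            self.seen.move_to_end(h)       # refresh recency
            return False                   # recently seen: don't print
        self.seen[h] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # evict oldest hash
        return True
```

Unlike the windowed approach, this suppresses a repeat forever while its hash stays in the bounded set, and it emits no "suppressed N times" summary.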

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from ed46190 to d846f0c Compare March 2, 2026 08:52
@baelter
Member Author

baelter commented Mar 2, 2026

Made it generic but only configured it for auth logs since it adds a performance cost on every log call.

@claude

This comment was marked as resolved.

@claude

This comment was marked as resolved.

@claude

claude bot commented Mar 2, 2026

No issues found.

@baelter
Member Author

baelter commented Mar 2, 2026

Maybe push this into a log backend/dispatcher instead?

The motivation looks faulty; followers don't start AMQP listeners.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from feeea82 to 1c0271c Compare March 3, 2026 08:48
@baelter baelter changed the title Rate-limit auth failure logging Add Log::RateLimitBackend to suppress repeated log entries Mar 3, 2026
@claude

This comment was marked as resolved.

@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from 1c0271c to f69694a Compare March 3, 2026 09:00
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from f69694a to 18a536b Compare March 3, 2026 09:04
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from 18a536b to af12b9c Compare March 3, 2026 09:12
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from af12b9c to a3868c0 Compare March 3, 2026 13:40
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from a3868c0 to ca3ad00 Compare March 3, 2026 14:56
@claude

This comment was marked as resolved.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from ca3ad00 to b3407ea Compare March 3, 2026 15:31
@claude

This comment was marked as resolved.

@claude

This comment was marked as resolved.

@claude

claude bot commented Mar 4, 2026

PR Review: Log::RateLimitBackend

1. Phantom source "auth.handler" — no log entries will ever match

File: src/lavinmq/config.cr:289

The configured source "auth.handler" does not exist anywhere in the codebase. The actual auth-related log sources are:

  • auth.authenticator (in src/lavinmq/auth/authenticator.cr)
  • auth.chain (in src/lavinmq/auth/chain.cr)

Since RateLimitBackend uses exact string matching (@sources.includes?(entry.source)), the "auth.handler" filter is dead code and no auth backend log entries will be rate-limited.

2. Rate limiting keyed on full message defeats brute-force protection

File: src/lavinmq/log/rate_limit_backend.cr:23

The rate limiter uses {source, message} as the dedup key. Auth failure messages embed the username (e.g., "Authentication failure for user \"alice\""), so each distinct username creates a separate bucket. Consequences:

  • A dictionary/brute-force attack trying different usernames is never rate-limited — every attempt has a unique key and passes through.
  • With MAX_ENTRIES = 10_000, an attacker sending 10,000+ unique usernames fills the hash. Once full, all subsequent entries (even repeated ones) pass through un-rate-limited due to the fail-open behavior at capacity.

If the goal is to suppress log spam from auth failures (as the branch name suggests), consider keying on a normalized prefix or just the source, rather than the full message.

3. Race between cleanup and close drops suppression summaries

File: src/lavinmq/log/rate_limit_backend.cr:100-112

In cleanup, stale entries with pending suppressed counts are removed from @states and collected into to_flush inside the lock. After releasing the lock, flush_summaries(to_flush) is guarded by unless @closed. If close runs between the lock release and that check (setting @closed = true and flushing only the entries still in @states), the summaries in to_flush are silently lost — cleanup skips flushing them, and close never saw them because cleanup already removed them.

This is a narrow window that only matters during shutdown or log rotation, so it is minor, but it could be fixed by moving the closed check inside the lock or flushing to_flush unconditionally.
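The second suggested fix can be sketched in Python (`threading.Lock` stands in for the Crystal lock; `CleanupSketch` and its fields are hypothetical names, not the real code). Once stale entries are removed from the shared state under the lock, their pending counts exist nowhere else, so flushing them unconditionally is safe even if close runs concurrently:

```python
import threading, time

class CleanupSketch:
    def __init__(self, emit, stale_after=60.0):
        self.emit = emit                 # downstream backend write
        self.stale_after = stale_after
        self.lock = threading.Lock()
        self.states = {}                 # (source, message) -> (last_seen, count)

    def cleanup(self, now=None):
        now = time.monotonic() if now is None else now
        to_flush = []
        with self.lock:
            for key, (last_seen, count) in list(self.states.items()):
                if now - last_seen >= self.stale_after:
                    del self.states[key]
                    if count > 0:
                        to_flush.append((key, count))
        # Flush outside the lock but unconditionally: these counts were
        # already removed from self.states, so no one else can report them.
        for (source, message), count in to_flush:
            self.emit(f"suppressed {count} times: {message}")
```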

@claude

claude bot commented Mar 4, 2026

PR Review: Log Rate Limit Backend

1. Phantom source "auth.handler" (Bug)

In src/lavinmq/config.cr:289, the rate-limited sources list includes "auth.handler":

sources: ["amqp.connection_factory", "mqtt.connection_factory", "auth.handler"]

No logger in the codebase uses the source "auth.handler". The actual auth-related sources are:

  • "auth.authenticator" (src/lavinmq/auth/authenticator.cr:6)
  • "auth.chain" (src/lavinmq/auth/chain.cr:11)
  • "oauth2" (src/lavinmq/auth/authenticators/oauth.cr:9)

Since RateLimitBackend uses exact string matching (@sources.includes?(entry.source)), the "auth.handler" entry is dead config that will never match any log entry. This is not currently a functional problem because the main auth failure messages are logged from the connection factory sources (which are covered), but it is misleading and would silently fail to rate-limit if auth failure logging is ever moved or added under an auth.* source.

Fix: replace "auth.handler" with the intended source name(s), or remove it if auth failure logs are only expected from the connection factories.

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from 0b9f9b6 to ddac481 Compare March 4, 2026 09:48
Adds a Log::Backend wrapper that suppresses repeated log entries within
a configurable window. The first occurrence of each {source, message}
pair passes through immediately. Subsequent occurrences within the window
are counted but not forwarded. When the window expires and a new entry
arrives, a summary ("… suppressed N times: <message>") is emitted first,
followed by the new entry. Pending counts are also flushed on close and
during periodic cleanup of stale keys.

Wired up in Config#setup_logger scoped to amqp.connection_factory,
mqtt.connection_factory, and auth.handler sources only, so unrelated
log traffic has no overhead.

Motivation: under connection churn, a client with wrong credentials can
generate hundreds of auth failure log lines per second, wasting CPU on
formatting and I/O. The suppression is handled transparently by the
logging layer and applies to any log source in the configured list.
@claude

claude bot commented Mar 4, 2026

Code Review

1. Critical: Unintentional revert of two recent bug fixes

The PR inadvertently reverts changes from two recently merged PRs. The branch was likely created before these were merged and was rebased/merged incorrectly.

a) Revert of #1772: pos = 4u32 reset on segment boundary crossing

In src/lavinmq/amqp/stream/stream_message_store.cr, the removal of pos = 4u32 at line ~107 means that when find_offset_in_segments crosses a segment boundary, pos retains its old value (the previous segment's size). The next iteration then reads BytesMessage.from_bytes(rfile.to_slice + pos) at a wrong offset in the new segment, causing corrupted reads or out-of-bounds access.

b) Revert of #1760: segment_last_ts tracking

All segment_last_ts code is removed: the field writes in write_metadata_file, the reads in restore_metadata_file, the scan fallback, and the initialization in new_segment. This re-introduces the bug where messages can be lost on restart when x-max-age is set. The corresponding tests in spec/stream_queue_spec.cr are also deleted.

Neither of these changes is mentioned in the PR description ("No changes to business logic"), confirming they are unintentional.

Fix: Rebase onto current main without dropping the hunks from #1760 and #1772.

2. Race between write and close in RateLimitBackend

In write, the @closed check and @backend.write calls both happen outside the lock:

def write(entry : Log::Entry) : Nil
  return if @closed                    # (1) check without lock
  # ...
  summary, pass_through = @lock.synchronize { update_state(...) }
  @backend.write(summary_entry)        # (3) write without lock
  @backend.write(entry) if pass_through
end

If close is called from another fiber between (1) and (3) — possible if @backend.write in the closing fiber yields during IO — then @backend.close completes before the write fiber's @backend.write, writing to a closed backend.

Fix: Either move the @closed check inside the lock, or guard the @backend.write calls in write with a begin/rescue for the closed-backend case.
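The first option can be sketched in Python (threads stand in for fibers; `ClosableWriter` and its callbacks are hypothetical names). The closed flag is only read and written under the same lock that close takes, so a write can never interleave with a completed close:

```python
import threading

class ClosableWriter:
    def __init__(self, backend_write, backend_close):
        self.backend_write = backend_write
        self.backend_close = backend_close
        self.lock = threading.Lock()
        self.closed = False

    def write(self, entry):
        with self.lock:
            if self.closed:
                return          # dropped: backend is already closed
            # state update would happen here, under the same lock
            self.backend_write(entry)

    def close(self):
        with self.lock:
            if self.closed:
                return
            self.closed = True
            self.backend_close()
```

The trade-off of this variant is that holding the lock across the downstream write serializes all log IO, which is why guarding the write with a rescue may be preferable on a hot path.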

@baelter baelter force-pushed the feature/auth-failure-rate-limiting branch from ddac481 to d9b52bc Compare March 4, 2026 11:02
@claude

claude bot commented Mar 4, 2026

PR Review: Add Log::RateLimitBackend

1. Branch needs rebase — unintended reversions (Critical)

The diff includes changes to src/lavinmq/amqp/stream/stream_message_store.cr, spec/stream_queue_spec.cr, and static/js/chart.js that are unrelated to the rate-limit feature. Specifically, it reverts:

  • the pos = 4u32 reset on segment boundary crossing from #1772
  • the segment_last_ts tracking from #1760

Both associated specs are also deleted. The branch needs to be rebased on main to incorporate these fixes.

2. Race condition in write vs close

In rate_limit_backend.cr, write checks @closed without holding @lock:

def write(entry : Log::Entry) : Nil
  return if @closed          # <-- unsynchronized read
  # ...
  @backend.write(entry)      # <-- may execute after close finishes
end

Meanwhile, close sets @closed = true inside @lock.synchronize and then calls @backend.close. A fiber could pass the @closed check, yield (e.g., during @lock.synchronize in update_state), close runs to completion (closing @backend), and then the fiber calls @backend.write(entry) on the closed inner backend.

Fix: either check @closed inside the same @lock.synchronize block that calls update_state, or move @backend.close out of close (let the owner manage the inner backend's lifecycle).

3. Rate limiting is trivially bypassed by varying usernames

The rate-limit key is {source, message} where the message includes the interpolated username (e.g., "Authentication failure for user \"alice\""). Each unique username creates a separate bucket, so an attacker rotating usernames generates no duplicates and nothing gets suppressed. Additionally, each unique username consumes a slot in the @states hash (up to MAX_ENTRIES = 10_000). Once full, the backend fails open and all subsequent messages pass through unthrottled.

Consider either:

  • Keying on source alone (or source + a fixed prefix like "Authentication failure")
  • Using a pattern/regex to normalize the message before keying
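One possible normalization, sketched in Python (the regex and the placeholder are assumptions about the message shape, not taken from the codebase): replace any quoted value with a fixed placeholder before building the dedup key, so messages differing only in the quoted username collapse into one bucket.

```python
import re

QUOTED = re.compile(r'"[^"]*"')

def dedup_key(source, message):
    # 'Authentication failure for user "alice"' and '... "bob"'
    # both map to '... "<redacted>"'.
    return (source, QUOTED.sub('"<redacted>"', message))
```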
