Skip to content

Canary: Duplicate logs on reconnection #21232

@ScottWager

Description

@ScottWager

Describe the bug

Context:

Loki is configured to drop websocket connections after tail_max_duration. The default of 1h means that Loki Canary has to reconnect hourly. When Loki Canary reconnects, it does not continue reading logs from where it left off, but starts in the past (rereading logs).

Problems:

  • This causes many Loki Canary error logs: received an unexpected entry followed by received a duplicate entry.
  • The loki_canary_duplicate_entries_total metric is increased for each received a duplicate entry log. This is unexpected behaviour because there are no duplicate entries, just the same entry read twice. This renders the loki_canary_duplicate_entries_total metric unusable because the metric is puluted from reconnections

Details:

  • The Loki tail_max_duration configuration determines how frequently the loki_canary_duplicate_entries_total value increases
  • The Loki Canary -interval flags affect the number of duplicate entries. The lower the value, the more duplicate entries written per tail_max_duration.
  • The Loki Canary -wait flag seems to affect how far in the past it starts reading logs from. The higher the value, the more duplicate entries written per tail_max_duration.

Examples:

  • -interval of 1s and -wait of 1m (60 logs per 1 minute), my loki_canary_duplicate_entries_total metric increased by ~60 per tail_max_duration
  • -interval of 10s and -wait of 5m (30 logs per 5 minutes), my loki_canary_duplicate_entries_total metric increased by ~30 per tail_max_duration
  • -interval of 10s and -wait of 2m (12 logs per 2 minutes), my loki_canary_duplicate_entries_total metric increased by ~12 per tail_max_duration

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (3.6.7)
  2. Started Loki Canary (3.6.7)
  3. Wait tail_max_duration (Loki configuration), causing Loki to drop Loki Canary's websocket connection
  4. See Loki Canary error logs: received an unexpected entry followed by received a duplicate entry
  5. See loki_canary_duplicate_entries_total metric increase

Expected behavior

  • received an unexpected entry & received a duplicate entry errors do not occur on reconnection
  • loki_canary_duplicate_entries_total is not increased on reconnection, resulting in accurate graphs and usable alerts

Possible solutions:

  1. Keep track of the last time stamp. When recconection occurs, start reading the logs from the last time stamp
  2. Continue to reread old logs, but do not increase the loki_canary_duplicate_entries_total metric when logs that have been encountered are met. loki_canary_duplicate_entries_total should only be incremented when the same log appears twice, not when a single log is read twice due to a reconnect.
  3. Update Loki Canary documentation to recommend Loki's tail_max_duration value be high enough that reconnection never occurs. This solution is not preferred because it may cause other issues as outlined in the original pull request.

Environment:

Loki & Loki Canary installed in monolithic mode on a VPS running Debian

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions