Describe the bug
Context:
Loki is configured to drop websocket connections after tail_max_duration. The default of 1h means that Loki Canary has to reconnect hourly. When Loki Canary reconnects, it does not continue reading logs from where it left off, but starts in the past (rereading logs).
Problems:
- This causes many Loki Canary error logs:
received an unexpected entry followed by received a duplicate entry.
- The
loki_canary_duplicate_entries_total metric is increased for each received a duplicate entry log. This is unexpected behaviour because there are no duplicate entries, just the same entry read twice. This renders the loki_canary_duplicate_entries_total metric unusable because the metric is puluted from reconnections
Details:
- The Loki
tail_max_duration configuration determines how frequently the loki_canary_duplicate_entries_total value increases
- The Loki Canary
-interval flags affect the number of duplicate entries. The lower the value, the more duplicate entries written per tail_max_duration.
- The Loki Canary
-wait flag seems to affect how far in the past it starts reading logs from. The higher the value, the more duplicate entries written per tail_max_duration.
Examples:
-interval of 1s and -wait of 1m (60 logs per 1 minute), my loki_canary_duplicate_entries_total metric increased by ~60 per tail_max_duration
-interval of 10s and -wait of 5m (30 logs per 5 minutes), my loki_canary_duplicate_entries_total metric increased by ~30 per tail_max_duration
-interval of 10s and -wait of 2m (12 logs per 2 minutes), my loki_canary_duplicate_entries_total metric increased by ~12 per tail_max_duration
To Reproduce
Steps to reproduce the behavior:
- Started Loki (3.6.7)
- Started Loki Canary (3.6.7)
- Wait
tail_max_duration (Loki configuration), causing Loki to drop Loki Canary's websocket connection
- See Loki Canary error logs:
received an unexpected entry followed by received a duplicate entry
- See
loki_canary_duplicate_entries_total metric increase
Expected behavior
received an unexpected entry & received a duplicate entry errors do not occur on reconnection
loki_canary_duplicate_entries_total is not increased on reconnection, resulting in accurate graphs and usable alerts
Possible solutions:
- Keep track of the last time stamp. When recconection occurs, start reading the logs from the last time stamp
- Continue to reread old logs, but do not increase the
loki_canary_duplicate_entries_total metric when logs that have been encountered are met. loki_canary_duplicate_entries_total should only be incremented when the same log appears twice, not when a single log is read twice due to a reconnect.
- Update Loki Canary documentation to recommend Loki's
tail_max_duration value be high enough that reconnection never occurs. This solution is not preferred because it may cause other issues as outlined in the original pull request.
Environment:
Loki & Loki Canary installed in monolithic mode on a VPS running Debian
Describe the bug
Context:
Loki is configured to drop websocket connections after tail_max_duration. The default of
1hmeans that Loki Canary has to reconnect hourly. When Loki Canary reconnects, it does not continue reading logs from where it left off, but starts in the past (rereading logs).Problems:
received an unexpected entryfollowed byreceived a duplicate entry.loki_canary_duplicate_entries_totalmetric is increased for eachreceived a duplicate entrylog. This is unexpected behaviour because there are no duplicate entries, just the same entry read twice. This renders theloki_canary_duplicate_entries_totalmetric unusable because the metric is puluted from reconnectionsDetails:
tail_max_durationconfiguration determines how frequently theloki_canary_duplicate_entries_totalvalue increases-intervalflags affect the number of duplicate entries. The lower the value, the more duplicate entries written pertail_max_duration.-waitflag seems to affect how far in the past it starts reading logs from. The higher the value, the more duplicate entries written pertail_max_duration.Examples:
-intervalof1sand-waitof1m(60 logs per 1 minute), myloki_canary_duplicate_entries_totalmetric increased by ~60 pertail_max_duration-intervalof10sand-waitof5m(30 logs per 5 minutes), myloki_canary_duplicate_entries_totalmetric increased by ~30 pertail_max_duration-intervalof10sand-waitof2m(12 logs per 2 minutes), myloki_canary_duplicate_entries_totalmetric increased by ~12 pertail_max_durationTo Reproduce
Steps to reproduce the behavior:
tail_max_duration(Loki configuration), causing Loki to drop Loki Canary's websocket connectionreceived an unexpected entryfollowed byreceived a duplicate entryloki_canary_duplicate_entries_totalmetric increaseExpected behavior
received an unexpected entry&received a duplicate entryerrors do not occur on reconnectionloki_canary_duplicate_entries_totalis not increased on reconnection, resulting in accurate graphs and usable alertsPossible solutions:
loki_canary_duplicate_entries_totalmetric when logs that have been encountered are met.loki_canary_duplicate_entries_totalshould only be incremented when the same log appears twice, not when a single log is read twice due to a reconnect.tail_max_durationvalue be high enough that reconnection never occurs. This solution is not preferred because it may cause other issues as outlined in the original pull request.Environment:
Loki & Loki Canary installed in monolithic mode on a VPS running Debian