Canary: Duplicate logs on reconnection

**Describe the bug**

Context:

Loki is configured to drop websocket connections after [tail_max_duration](https://grafana.com/docs/loki/latest/reference/loki-config-ref/#querier). The default of `1h` means that Loki Canary has to reconnect hourly. When Loki Canary reconnects, it does not continue reading logs from where it left off, but starts in the past (rereading logs).

Problems:

- This causes many Loki Canary error logs: `received an unexpected entry` followed by `received a duplicate entry`.
- The `loki_canary_duplicate_entries_total` metric is increased for each `received a duplicate entry` log. This is unexpected behaviour because there are no duplicate entries, just the same entry read twice. This renders the `loki_canary_duplicate_entries_total` metric unusable because the metric is puluted from reconnections

Details:

- The Loki `tail_max_duration` configuration determines how frequently the `loki_canary_duplicate_entries_total` value increases
- The Loki Canary `-interval` flags affect the number of duplicate entries. The lower the value, the more duplicate entries written per `tail_max_duration`.
- The Loki Canary `-wait` flag seems to affect how far in the past it starts reading logs from. The higher the value, the more duplicate entries written per `tail_max_duration`.

Examples:

- `-interval` of `1s` and `-wait` of `1m` (60 logs per 1 minute), my `loki_canary_duplicate_entries_total` metric increased by ~60 per `tail_max_duration`
- `-interval` of `10s` and `-wait` of `5m` (30 logs per 5 minutes), my `loki_canary_duplicate_entries_total` metric increased by ~30 per `tail_max_duration`
- `-interval` of `10s` and `-wait` of `2m` (12 logs per 2 minutes), my `loki_canary_duplicate_entries_total` metric increased by ~12 per `tail_max_duration`

**To Reproduce**
Steps to reproduce the behavior:
1. Started Loki (3.6.7)
2. Started Loki Canary (3.6.7)
3. Wait `tail_max_duration` (Loki configuration), causing Loki to drop Loki Canary's websocket connection
4. See Loki Canary error logs: `received an unexpected entry` followed by `received a duplicate entry`
5. See `loki_canary_duplicate_entries_total` metric increase

**Expected behavior**

- `received an unexpected entry` & `received a duplicate entry` errors do not occur on reconnection
- `loki_canary_duplicate_entries_total` is not increased on reconnection, resulting in accurate graphs and usable alerts

Possible solutions:

1. Keep track of the last time stamp. When recconection occurs, start reading the logs from the last time stamp
2. Continue to reread old logs, but do not increase the `loki_canary_duplicate_entries_total` metric when logs that have been encountered are met. `loki_canary_duplicate_entries_total` should only be incremented when the same log appears twice, not when a single log is read twice due to a reconnect.
3. Update Loki Canary documentation to recommend Loki's `tail_max_duration` value be high enough that reconnection never occurs. This solution is not preferred because it may cause other issues as outlined in the original [pull request](https://github.com/grafana/loki/pull/756).


**Environment:**

Loki & Loki Canary installed in monolithic mode on a VPS running Debian

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Canary: Duplicate logs on reconnection #21232

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Canary: Duplicate logs on reconnection #21232

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions