File source identity issue: checksum collisions and inode reuse can cause skipped data #25079

@Rentu

Description

The file source currently has two fingerprint modes:

checksum (first N lines)
device_and_inode

Both can fail in real production cases:

Checksum mode issue
Different files can share the same first N lines, in which case they get the same fingerprint.
When that happens, Vector may switch the watcher to the other path based on mtime but keep the old checkpoint offset, which can skip data in the new file.
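
To make the collision concrete, here is a minimal sketch. It assumes a fingerprint computed as a CRC32 over the first N lines; `first_lines_fingerprint` is a hypothetical stand-in for illustration, not Vector's actual implementation.

```python
import zlib


def first_lines_fingerprint(data: bytes, lines: int = 1) -> int:
    """CRC32 over the first `lines` lines (hypothetical stand-in, not Vector's code)."""
    end = 0
    for _ in range(lines):
        newline = data.find(b"\n", end)
        if newline == -1:
            end = len(data)  # fewer lines than requested: hash what exists
            break
        end = newline + 1
    return zlib.crc32(data[:end])


# Two unrelated files that happen to start with the same header line.
old_file = b"timestamp,level,message\nrow-a1\nrow-a2\n"
new_file = b"timestamp,level,message\nrow-b1\nrow-b2\nrow-b3\n"

same_fingerprint = first_lines_fingerprint(old_file) == first_lines_fingerprint(new_file)

# Suppose old_file was fully read, so its checkpoint offset is its length.
checkpoint_offset = len(old_file)
header_len = old_file.index(b"\n") + 1

# If that offset is reused for new_file, these bytes are never emitted.
skipped = new_file[header_len:checkpoint_offset]
```

Here both files get the same fingerprint, and reusing the old offset silently drops the first two data rows of the new file.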

Inode mode issue
Inodes can be reused after file deletion or rotation, especially under high churn.
A new file may then appear with the same (device, inode) pair but be a different content generation, i.e. unrelated data.
Resuming this new generation from the old checkpoint offset can likewise skip data.
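
The inode case can be sketched in the same way. Real inode reuse depends on the filesystem and is not deterministic, so the example below only simulates it: it reads a real (st_dev, st_ino) pair via `os.stat`, then assumes a later file is handed the same pair. The `checkpoints` dict and `file_key` helper are hypothetical names for illustration.

```python
import os
import tempfile
from typing import Dict, Tuple

FileKey = Tuple[int, int]  # (device, inode)


def file_key(path: str) -> FileKey:
    """Identity as device_and_inode would see it."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


checkpoints: Dict[FileKey, int] = {}  # hypothetical checkpoint store

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app.log")
    with open(path, "wb") as f:
        f.write(b"old-1\nold-2\n")
    key = file_key(path)
    # The old file was read to the end, so the checkpoint is its size.
    checkpoints[key] = os.path.getsize(path)

# Simulated reuse (assumption): after deletion, the filesystem assigns the
# same (device, inode) pair to a brand-new file with different content.
reused_key = key
new_content = b"new-1\nnew-2\nnew-3\n"

resume_at = checkpoints.get(reused_key, 0)
skipped = new_content[:resume_at]  # lost because the key alone looked familiar
```

Nothing in the key distinguishes the two generations, so the reader resumes at byte 12 of a file it has never seen.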

So checksum-only is unsafe for “same file” identity, and inode-only is also unsafe under inode reuse.

Feature request
Please add a safer composite file identity mode, for example:

primary key: device + inode
plus content generation validation (e.g. a header checksum or the first bytes stored in the checkpoint)
if validation fails, treat the file as a new generation and reset the resume offset safely

This would reduce the risk of lost or skipped data while keeping backward compatibility (an opt-in mode is fine).
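
A rough sketch of the proposed behavior, under stated assumptions: the checkpoint stores a CRC32 of the file's first bytes next to the offset, and the names `Checkpoint`, `HEADER_BYTES`, `save_checkpoint`, and `resume_offset` are all hypothetical. It operates on in-memory bytes to stay self-contained and is not a proposal for the actual implementation.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Tuple

FileKey = Tuple[int, int]  # (device, inode) primary key
HEADER_BYTES = 64          # hypothetical size of the validated prefix


@dataclass
class Checkpoint:
    offset: int
    header_len: int  # how many bytes were hashed when the checkpoint was saved
    header_crc: int


def save_checkpoint(store: Dict[FileKey, Checkpoint], key: FileKey,
                    data: bytes, offset: int) -> None:
    head = data[:HEADER_BYTES]
    store[key] = Checkpoint(offset, len(head), zlib.crc32(head))


def resume_offset(store: Dict[FileKey, Checkpoint], key: FileKey,
                  data: bytes) -> int:
    cp = store.get(key)
    if cp is None:
        return 0  # never seen this (device, inode)
    # Re-hash exactly the prefix length hashed earlier, so that a file
    # which only grew by appends still validates.
    if len(data) < cp.header_len:
        return 0  # shorter than the recorded header: new generation
    if zlib.crc32(data[:cp.header_len]) != cp.header_crc:
        return 0  # same key, different content: new generation
    if cp.offset > len(data):
        return 0  # truncated below the checkpoint: restart safely
    return cp.offset


store: Dict[FileKey, Checkpoint] = {}
key = (1, 42)  # made-up (device, inode) pair

gen1 = b"gen-1 header\n" + b"a" * 100
save_checkpoint(store, key, gen1, len(gen1))

appended = gen1 + b"more\n"                # same generation, grown by append
gen2 = b"gen-2 header\n" + b"b" * 200      # inode reused by a new generation

offset_same_generation = resume_offset(store, key, appended)
offset_new_generation = resume_offset(store, key, gen2)
```

With this check, the appended file resumes where it left off, while the reused inode starts again from offset 0. A prefix checksum can still collide when generations share identical headers, so the prefix length would need to be configurable.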
