File source identity issue: checksum collisions and inode reuse can cause skipped data #25079
Description
The file source currently has two fingerprint modes:
checksum (first N lines)
device_and_inode
Both can fail in real production cases:
Checksum mode issue
Different files can share the same first N lines, so they get the same fingerprint.
When that happens, Vector may switch the watcher to the other path based on mtime but keep the old checkpoint offset, which can skip data in the new file.
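To make the collision concrete, here is a minimal sketch in Python. It is not Vector's actual fingerprinting code: the `checksum_fingerprint` helper, the use of CRC32, and the sample file contents are all assumptions made for illustration.

```python
import zlib


def checksum_fingerprint(data: bytes, lines: int = 1) -> int:
    """Illustrative fingerprint: CRC32 over the first `lines` lines of a file."""
    head = b"".join(data.splitlines(keepends=True)[:lines])
    return zlib.crc32(head)


# Two distinct files that happen to start with the same header line.
old_file = b"# app log v1\nrequest 1\nrequest 2\n"
new_file = b"# app log v1\nrequest 3\nrequest 4\nrequest 5\nrequest 6\n"

# Identical first line, so both files get the same identity.
same_identity = checksum_fingerprint(old_file) == checksum_fingerprint(new_file)

# The checkpoint was taken at the end of old_file. Reusing it for new_file
# means everything before that offset in new_file is never read.
checkpoint_offset = len(old_file)
skipped = new_file[:checkpoint_offset]
delivered = new_file[checkpoint_offset:]
```

In this sketch `skipped` contains the lines `request 3` and `request 4`, which belong only to the new file and would never be emitted.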
Inode mode issue
An inode can be reused after a file is deleted or rotated, especially under high file churn.
A new file can then appear with the same (device, inode) pair but hold a different generation of content.
Reusing the old checkpoint offset for this new generation can also skip data.
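The same failure can be simulated without touching a real filesystem. The sketch below assumes a checkpoint store keyed only by (device, inode); the `checkpoints` map, the `resume_offset` helper, and the device and inode numbers are hypothetical and do not reflect Vector's internal data structures.

```python
from typing import Dict, Tuple

FileKey = Tuple[int, int]  # (device, inode)

# Hypothetical checkpoint store: identity -> byte offset already read.
checkpoints: Dict[FileKey, int] = {}


def resume_offset(key: FileKey) -> int:
    """Return the stored offset for this identity, or 0 for an unseen file."""
    return checkpoints.get(key, 0)


# Generation 1 is read to the end and its offset is checkpointed.
key = (2049, 131077)  # made-up device and inode numbers
gen1 = b"a\nb\nc\n"
checkpoints[key] = len(gen1)

# Generation 1 is deleted; the filesystem hands the same inode to a new file.
gen2 = b"x\ny\nz\nw\n"

# The identity matches, so reading resumes at the stale offset.
offset = resume_offset(key)
lost = gen2[:offset]
delivered = gen2[offset:]
```

Here `lost` holds the first three lines of the new generation, which are silently skipped because nothing in the identity distinguishes the two generations.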
So checksum-only is unsafe for “same file” identity, and inode-only is also unsafe under inode reuse.
Feature request
Please add a safer composite file identity mode, for example:
primary key: device + inode
plus content generation validation (e.g. a checksum of the header or first bytes, stored in the checkpoint)
if validation fails, treat the file as a new generation and safely reset the resume offset
This would reduce data loss/skip risks while keeping backward compatibility (opt-in mode is fine).
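One possible shape for the requested mode, sketched in Python under stated assumptions: the `Checkpoint` record, the `HEADER_BYTES` window, and the `resume_offset` function are hypothetical names, and CRC32 stands in for whatever header checksum an implementation would choose.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Tuple

FileKey = Tuple[int, int]  # (device, inode) as the primary key
HEADER_BYTES = 16          # hypothetical size of the validated prefix


@dataclass
class Checkpoint:
    offset: int      # bytes already read
    header_crc: int  # checksum of the file's first HEADER_BYTES bytes


def header_crc(data: bytes) -> int:
    return zlib.crc32(data[:HEADER_BYTES])


def resume_offset(store: Dict[FileKey, Checkpoint], key: FileKey, data: bytes) -> int:
    """Resume only if the stored header checksum still matches the file."""
    cp = store.get(key)
    if cp is None or cp.header_crc != header_crc(data) or cp.offset > len(data):
        # Unknown file, new generation, or truncation: start from the beginning.
        store[key] = Checkpoint(0, header_crc(data))
        return 0
    return cp.offset


store: Dict[FileKey, Checkpoint] = {}
key = (2049, 131077)  # made-up device and inode numbers

gen1 = b"2024-01-01 boot sequence\nline\n"
store[key] = Checkpoint(len(gen1), header_crc(gen1))
same_generation = resume_offset(store, key, gen1)  # header matches, resume

gen2 = b"2024-02-02 other content here\nmore\n"
new_generation = resume_offset(store, key, gen2)   # header differs, reset to 0
```

This sketch still cannot tell generations apart when the inode is reused and the first bytes are identical, and a file shorter than `HEADER_BYTES` at checkpoint time would be re-read once it grows, so a real implementation would need to handle short headers explicitly.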
Use Cases
No response
Attempted Solutions
No response
Proposal
No response
References
No response
Version
No response