-
Notifications
You must be signed in to change notification settings - Fork 2.2k
fix: report accurate line numbers for chunked file scanning (#1876) #4615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ecurity#1876) Files processed through handlers are split into ~10KB chunks, but line numbers weren't tracked across boundaries, causing secrets to always report line 0. - Add LineNumber field to DataOrErr to propagate line context - Add ContentSize() to ChunkResult to avoid double-counting peek overlap - Clone metadata per chunk via populateChunkLineNumber() to prevent shared state
rosecodym
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR - it is beautifully precise, detailed, and documented. In addition to the few inline comments I left, I have this question: Does this change generated by the default handler to be 1-indexed instead of 0-indexed?
pkg/sources/chunker.go
Outdated
| } | ||
|
|
||
| // NewChunkResult creates a ChunkResult with the given data and content size. | ||
| // This is primarily used for testing to create mock chunk results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it have to be primarily used for this? It looks like we could use this constructor in the actual program logic, too. We're (slowly, gradually) trying to reduce the number of identical things done multiple different ways, so I think that if there's only one place in the program logic that ChunkResults are created, it would be a good idea to use this constructor there too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it should be more idiomatic especially since it's a little weird to use public constructor just for testing. I committed a change
| case *source_metadatapb.MetaData_Github: | ||
| m.Github.Line = lineNumber | ||
| case *source_metadatapb.MetaData_Gitlab: | ||
| m.Gitlab.Line = lineNumber | ||
| case *source_metadatapb.MetaData_Git: | ||
| m.Git.Line = lineNumber |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your PR says that these sources are "not affected" by this change. Are they included here just for completeness?
(To be clear, I would support this - although I think a comment warning about the oddity would be a good idea.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for calling this out! You're right, PR description was a bit inaccurate. I've added it for completeness because some git logic used it. I retested the binary file handling logic for git-based sources (local git, GitHub, GitLab) and confirmed this issue affects them as well.
Initially, I wasn't sure if
trufflehog/pkg/gitparse/gitparse.go
Lines 474 to 489 in 50aa469
| case isBinaryLine(latestState, line): | |
| latestState = BinaryFileLine | |
| path, ok := pathFromBinaryLine(line) | |
| if !ok { | |
| err = fmt.Errorf(`expected line to match 'Binary files a/fileA and b/fileB differ', got "%s"`, line) | |
| ctx.Logger().Error(err, "Failed to parse BinaryFileLine") | |
| latestState = ParseFailure | |
| continue | |
| } | |
| // Don't do anything if the file is deleted. (pathA has file path, pathB is /dev/null) | |
| if path != "" { | |
| currentDiff.PathB = path | |
| currentDiff.IsBinary = true | |
| } |
(Note: the commits are unverified — I used a random username for testing purposes.)
I took the same file from the original issue, packed it into a tarball, and committed it. Results:
Before fix
...
Found unverified result 🐷🔑❓
Detector Type: Github
Decoder Type: PLAIN
Raw result: ghs_012345678901234567890123456789012345
Version: 2
Rotation_guide: https://howtorotate.com/docs/tutorials/github/
Commit: 0f189b9a88e42452d0384a661f33da16b718b2e5
Email: T <[email protected]>
File: content/secret.txt
Line: 557 <- uses git chunker so it's correctly infrred
Link: https://github.com/GLEF1X/test-trufflehog-again/blob/0f189b9a88e42452d0384a661f33da16b718b2e5/content/secret.txt#L557
Repository: https://github.com/GLEF1X/test-trufflehog-again.git
Timestamp: 2026-01-07 01:32:05 +0000
...
Found unverified result 🐷🔑❓
Detector Type: Github
Decoder Type: PLAIN
Raw result: ghs_012345678901234567890123456789012345
Rotation_guide: https://howtorotate.com/docs/tutorials/github/
Version: 2
Commit: 0f189b9a88e42452d0384a661f33da16b718b2e5
Email: T <[email protected]>
File: secret.tar.gz
Line: 288 <- uses HandleBinary(thus HandleFile which had a bug with chunk line numbers)
Link: https://github.com/GLEF1X/test-trufflehog-again/blob/0f189b9a88e42452d0384a661f33da16b718b2e5/secret.tar.gz#L288
Repository: https://github.com/GLEF1X/test-trufflehog-again.git
Repository_local_path: /var/folders/hw/r5j4bcyd3472ccwjz0klh5840000gn/T/trufflehog-40414-225149517
Timestamp: 2026-01-07 01:32:05 +0000
...After fix
...
Found unverified result 🐷🔑❓
Detector Type: Github
Decoder Type: PLAIN
Raw result: ghs_012345678901234567890123456789012345
Rotation_guide: https://howtorotate.com/docs/tutorials/github/
Version: 2
Commit: 0f189b9a88e42452d0384a661f33da16b718b2e5
Email: T <[email protected]>
File: content/secret.txt
Line: 557
Link: https://github.com/GLEF1X/test-trufflehog-again/blob/0f189b9a88e42452d0384a661f33da16b718b2e5/content/secret.txt#L557
Repository: https://github.com/GLEF1X/test-trufflehog-again.git
Timestamp: 2026-01-07 01:32:05 +0000
...
Found unverified result 🐷🔑❓
Detector Type: Github
Decoder Type: PLAIN
Raw result: ghs_012345678901234567890123456789012345
Rotation_guide: https://howtorotate.com/docs/tutorials/github/
Version: 2
Commit: 0f189b9a88e42452d0384a661f33da16b718b2e5
Email: T <[email protected]>
File: secret.tar.gz
Line: 557 <- same line number
Link: https://github.com/GLEF1X/test-trufflehog-again/blob/0f189b9a88e42452d0384a661f33da16b718b2e5/secret.tar.gz#L557
Repository: https://github.com/GLEF1X/test-trufflehog-again.git
Repository_local_path: /var/folders/hw/r5j4bcyd3472ccwjz0klh5840000gn/T/trufflehog-40569-2815386483
Timestamp: 2026-01-07 01:32:05 +0000
...This applies to binary files in git, github and gitlab
trufflehog/pkg/sources/git/git.go
Lines 799 to 808 in 50aa469
| if err := HandleBinary(ctx, gitDir, reporter, chunkSkel, commitHash, fileName, s.skipArchives); err != nil { | |
| logger.Error( | |
| err, | |
| "error handling binary file", | |
| "commit", commitHash, | |
| "path", fileName, | |
| ) | |
| } | |
| continue | |
| } |
|
@rosecodym I believe that it was always 1-indexed. The engine normalizes to 1 trufflehog/pkg/engine/engine.go Lines 1347 to 1350 in 50aa469
|
rosecodym
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beautiful work! I flagged one last thing (that I should have caught the first time, sorry), but everything else looks great.
| assert.Truef( | ||
| t, | ||
| proto.Equal(chunk.SourceMetadata, tt.wantSourceMetadata), | ||
| "Source.Chunks() %s metadata mismatch: got %v, want %v", | ||
| tt.name, chunk.SourceMetadata, tt.wantSourceMetadata, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry I didn't catch this the first time around, but can you use cmp.Diff here? That would generate a more useful diff when there's a mismatch.
Description:
Fixes #1876
Problem
Files are processed in ~13KB chunks by default (10KB of actual data plus a 3KB peek into the next chunk). To infer the line the engine would take the line from chunk's metadata
SourceMetadata.Lineand would consider it absolute, that is it's relative to the entire file. The problem is it was never set anywhere. Since every chunk had line equals to 1 in metadata, all secret scan results appeared at line 1 + FragmentLineOffset() which resulted in line being relative to the chunk not the file itself.Solution
Track cumulative line numbers in
handleNonArchiveContent()by counting newlines as chunks are processed. EachDataOrErrnow carries the correct starting line, which flows through toSourceMetadata.Filesystem.Line, givingFragmentFirstLineAndLink()the correctfragStartfor the final calculation.Affected Sources
Directly fixed:
wget https://gist.githubusercontent.com/det/1526b4c16d0e07ac023d75c912a68658/raw/c3061c14a811205a65cbdcf0065bd3c11d88bfcb/test.txtNot affected (no
Linefield in proto):Linefield. I think it makes sense for S3 and GCS to report line numbers so it could be a good future change.[THIS PART WAS EDITED AFTER REALIZING GIT-BASED SOURCES ARE PARTIALLY AFFECTED]
Partially affected (own line tracking):
Checklist:
make test-community)?make lintthis requires golangci-lint)?