Skip to content

Conversation

@GLEF1X
Copy link

@GLEF1X GLEF1X commented Dec 23, 2025

Description:

Fixes #1876

Problem

Files are processed in ~13KB chunks by default (10KB of actual data plus a 3KB peek into the next chunk). To infer the line the engine would take the line from chunk's metadata SourceMetadata.Line and would consider it absolute, that is it's relative to the entire file. The problem is it was never set anywhere. Since every chunk had line equals to 1 in metadata, all secret scan results appeared at line 1 + FragmentLineOffset() which resulted in line being relative to the chunk not the file itself.

Solution

Track cumulative line numbers in handleNonArchiveContent() by counting newlines as chunks are processed. Each DataOrErr now carries the correct starting line, which flows through to SourceMetadata.Filesystem.Line, giving FragmentFirstLineAndLink() the correct fragStart for the final calculation.

Affected Sources

Directly fixed:

  • Filesystem - now reports accurate line numbers. I have tested it using the test file provided in the issue wget https://gist.githubusercontent.com/det/1526b4c16d0e07ac023d75c912a68658/raw/c3061c14a811205a65cbdcf0065bd3c11d88bfcb/test.txt

Not affected (no Line field in proto):

  • S3, GCS, Jenkins, stdin - use handlers but their metadata protos lack a Line field. I think it makes sense for S3 and GCS to report line numbers so it could be a good future change.

[THIS PART WAS EDITED AFTER REALIZING GIT-BASED SOURCES ARE PARTIALLY AFFECTED]
Partially affected (own line tracking):

  • Git, GitHub, GitLab - regular text diffs use git-based scanning with built-in line tracking (unaffected). However, binary archives (tar.gz, zip) go through HandleBinary() → handlers.HandleFile() and benefit from this fix.
  • others like BitBucket use their own implementations so not affected

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint this requires golangci-lint)?

…ecurity#1876)

  Files processed through handlers are split into ~10KB chunks, but line
  numbers weren't tracked across boundaries, causing secrets to always
  report line 0.

  - Add LineNumber field to DataOrErr to propagate line context
  - Add ContentSize() to ChunkResult to avoid double-counting peek overlap
  - Clone metadata per chunk via populateChunkLineNumber() to prevent
    shared state
@GLEF1X GLEF1X requested a review from a team December 23, 2025 06:23
@GLEF1X GLEF1X requested review from a team as code owners December 23, 2025 06:23
@CLAassistant
Copy link

CLAassistant commented Dec 23, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR - it is beautifully precise, detailed, and documented. In addition to the few inline comments I left, I have this question: Does this change generated by the default handler to be 1-indexed instead of 0-indexed?

}

// NewChunkResult creates a ChunkResult with the given data and content size.
// This is primarily used for testing to create mock chunk results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it have to be primarily used for this? It looks like we could use this constructor in the actual program logic, too. We're (slowly, gradually) trying to reduce the number of identical things done multiple different ways, so I think that if there's only one place in the program logic that ChunkResults are created, it would be a good idea to use this constructor there too.

Copy link
Author

@GLEF1X GLEF1X Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it should be more idiomatic especially since it's a little weird to use public constructor just for testing. I committed a change

Comment on lines +465 to +470
case *source_metadatapb.MetaData_Github:
m.Github.Line = lineNumber
case *source_metadatapb.MetaData_Gitlab:
m.Gitlab.Line = lineNumber
case *source_metadatapb.MetaData_Git:
m.Git.Line = lineNumber
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your PR says that these sources are "not affected" by this change. Are they included here just for completeness?

(To be clear, I would support this - although I think a comment warning about the oddity would be a good idea.)

Copy link
Author

@GLEF1X GLEF1X Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for calling this out! You're right, PR description was a bit inaccurate. I've added it for completeness because some git logic used it. I retested the binary file handling logic for git-based sources (local git, GitHub, GitLab) and confirmed this issue affects them as well.

Initially, I wasn't sure if

case isBinaryLine(latestState, line):
latestState = BinaryFileLine
path, ok := pathFromBinaryLine(line)
if !ok {
err = fmt.Errorf(`expected line to match 'Binary files a/fileA and b/fileB differ', got "%s"`, line)
ctx.Logger().Error(err, "Failed to parse BinaryFileLine")
latestState = ParseFailure
continue
}
// Don't do anything if the file is deleted. (pathA has file path, pathB is /dev/null)
if path != "" {
currentDiff.PathB = path
currentDiff.IsBinary = true
}
was legacy code or actively used, and I couldn't reproduce the issue at the time. I've since been able to reproduce it using this test repo: https://github.com/GLEF1X/test-trufflehog-again

(Note: the commits are unverified — I used a random username for testing purposes.)

I took the same file from the original issue, packed it into a tarball, and committed it. Results:

Before fix

...
Found unverified result 🐷🔑❓
Detector Type: Github
Decoder Type: PLAIN
Raw result: ghs_012345678901234567890123456789012345
Version: 2
Rotation_guide: https://howtorotate.com/docs/tutorials/github/
Commit: 0f189b9a88e42452d0384a661f33da16b718b2e5
Email: T <[email protected]>
File: content/secret.txt
Line: 557 <- uses git chunker so it's correctly infrred
Link: https://github.com/GLEF1X/test-trufflehog-again/blob/0f189b9a88e42452d0384a661f33da16b718b2e5/content/secret.txt#L557
Repository: https://github.com/GLEF1X/test-trufflehog-again.git
Timestamp: 2026-01-07 01:32:05 +0000
...
Found unverified result 🐷🔑❓
Detector Type: Github
Decoder Type: PLAIN
Raw result: ghs_012345678901234567890123456789012345
Rotation_guide: https://howtorotate.com/docs/tutorials/github/
Version: 2
Commit: 0f189b9a88e42452d0384a661f33da16b718b2e5
Email: T <[email protected]>
File: secret.tar.gz
Line: 288 <- uses HandleBinary(thus HandleFile which had a bug with chunk line numbers)
Link: https://github.com/GLEF1X/test-trufflehog-again/blob/0f189b9a88e42452d0384a661f33da16b718b2e5/secret.tar.gz#L288
Repository: https://github.com/GLEF1X/test-trufflehog-again.git
Repository_local_path: /var/folders/hw/r5j4bcyd3472ccwjz0klh5840000gn/T/trufflehog-40414-225149517
Timestamp: 2026-01-07 01:32:05 +0000
...

After fix

...
Found unverified result 🐷🔑❓
Detector Type: Github
Decoder Type: PLAIN
Raw result: ghs_012345678901234567890123456789012345
Rotation_guide: https://howtorotate.com/docs/tutorials/github/
Version: 2
Commit: 0f189b9a88e42452d0384a661f33da16b718b2e5
Email: T <[email protected]>
File: content/secret.txt
Line: 557
Link: https://github.com/GLEF1X/test-trufflehog-again/blob/0f189b9a88e42452d0384a661f33da16b718b2e5/content/secret.txt#L557
Repository: https://github.com/GLEF1X/test-trufflehog-again.git
Timestamp: 2026-01-07 01:32:05 +0000
...
Found unverified result 🐷🔑❓
Detector Type: Github
Decoder Type: PLAIN
Raw result: ghs_012345678901234567890123456789012345
Rotation_guide: https://howtorotate.com/docs/tutorials/github/
Version: 2
Commit: 0f189b9a88e42452d0384a661f33da16b718b2e5
Email: T <[email protected]>
File: secret.tar.gz
Line: 557 <- same line number
Link: https://github.com/GLEF1X/test-trufflehog-again/blob/0f189b9a88e42452d0384a661f33da16b718b2e5/secret.tar.gz#L557
Repository: https://github.com/GLEF1X/test-trufflehog-again.git
Repository_local_path: /var/folders/hw/r5j4bcyd3472ccwjz0klh5840000gn/T/trufflehog-40569-2815386483
Timestamp: 2026-01-07 01:32:05 +0000
...

This applies to binary files in git, github and gitlab

if err := HandleBinary(ctx, gitDir, reporter, chunkSkel, commitHash, fileName, s.skipArchives); err != nil {
logger.Error(
err,
"error handling binary file",
"commit", commitHash,
"path", fileName,
)
}
continue
}

@GLEF1X
Copy link
Author

GLEF1X commented Jan 7, 2026

@rosecodym I believe that it was always 1-indexed. The engine normalizes to 1

// Ensure we maintain 1-based line indexing if fragmentStart is not set or is 0.
if *fragmentStart == 0 {
*fragmentStart = 1
}

Copy link
Contributor

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful work! I flagged one last thing (that I should have caught the first time, sorry), but everything else looks great.

Comment on lines +88 to +93
assert.Truef(
t,
proto.Equal(chunk.SourceMetadata, tt.wantSourceMetadata),
"Source.Chunks() %s metadata mismatch: got %v, want %v",
tt.name, chunk.SourceMetadata, tt.wantSourceMetadata,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry I didn't catch this the first time around, but can you use cmp.Diff here? That would generate a more useful diff when there's a mismatch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Secrets are reported on the wrong line

4 participants