fix(detect): exempt source-code files from generic-keyword secret filter#719
Open
jippi wants to merge 1 commit intosafishamsi:v7from
Open
fix(detect): exempt source-code files from generic-keyword secret filter#719jippi wants to merge 1 commit intosafishamsi:v7from
jippi wants to merge 1 commit intosafishamsi:v7from
Conversation
The generic keyword filter ('credential|secret|passwd|password|token|
private_key') currently substring-matches against every filename regardless
of extension, silently dropping legitimate source-code files from the graph
when their name happens to contain one of those words:
password-reset.ts - email-template helper
AuthOauthAccessToken.model.ts - Sequelize model
test.search-tokenizer.ts - test fixture
JwtTokenValidator.java - auth code
password_manager.py - manager class
access-token.svelte - UI component
Source code files are CODE, not credential storage. If a project commits
real secrets into source files, that's a different security problem and
this filter is the wrong line of defense.
Two changes:
1. Pull the keyword pattern out into a named constant so _is_sensitive()
can reference it specifically.
2. _is_sensitive() now skips the keyword pattern when the file's
extension is in CODE_EXTENSIONS. The other patterns (.env*, .pem/.key
crypto extensions, id_rsa SSH keys, .netrc/.pgpass/.htpasswd,
aws_credentials/etc.) target real credential storage by extension or
exact name and continue to apply to all files regardless of code/data
classification.
3. Word boundaries on the keyword pattern (`\b...s?\b`) so substring
matches inside larger words ('tokenizer', 'secretary', 'passwordless',
'credentialing', 'AccessToken') don't trip the filter on data files
either. Optional `s?` covers plurals so canonical secret-storage
filenames like 'secrets.json' and 'database-credentials.yml' remain
flagged.
Tests
-----
26 new tests in tests/test_is_sensitive.py:
- 8 source-code files with auth keywords (.ts/.py/.js/.java/.svelte/.d.ts)
that previously were dropped now pass through.
- 12 specific-pattern checks (.env, .pem, id_rsa, .netrc, .pgpass,
aws_credentials, etc.) confirm those still flag as sensitive.
- 4 word-boundary edge cases confirm 'secretary', 'tokenizer',
'passwordless', 'credentialing' don't false-positive on non-code files.
- 4 data-file checks confirm 'secrets.json', 'credentials.yml',
'password.txt', 'api-token.json' still flag.
- 2 mixed cases confirm a code-extension exemption doesn't bypass
the .pem / .env-style structural patterns.
13 of 26 fail against unpatched v7 (real regression guards); the other 13
test pre-existing flag behavior that this PR preserves.
Validation
----------
On a real 1,873-file SvelteKit codebase: 3 files previously dropped from
the graph (password-reset.ts, AuthOauthAccessToken.model.ts,
test.search-tokenizer.ts) are now extracted and contribute their
outgoing import edges. Isolated .ts file count drops from 32 to 29.
The remaining isolated files are CLI entry points, test helpers loaded
by name from spec files, framework-loaded SvelteKit `+page.server.ts`
files, and root configs - all legitimately not imported by name.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #718. Independent of the other open PRs (#714, #715, #717) — applies cleanly off
v7.Problem
The sensitive-file heuristic in
_is_sensitivehas a generic keyword pattern that substring-matches every filename regardless of extension:This silently drops legitimate source-code files from the graph when their name mentions auth concepts:
password-reset.tsAuthOauthAccessToken.model.tstest.search-tokenizer.tsJwtTokenValidator.javapassword_manager.pyaccess-token.svelteThere are no actual secrets in any of these files — they're source code that happens to mention auth/token/password concepts in their filenames. They get silently dropped with no warning, exactly the failure mode that made this hard to diagnose (had to compare
collect_files()output to the manifest to notice the gap).Background — relationship to #436
#436 (closed by
4738e88) reported the same class of bug. The fix in4738e88removed the full-path check (or p.search(full)), so directory paths no longer trip the filter. The original issue's proposed word-boundary refinement to the keyword regex itself was not included, and the broader question of whether the keyword pattern should apply to source-code files at all was not addressed.Fix
Two complementary refinements to
graphify/detect.py:Code-extension exemption. The keyword pattern is now pulled into a named constant
_SENSITIVE_KEYWORD_PATTERNand skipped when the file's extension is inCODE_EXTENSIONS. Source files are CODE, not credential storage. The structural patterns (.env*,.pem,id_rsa,.netrc,aws_credentials, etc.) continue to apply to all files, so a hypotheticalfoo.ts.pemis still flagged correctly.Word boundaries with optional plural.
\b(credential|secret|...)s?\bso substring matches inside larger words (tokenizer,secretary,passwordless,credentialing,AccessToken) no longer false-positive on data files. Thes?covers plurals so canonical secret-storage filenames likesecrets.jsonanddatabase-credentials.ymlremain flagged.Coverage matrix
password-reset.ts(code)AuthOauthAccessToken.model.ts(code)tokenizer.ts(code, substring match)JwtTokenValidator.java(code)secrets.json(data, plural keyword)database-credentials.yml(data, plural)password.txt(data)secretary-notes.txt(data, substring)tokenizer-config.json(data, substring).env,.env.local,.envrc*.pem,*.key,*.p12,*.certid_rsa,id_ed25519.pub.netrc,.pgpass,.htpasswdaws_credentials,gcloud_credentials.jsonfoo.ts.pem(mixed).pemis structural)Tests (26 new, all passing)
tests/test_is_sensitive.py:.ts/.py/.js/.java/.svelte/.d.tsfiles with auth-related names pass through..env,.env.local,.pem,.p12,.cert,id_rsa,id_ed25519.pub,.netrc,.pgpass,.htpasswd,aws_credentials,gcloud_credentials.jsonstill flagged.secrets.json,database-credentials.yml,password.txt,api-token.json..pemor.env.13 of 26 fail on unpatched
v7— real regression guards. The other 13 verify pre-existing flag behavior is preserved.Validation on a real codebase
A 1,873-file SvelteKit app:
password-reset.ts,AuthOauthAccessToken.model.ts,test.search-tokenizer.ts) are now extracted and contribute their outgoing import edges..tsfile count: 32 → 29.Test plan
v7(regression-guard quality)