fix(detect): exempt source-code files from generic-keyword secret filter #719

Open
jippi wants to merge 1 commit into safishamsi:v7 from jippi:fix/is-sensitive-source-files


@jippi commented May 4, 2026

Code written by Claude Code, reviewed by me (I'm not super familiar with Python, but I'm an SWE with 20+ years of experience).

Fixes #718. Independent of the other open PRs (#714, #715, #717) — applies cleanly off v7.

Problem

The sensitive-file heuristic in _is_sensitive has a generic keyword pattern that substring-matches every filename regardless of extension:

re.compile(r'(credential|secret|passwd|password|token|private_key)', re.IGNORECASE),

This silently drops legitimate source-code files from the graph when their name mentions auth concepts:

| Filename | Real content |
| --- | --- |
| `password-reset.ts` | Email-template helper |
| `AuthOauthAccessToken.model.ts` | Sequelize model |
| `test.search-tokenizer.ts` | Test fixture |
| `JwtTokenValidator.java` | Auth code |
| `password_manager.py` | Manager class |
| `access-token.svelte` | UI component |
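
To make the false positives concrete, here is a minimal sketch that runs the keyword pattern quoted above against the six filenames in the table. The variable names (`OLD_KEYWORD_PATTERN`, `filenames`, `flagged`) are illustrative and not taken from `graphify/detect.py`.

```python
import re

# The generic keyword pattern as quoted above (pre-patch behaviour).
OLD_KEYWORD_PATTERN = re.compile(
    r'(credential|secret|passwd|password|token|private_key)',
    re.IGNORECASE,
)

filenames = [
    "password-reset.ts",
    "AuthOauthAccessToken.model.ts",
    "test.search-tokenizer.ts",
    "JwtTokenValidator.java",
    "password_manager.py",
    "access-token.svelte",
]

# search() matches anywhere in the string, so a keyword embedded in a
# longer identifier ("tokenizer", "AccessToken") is enough to flag the file.
flagged = [name for name in filenames if OLD_KEYWORD_PATTERN.search(name)]

for name in flagged:
    print(f"flagged: {name}")
```

Every one of the six names is flagged, because the pattern has no anchoring and no awareness of the file extension.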

None of these files contains actual secrets; they are source code whose filenames happen to mention auth, token, or password concepts. They are dropped silently, with no warning, which is what made this hard to diagnose: I had to compare collect_files() output against the manifest to notice the gap.

Background — relationship to #436

#436 (closed by 4738e88) reported the same class of bug. The fix in 4738e88 removed the full-path check (or p.search(full)), so directory paths no longer trip the filter. The original issue's proposed word-boundary refinement to the keyword regex itself was not included, and the broader question of whether the keyword pattern should apply to source-code files at all was not addressed.
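
The difference between the two checks can be sketched as follows. This is a hypothetical reconstruction for illustration only: the helper names and the example paths are assumptions, and the pre-4738e88 code is known here only through the `or p.search(full)` fragment quoted above.

```python
import re
from pathlib import Path

KEYWORD = re.compile(
    r'(credential|secret|passwd|password|token|private_key)',
    re.IGNORECASE,
)

def flagged_with_full_path(path: Path) -> bool:
    # Assumed shape of the old check: file name OR full path.
    return bool(KEYWORD.search(path.name) or KEYWORD.search(path.as_posix()))

def flagged_name_only(path: Path) -> bool:
    # After 4738e88: only the final path component is examined.
    return bool(KEYWORD.search(path.name))

doc = Path("docs/token-auth/overview.md")   # hypothetical: keyword in a directory
src = Path("src/utils/tokenizer.ts")        # hypothetical: keyword in the file name

print(flagged_with_full_path(doc), flagged_name_only(doc))
print(flagged_name_only(src))
```

The directory case is resolved by the name-only check, but a file such as `tokenizer.ts` is still flagged, which is the gap this PR addresses.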

Fix

Two complementary refinements to graphify/detect.py:

  1. Code-extension exemption. The keyword pattern is now pulled into a named constant _SENSITIVE_KEYWORD_PATTERN and skipped when the file's extension is in CODE_EXTENSIONS. Source files are CODE, not credential storage. The structural patterns (.env*, .pem, id_rsa, .netrc, aws_credentials, etc.) continue to apply to all files, so a hypothetical foo.ts.pem is still flagged correctly.

  2. Word boundaries with optional plural. \b(credential|secret|...)s?\b so substring matches inside larger words (tokenizer, secretary, passwordless, credentialing, AccessToken) no longer false-positive on data files. The s? covers plurals so canonical secret-storage filenames like secrets.json and database-credentials.yml remain flagged.

_SENSITIVE_KEYWORD_PATTERN = re.compile(
    r'\b(credential|secret|passwd|password|token|private_key)s?\b',
    re.IGNORECASE,
)

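def _is_sensitive(path: Path) -> bool:
    # Only the file name is inspected, never the directory path (see #436).
    name = path.name
    is_code = path.suffix.lower() in CODE_EXTENSIONS
    for pattern in _SENSITIVE_PATTERNS:
        # The generic keyword pattern does not apply to source-code files.
        if pattern is _SENSITIVE_KEYWORD_PATTERN and is_code:
            continue
        # Every remaining pattern, structural or keyword, is matched against the name.
        if pattern.search(name):
            return True
    return False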
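
The word-boundary behaviour of the keyword pattern can be checked in isolation. The sketch below uses the regex exactly as shown above; the `samples` mapping and the `results` name are illustrative, and `passwordless.md`, `AccessToken.json`, and `api_token.json` are hypothetical filenames chosen to exercise the boundaries.

```python
import re

KEYWORD = re.compile(
    r'\b(credential|secret|passwd|password|token|private_key)s?\b',
    re.IGNORECASE,
)

# filename -> whether the keyword pattern alone is expected to match
samples = {
    "secrets.json": True,               # plural covered by s?
    "database-credentials.yml": True,   # "-" is a non-word character
    "password.txt": True,
    "api-token.json": True,
    "secretary-notes.txt": False,       # "secret" runs on into "ary"
    "tokenizer-config.json": False,     # "token" runs on into "izer"
    "passwordless.md": False,
    "AccessToken.json": False,          # no boundary between "s" and "T"
    "api_token.json": False,            # "_" is a word character in Python's re
}

results = {name: bool(KEYWORD.search(name)) for name in samples}

for name, matched in results.items():
    print(f"{name:28} {'match' if matched else 'no match'}")
```

One consequence worth knowing: because `_` counts as a word character, underscore-separated names such as `api_token.json` are not matched by this pattern on its own. Names like `aws_credentials` are covered by their dedicated structural patterns instead.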

Coverage matrix

| File | Before | After |
| --- | --- | --- |
| `password-reset.ts` (code) | dropped | indexed |
| `AuthOauthAccessToken.model.ts` (code) | dropped | indexed |
| `tokenizer.ts` (code, substring match) | dropped | indexed |
| `JwtTokenValidator.java` (code) | dropped | indexed |
| `secrets.json` (data, plural keyword) | flagged | flagged |
| `database-credentials.yml` (data, plural) | flagged | flagged |
| `password.txt` (data) | flagged | flagged |
| `secretary-notes.txt` (data, substring) | dropped (false positive) | indexed |
| `tokenizer-config.json` (data, substring) | dropped (false positive) | indexed |
| `.env`, `.env.local`, `.envrc` | flagged | flagged |
| `*.pem`, `*.key`, `*.p12`, `*.cert` | flagged | flagged |
| `id_rsa`, `id_ed25519.pub` | flagged | flagged |
| `.netrc`, `.pgpass`, `.htpasswd` | flagged | flagged |
| `aws_credentials`, `gcloud_credentials.json` | flagged | flagged |
| `foo.ts.pem` (mixed) | flagged | flagged (`.pem` is structural) |

Tests (26 new, all passing)

tests/test_is_sensitive.py:

  • Source-code exemption (8): legitimate .ts/.py/.js/.java/.svelte/.d.ts files with auth-related names pass through.
  • Specific-pattern preservation (12): .env, .env.local, .pem, .p12, .cert, id_rsa, id_ed25519.pub, .netrc, .pgpass, .htpasswd, aws_credentials, gcloud_credentials.json still flagged.
  • Data-file keyword still flagged (4): secrets.json, database-credentials.yml, password.txt, api-token.json.
  • Word-boundary correctness on data files (4): substring-but-not-real-word cases pass through.
  • Mixed structural cases (2): code-extension exemption is keyword-pattern-only and doesn't override structural patterns like .pem or .env.

13 of 26 fail on unpatched v7 — real regression guards. The other 13 verify pre-existing flag behavior is preserved.

Validation on a real codebase

A 1,873-file SvelteKit app:

  • 3 source files previously silently dropped (password-reset.ts, AuthOauthAccessToken.model.ts, test.search-tokenizer.ts) are now extracted and contribute their outgoing import edges.
  • Isolated .ts file count: 32 → 29.
  • Total nodes: +6 (the new file nodes plus their per-symbol nodes).

Test plan

  • All 26 new tests pass
  • 13 of 26 fail on unpatched v7 (regression-guard quality)
  • Full test suite: 575 pass, 7 pre-existing failures unrelated
  • Smoke test on real 1,873-file SvelteKit project — 3 previously-dropped files now in graph

The generic keyword filter ('credential|secret|passwd|password|token|
private_key') currently substring-matches against every filename regardless
of extension, silently dropping legitimate source-code files from the graph
when their name happens to contain one of those words:

  password-reset.ts                  - email-template helper
  AuthOauthAccessToken.model.ts      - Sequelize model
  test.search-tokenizer.ts           - test fixture
  JwtTokenValidator.java             - auth code
  password_manager.py                - manager class
  access-token.svelte                - UI component

Source code files are CODE, not credential storage. If a project commits
real secrets into source files, that's a different security problem and
this filter is the wrong line of defense.

Three changes:

1. Pull the keyword pattern out into a named constant so _is_sensitive()
   can reference it specifically.

2. _is_sensitive() now skips the keyword pattern when the file's
   extension is in CODE_EXTENSIONS. The other patterns (.env*, .pem/.key
   crypto extensions, id_rsa SSH keys, .netrc/.pgpass/.htpasswd,
   aws_credentials/etc.) target real credential storage by extension or
   exact name and continue to apply to all files regardless of code/data
   classification.

3. Word boundaries on the keyword pattern (`\b...s?\b`) so substring
   matches inside larger words ('tokenizer', 'secretary', 'passwordless',
   'credentialing', 'AccessToken') don't trip the filter on data files
   either. Optional `s?` covers plurals so canonical secret-storage
   filenames like 'secrets.json' and 'database-credentials.yml' remain
   flagged.

Tests
-----
26 new tests in tests/test_is_sensitive.py:
  - 8 source-code files with auth keywords (.ts/.py/.js/.java/.svelte/.d.ts)
    that previously were dropped now pass through.
  - 12 specific-pattern checks (.env, .pem, id_rsa, .netrc, .pgpass,
    aws_credentials, etc.) confirm those still flag as sensitive.
  - 4 word-boundary edge cases confirm 'secretary', 'tokenizer',
    'passwordless', 'credentialing' don't false-positive on non-code files.
  - 4 data-file checks confirm 'secrets.json', 'credentials.yml',
    'password.txt', 'api-token.json' still flag.
  - 2 mixed cases confirm a code-extension exemption doesn't bypass
    the .pem / .env-style structural patterns.

13 of 26 fail against unpatched v7 (real regression guards); the other 13
test pre-existing flag behavior that this PR preserves.

Validation
----------
On a real 1,873-file SvelteKit codebase: 3 files previously dropped from
the graph (password-reset.ts, AuthOauthAccessToken.model.ts,
test.search-tokenizer.ts) are now extracted and contribute their
outgoing import edges. Isolated .ts file count drops from 32 to 29.

The remaining isolated files are CLI entry points, test helpers loaded
by name from spec files, framework-loaded SvelteKit `+page.server.ts`
files, and root configs - all legitimately not imported by name.