fix(detect): exempt source-code files from generic-keyword secret filter by jippi · Pull Request #719 · safishamsi/graphify

jippi · 2026-05-04T21:05:52Z

Code written by Claude Code, reviewed by me (I'm not super familiar with Python, but I'm a SWE with 20+ years of experience).

Fixes #718. Independent of the other open PRs (#714, #715, #717) — applies cleanly off v7.

Problem

The sensitive-file heuristic in _is_sensitive has a generic keyword pattern that substring-matches every filename regardless of extension:

re.compile(r'(credential|secret|passwd|password|token|private_key)', re.IGNORECASE),

This silently drops legitimate source-code files from the graph when their name mentions auth concepts:

Filename	Real content
`password-reset.ts`	Email-template helper
`AuthOauthAccessToken.model.ts`	Sequelize model
`test.search-tokenizer.ts`	Test fixture
`JwtTokenValidator.java`	Auth code
`password_manager.py`	Manager class
`access-token.svelte`	UI component

None of these files contains actual secrets: they are source code whose filenames happen to mention auth, token, or password concepts. They are dropped silently, with no warning, which is what made this hard to diagnose (I had to compare collect_files() output against the manifest to notice the gap).
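To make the failure concrete, the snippet below runs the pre-patch keyword regex quoted in this section against the filenames from the table. Only the regex comes from this PR; the `OLD_KEYWORD_PATTERN` name, the `FILENAMES` list, and the loop are illustrative.

```python
import re

# Pre-patch keyword pattern, copied from the Problem section.
OLD_KEYWORD_PATTERN = re.compile(
    r'(credential|secret|passwd|password|token|private_key)',
    re.IGNORECASE,
)

# Filenames from the table; none of them stores a secret.
FILENAMES = [
    "password-reset.ts",
    "AuthOauthAccessToken.model.ts",
    "test.search-tokenizer.ts",
    "JwtTokenValidator.java",
    "password_manager.py",
    "access-token.svelte",
]

# Map each filename to the keyword that tripped the filter (None if no match).
matched = {}
for name in FILENAMES:
    m = OLD_KEYWORD_PATTERN.search(name)
    matched[name] = m.group(1) if m else None

for name, keyword in matched.items():
    print(f"{name}: {keyword}")
```

Every filename in the list produces a match, because an unanchored search finds the keyword anywhere inside the name.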

Background — relationship to #436

#436 (closed by 4738e88) reported the same class of bug. The fix in 4738e88 removed the full-path check (or p.search(full)), so directory paths no longer trip the filter. The original issue's proposed word-boundary refinement to the keyword regex itself was not included, and the broader question of whether the keyword pattern should apply to source-code files at all was not addressed.
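A rough sketch of the difference between the two checks, assuming the pre-4738e88 logic amounted to "name matches or full path matches". The exact code from that commit is not reproduced here; the function names and the example paths are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Unanchored keyword pattern, as quoted in the Problem section.
KEYWORD_PATTERN = re.compile(
    r'(credential|secret|passwd|password|token|private_key)',
    re.IGNORECASE,
)


def flagged_by_name(path: PurePosixPath) -> bool:
    """Name-only check: a keyword in a parent directory is ignored."""
    return KEYWORD_PATTERN.search(path.name) is not None


def flagged_by_name_or_full_path(path: PurePosixPath) -> bool:
    """Name-or-full-path check: a keyword anywhere in the path trips it."""
    return flagged_by_name(path) or KEYWORD_PATTERN.search(str(path)) is not None


# Hypothetical paths: the keyword sits in a directory name, then in the filename.
in_directory = PurePosixPath("services/token-service/README.md")
in_filename = PurePosixPath("services/token-service/password-reset.ts")
```

With the name-only check, `in_directory` passes through, but `in_filename` is still flagged, which is the remaining gap this PR targets.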

Fix

Two complementary refinements to graphify/detect.py:

Code-extension exemption. The keyword pattern is now pulled into a named constant _SENSITIVE_KEYWORD_PATTERN and skipped when the file's extension is in CODE_EXTENSIONS. Source files are CODE, not credential storage. The structural patterns (.env*, .pem, id_rsa, .netrc, aws_credentials, etc.) continue to apply to all files, so a hypothetical foo.ts.pem is still flagged correctly.
Word boundaries with optional plural. \b(credential|secret|...)s?\b so substring matches inside larger words (tokenizer, secretary, passwordless, credentialing, AccessToken) no longer false-positive on data files. The s? covers plurals so canonical secret-storage filenames like secrets.json and database-credentials.yml remain flagged.
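The word-boundary behavior can be checked in isolation. The sketch below uses the regex proposed in this PR; `keyword_hit` and the `.md`/`.txt` filenames built around "passwordless" and "AccessToken" are illustrative, not taken from the test suite. One detail worth keeping in mind: Python's `\b` treats `_` as a word character, so a keyword adjacent to an underscore does not match either. Whether another entry in `_SENSITIVE_PATTERNS` catches such names is not shown in this PR.

```python
import re

# Keyword pattern as proposed in this PR.
NEW_KEYWORD_PATTERN = re.compile(
    r'\b(credential|secret|passwd|password|token|private_key)s?\b',
    re.IGNORECASE,
)


def keyword_hit(name: str) -> bool:
    """Return True when the word-bounded keyword pattern matches the name."""
    return NEW_KEYWORD_PATTERN.search(name) is not None


# Keyword stands alone between non-word characters: still matched.
still_matched = [
    "secrets.json",
    "database-credentials.yml",
    "password.txt",
    "api-token.json",
]

# Keyword is embedded in a longer word: no longer matched.
no_longer_matched = [
    "secretary-notes.txt",
    "tokenizer-config.json",
    "passwordless.md",
    "AccessToken.txt",
]

# Hypothetical filename: "_" is a word character, so there is no boundary
# between "_" and "token", and the pattern does not match.
underscore_case = "api_token.json"

for name in still_matched + no_longer_matched + [underscore_case]:
    print(f"{name}: {keyword_hit(name)}")
```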

_SENSITIVE_KEYWORD_PATTERN = re.compile(
    r'\b(credential|secret|passwd|password|token|private_key)s?\b',
    re.IGNORECASE,
)

def _is_sensitive(path: Path) -> bool:
    name = path.name
    is_code = path.suffix.lower() in CODE_EXTENSIONS
    for pattern in _SENSITIVE_PATTERNS:
        if pattern is _SENSITIVE_KEYWORD_PATTERN and is_code:
            continue
        if pattern.search(name):
            return True
    return False
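The function above depends on `CODE_EXTENSIONS` and `_SENSITIVE_PATTERNS`, which are defined elsewhere in graphify/detect.py. For experimenting outside the repository, here is a self-contained sketch with small stand-ins for both; the stand-in values are assumptions and much shorter than the real definitions.

```python
import re
from pathlib import Path

# Assumption: a small stand-in for graphify's real CODE_EXTENSIONS set.
CODE_EXTENSIONS = {".ts", ".py", ".js", ".java", ".svelte"}

# Keyword pattern as proposed in this PR.
_SENSITIVE_KEYWORD_PATTERN = re.compile(
    r'\b(credential|secret|passwd|password|token|private_key)s?\b',
    re.IGNORECASE,
)

# Assumption: two illustrative structural patterns; the real list is longer.
_SENSITIVE_PATTERNS = [
    re.compile(r'^\.env'),                           # .env, .env.local, .envrc
    re.compile(r'\.(pem|key|p12)$', re.IGNORECASE),  # key material by extension
    _SENSITIVE_KEYWORD_PATTERN,                      # same object as the constant
]


def _is_sensitive(path: Path) -> bool:
    name = path.name
    is_code = path.suffix.lower() in CODE_EXTENSIONS
    for pattern in _SENSITIVE_PATTERNS:
        # Identity check: only the generic keyword pattern is skipped for code.
        if pattern is _SENSITIVE_KEYWORD_PATTERN and is_code:
            continue
        if pattern.search(name):
            return True
    return False


results = {
    name: _is_sensitive(Path(name))
    for name in [
        "src/password-reset.ts",
        "config/secrets.json",
        "foo.ts.pem",
        ".env.local",
        "secretary-notes.txt",
    ]
}
for name, flagged in results.items():
    print(f"{name}: {'flagged' if flagged else 'indexed'}")
```

The `is` comparison only works if the list holds the very same compiled object as the named constant, which is why the pattern is pulled out rather than compiled inline.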

Coverage matrix

File	Before	After
`password-reset.ts` (code)	dropped	indexed
`AuthOauthAccessToken.model.ts` (code)	dropped	indexed
`tokenizer.ts` (code, substring match)	dropped	indexed
`JwtTokenValidator.java` (code)	dropped	indexed
`secrets.json` (data, plural keyword)	flagged	flagged
`database-credentials.yml` (data, plural)	flagged	flagged
`password.txt` (data)	flagged	flagged
`secretary-notes.txt` (data, substring)	dropped (false positive)	indexed
`tokenizer-config.json` (data, substring)	dropped (false positive)	indexed
`.env`, `.env.local`, `.envrc`	flagged	flagged
`.pem`, `.key`, `.p12`, `.cert`	flagged	flagged
`id_rsa`, `id_ed25519.pub`	flagged	flagged
`.netrc`, `.pgpass`, `.htpasswd`	flagged	flagged
`aws_credentials`, `gcloud_credentials.json`	flagged	flagged
`foo.ts.pem` (mixed)	flagged	flagged (`.pem` is structural)

Tests (26 new, all passing)

tests/test_is_sensitive.py:

Source-code exemption (8): legitimate .ts/.py/.js/.java/.svelte/.d.ts files with auth-related names pass through.
Specific-pattern preservation (12): .env, .env.local, .pem, .p12, .cert, id_rsa, id_ed25519.pub, .netrc, .pgpass, .htpasswd, aws_credentials, gcloud_credentials.json still flagged.
Data-file keyword still flagged (4): secrets.json, database-credentials.yml, password.txt, api-token.json.
Word-boundary correctness on data files (4): substring-but-not-real-word cases pass through.
Mixed structural cases (2): code-extension exemption is keyword-pattern-only and doesn't override structural patterns like .pem or .env.

13 of 26 fail on unpatched v7 — real regression guards. The other 13 verify pre-existing flag behavior is preserved.

Validation on a real codebase

A 1,873-file SvelteKit app:

3 source files previously silently dropped (password-reset.ts, AuthOauthAccessToken.model.ts, test.search-tokenizer.ts) are now extracted and contribute their outgoing import edges.
Isolated .ts file count: 32 → 29.
Total nodes: +6 (the new file nodes plus their per-symbol nodes).
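The comparison that surfaced the missing files can be approximated as a set difference between candidate paths and indexed paths. This is a sketch only: `silently_dropped` is a hypothetical helper, and the two input lists stand in for whatever the file walk and the manifest actually provide.

```python
from pathlib import PurePosixPath


def silently_dropped(candidates, indexed, code_extensions):
    """Return code files that were candidates but never reached the index."""
    indexed_set = set(indexed)
    return sorted(
        p for p in candidates
        if p not in indexed_set
        and PurePosixPath(p).suffix.lower() in code_extensions
    )


# Hypothetical inputs: paths found on disk vs. paths present in the graph.
on_disk = ["src/password-reset.ts", "src/app.ts", ".env"]
in_graph = ["src/app.ts"]

dropped = silently_dropped(on_disk, in_graph, {".ts", ".py"})
print(dropped)
```

Restricting the result to code extensions keeps intentionally excluded files such as `.env` out of the report.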

Test plan

All 26 new tests pass
13 of 26 fail on unpatched v7 (regression-guard quality)
Full test suite: 575 pass, 7 pre-existing failures unrelated
Smoke test on real 1,873-file SvelteKit project — 3 previously-dropped files now in graph

The generic keyword filter ('credential|secret|passwd|password|token|private_key') currently substring-matches against every filename regardless of extension, silently dropping legitimate source-code files from the graph when their name happens to contain one of those words:

  password-reset.ts - email-template helper
  AuthOauthAccessToken.model.ts - Sequelize model
  test.search-tokenizer.ts - test fixture
  JwtTokenValidator.java - auth code
  password_manager.py - manager class
  access-token.svelte - UI component

Source code files are CODE, not credential storage. If a project commits real secrets into source files, that's a different security problem and this filter is the wrong line of defense.

Three changes:

1. Pull the keyword pattern out into a named constant so _is_sensitive() can reference it specifically.

2. _is_sensitive() now skips the keyword pattern when the file's extension is in CODE_EXTENSIONS. The other patterns (.env*, .pem/.key crypto extensions, id_rsa SSH keys, .netrc/.pgpass/.htpasswd, aws_credentials/etc.) target real credential storage by extension or exact name and continue to apply to all files regardless of code/data classification.

3. Word boundaries on the keyword pattern (`\b...s?\b`) so substring matches inside larger words ('tokenizer', 'secretary', 'passwordless', 'credentialing', 'AccessToken') don't trip the filter on data files either. Optional `s?` covers plurals so canonical secret-storage filenames like 'secrets.json' and 'database-credentials.yml' remain flagged.

Tests
-----

26 new tests in tests/test_is_sensitive.py:

- 8 source-code files with auth keywords (.ts/.py/.js/.java/.svelte/.d.ts) that previously were dropped now pass through.
- 12 specific-pattern checks (.env, .pem, id_rsa, .netrc, .pgpass, aws_credentials, etc.) confirm those still flag as sensitive.
- 4 word-boundary edge cases confirm 'secretary', 'tokenizer', 'passwordless', 'credentialing' don't false-positive on non-code files.
- 4 data-file checks confirm 'secrets.json', 'credentials.yml', 'password.txt', 'api-token.json' still flag.
- 2 mixed cases confirm a code-extension exemption doesn't bypass the .pem / .env-style structural patterns.

13 of 26 fail against unpatched v7 (real regression guards); the other 13 test pre-existing flag behavior that this PR preserves.

Validation
----------

On a real 1,873-file SvelteKit codebase: 3 files previously dropped from the graph (password-reset.ts, AuthOauthAccessToken.model.ts, test.search-tokenizer.ts) are now extracted and contribute their outgoing import edges. Isolated .ts file count drops from 32 to 29. The remaining isolated files are CLI entry points, test helpers loaded by name from spec files, framework-loaded SvelteKit `+page.server.ts` files, and root configs - all legitimately not imported by name.

jippi wants to merge 1 commit into safishamsi:v7 from jippi:fix/is-sensitive-source-files
