FEAT Add PromptInjectionScorer for OWASP LLM01 prompt injection detection #1774
Merged
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a local, regex-based PromptInjectionScorer to detect common prompt-injection patterns and includes unit tests to validate detection, rationale text, custom pattern overrides, and memory integration.
Changes:
- Introduces `PromptInjectionScorer` (regex-based true/false scorer) with a default OWASP-aligned prompt-injection pattern set.
- Adds unit tests covering true positives/negatives, rationale strings, custom patterns, and memory write behavior.
- Exports `PromptInjectionScorer` from `pyrit.score` for public use.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/score/test_prompt_injection_scorer.py | Adds unit tests validating detection behavior, rationales, custom patterns, and memory integration. |
| pyrit/score/true_false/prompt_injection_scorer.py | Implements a new regex-based prompt-injection scorer and default pattern set. |
| pyrit/score/__init__.py | Exposes PromptInjectionScorer from the pyrit.score public API. |
…hat template tokens
rlundeen2 approved these changes on May 28, 2026
- Rename `PromptInjectionScorer` -> `StaticPromptInjectionScorer` to clarify it's static (regex-based) detection vs model-based scorers
- Expose `categories` parameter so callers can tag scores without subclassing (default still `['security']`)
- Drop overly-broad chat-template tokens (`</?s>`, bare `[USER]`/`[SYSTEM]`/`[ASSISTANT]`) that fired on HTML strikethrough and quoted transcripts
- Document known high false-positive rate in class docstring (bounded gaps can span unrelated clauses)
- Add negative tests for HTML strikethrough and quoted `[USER]`/`[SYSTEM]` transcripts, plus tests for custom and default categories

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
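
The two false-positive sources named in the commit message can be reproduced with plain `re`. This is only a sketch: `</?s>` is the token pattern quoted in the commit message, while `BOUNDED_GAP` and both sample strings are hypothetical, written here to illustrate the failure mode rather than copied from the scorer.

```python
import re

# Token pattern quoted in the commit message as overly broad.
SENTENCE_TOKEN = re.compile(r"</?s>")

# Hypothetical bounded-gap pattern (an assumption, not the scorer's own):
# "ignore", then up to 40 arbitrary characters, then "instructions".
BOUNDED_GAP = re.compile(r"ignore.{0,40}instructions", re.IGNORECASE)

# Benign HTML that uses <s> for strikethrough.
html = "The price is <s>$20</s> $15."

# Benign sentence whose two clauses are unrelated to each other.
benign = "You can ignore the warning; the setup instructions are in the README."

# Both benign inputs are flagged, which is the behavior the commit addresses.
strikethrough_hit = SENTENCE_TOKEN.search(html) is not None
gap_hit = BOUNDED_GAP.search(benign) is not None

print(strikethrough_hit, gap_hit)
```

Dropping the token pattern removes the first false positive outright. The second is inherent to bounded gaps, which is presumably why the commit documents it in the docstring instead of trying to eliminate it.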
Builds on #1704: adds a `PromptInjectionScorer` that catches OWASP LLM01 prompt injection attempts with regex. Fast, local, no API call, no LLM in the loop.

The gap I was trying to fill: `PromptShieldScorer` is great but it's an Azure API call (so $$ per request), and the `SelfAsk*Scorer` family uses an LLM under the hood (slow + non-deterministic). For thousands of red-team iterations or as a cheap pre-filter in front of the heavier scorers, neither really fits.

Subclassed `RegexScorer` the same way `CredentialLeakScorer` did. 8 default pattern categories, including chat-template tokens (`[INST]`, `<<SYS>>`, `<|im_start|>`, etc.).

Pass `patterns=...` to override defaults entirely if you want.

Quickly checked the neighborhood for overlaps before opening this: `PromptShieldScorer` and `MarkdownInjectionScorer` are different mechanisms / scope.

50 tests, all green. The tricky ones were the true negatives, since a lot of normal technical language looks injection-y: "how do I ignore a file in .gitignore", "decode this base64 string", the developer mode flag in debug logging. Wrote 13 of those specifically to lock down false positives. Also ran the full `tests/unit/score/` locally, 1052 pass, no regressions.
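
As a rough illustration of the approach described above, here is a standalone sketch using only the standard library. It deliberately does not use PyRIT's `RegexScorer` API. The names `ILLUSTRATIVE_PATTERNS`, `find_injection`, and `score_with_prefilter` are hypothetical, and the two patterns are stand-ins, not the PR's eight default categories. The pre-filter function shows one possible way to put the regex check in front of a slower scorer.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical patterns for illustration only; NOT the PR's default set.
ILLUSTRATIVE_PATTERNS: Dict[str, str] = {
    "instruction_override": (
        r"\bignore\s+(?:all\s+)?(?:previous|prior|above)\s+instructions\b"
    ),
    "chat_template_token": r"\[INST\]|<<SYS>>|<\|im_start\|>",
}


def find_injection(
    text: str, patterns: Optional[Dict[str, str]] = None
) -> Optional[str]:
    """Return the name of the first matching pattern, or None if none match.

    Passing `patterns` replaces the defaults entirely, mirroring the
    override behavior described in the PR text.
    """
    active = ILLUSTRATIVE_PATTERNS if patterns is None else patterns
    for name, pattern in active.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return name
    return None


def score_with_prefilter(text: str, slow_scorer: Callable[[str], bool]) -> bool:
    """Flag immediately on a regex hit; otherwise defer to the slower scorer.

    `slow_scorer` stands in for an API- or LLM-backed scorer (an assumption).
    """
    if find_injection(text) is not None:
        return True
    return slow_scorer(text)
```

With these stand-in patterns, "Please ignore all previous instructions" is flagged as `instruction_override`, while "how do I ignore a file in .gitignore" and "decode this base64 string" pass through unflagged, which is the kind of true negative the PR's tests are meant to lock down.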