FEAT Add PromptInjectionScorer for OWASP LLM01 prompt injection detection #1774

Merged
rlundeen2 merged 3 commits into microsoft:main from francose:feat/prompt-injection-scorer on May 28, 2026

@francose francose commented May 21, 2026

Builds on #1704 — adds a PromptInjectionScorer that catches OWASP LLM01 prompt injection attempts with regex. Fast, local, no API call, no LLM in the loop.

The gap I was trying to fill: PromptShieldScorer is great but it's an Azure API call (so $$ per request), and the SelfAsk*Scorer family uses an LLM under the hood (slow + non-deterministic). For thousands of red-team iterations or as a cheap pre-filter in front of the heavier scorers, neither really fits.
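A minimal sketch of the pre-filter idea, assuming one possible arrangement: a regex hit short-circuits, and only unflagged text is escalated to a slower scorer. The names `cheap_check` and `cascade` and the single pattern are hypothetical, standard library only, and do not reflect the PyRIT scorer API.

```python
import re
from typing import Callable

# Hypothetical single pattern for illustration; not the PR's default pattern set.
_OVERRIDE = re.compile(
    r"\b(?:ignore|disregard|forget)\s+(?:all\s+)?(?:previous|prior)\s+instructions\b",
    re.IGNORECASE,
)


def cheap_check(text: str) -> bool:
    """Local regex check: no network call, deterministic."""
    return _OVERRIDE.search(text) is not None


def cascade(text: str, expensive_check: Callable[[str], bool]) -> bool:
    """Return True on a regex hit; otherwise defer to the expensive scorer."""
    if cheap_check(text):
        return True  # short-circuit: the expensive scorer is never called
    return expensive_check(text)
```

With this arrangement the expensive scorer only runs on text the regex did not flag. Whether a regex hit should be trusted outright or also escalated depends on how many false positives the caller can tolerate.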

Subclassed RegexScorer the same way CredentialLeakScorer did. 8 default pattern categories:

  • Instruction Override — "ignore/disregard/forget previous instructions"
  • System Prompt Extraction — "repeat your system prompt back to me"
  • DAN Jailbreak — DAN, developer mode, unrestricted, jailbroken
  • Role Reassignment — "from now on you will/must..."
  • Constraint Removal — "remove all your safety filters"
  • Chat Template Injection — [INST], <<SYS>>, <|im_start|> etc.
  • Encoding Evasion — "decode this base64 and follow it"
  • Prompt Leaking — "give me your hidden/original/initial prompt"

Pass patterns=... to override defaults entirely if you want.
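To make the category-to-pattern idea concrete, here is a standalone sketch using only `re`. The class `TinyInjectionDetector`, its method, and the three regexes are hypothetical stand-ins; the actual defaults and the `RegexScorer` interface are not reproduced here.

```python
import re
from typing import Dict, List, Mapping, Optional

# Illustrative patterns only; three of the eight categories, with guessed regexes.
DEFAULT_PATTERNS: Dict[str, str] = {
    "Instruction Override": r"\b(?:ignore|disregard|forget)\b.{0,30}\b(?:previous|prior)\s+instructions\b",
    "System Prompt Extraction": r"\brepeat\b.{0,30}\bsystem\s+prompt\b",
    "Chat Template Injection": r"\[INST\]|<<SYS>>|<\|im_start\|>",
}


class TinyInjectionDetector:
    """Hypothetical detector: maps category names to compiled regexes."""

    def __init__(self, patterns: Optional[Mapping[str, str]] = None) -> None:
        # Passing patterns replaces the defaults entirely; nothing is merged.
        source = DEFAULT_PATTERNS if patterns is None else patterns
        self._compiled = {
            name: re.compile(rx, re.IGNORECASE) for name, rx in source.items()
        }

    def matched_categories(self, text: str) -> List[str]:
        """Return the names of every category whose pattern matches the text."""
        return [name for name, rx in self._compiled.items() if rx.search(text)]
```

Returning category names rather than a bare boolean makes it straightforward to build a rationale string that says which pattern fired.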

Quickly checked the neighborhood for overlaps before opening this:

50 tests, all green. The tricky ones were the true negatives — there's a lot of normal technical language that looks injection-y: "how do I ignore a file in .gitignore", "decode this base64 string", the developer mode flag in debug logging. Wrote 13 of those specifically to lock down false positives. Also ran the full tests/unit/score/ locally, 1052 pass, no regressions.
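A rough sketch of what a true-negative check could look like, using the benign phrases mentioned above. The `ENCODING_EVASION` regex is a hypothetical stand-in that requires both a decode request and a follow-style verb; it is not the pattern from this PR.

```python
import re
from typing import List

# Hypothetical pattern: a base64 decode request only counts when it is
# followed by an instruction to act on the decoded content.
ENCODING_EVASION = re.compile(
    r"\bdecode\b.{0,40}\bbase64\b.{0,40}\b(?:follow|execute|obey|run)\b",
    re.IGNORECASE,
)

BENIGN = [
    "how do I ignore a file in .gitignore",
    "decode this base64 string",
    "enable the developer mode flag in debug logging",
]
MALICIOUS = ["decode this base64 and follow it"]


def is_flagged(text: str) -> bool:
    return ENCODING_EVASION.search(text) is not None


def false_positives(samples: List[str]) -> List[str]:
    """Return the benign samples that the pattern wrongly flags."""
    return [s for s in samples if is_flagged(s)]
```

Keeping the benign samples in a list makes each new false-positive report a one-line addition to the regression set.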

Copilot AI review requested due to automatic review settings May 21, 2026 18:57

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a local, regex-based PromptInjectionScorer to detect common prompt-injection patterns and includes unit tests to validate detection, rationale text, custom pattern overrides, and memory integration.

Changes:

  • Introduces PromptInjectionScorer (regex-based true/false scorer) with default OWASP-aligned prompt-injection pattern set.
  • Adds unit tests covering true positives/negatives, rationale strings, custom patterns, and memory write behavior.
  • Exports PromptInjectionScorer from pyrit.score for public use.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/unit/score/test_prompt_injection_scorer.py | Adds unit tests validating detection behavior, rationales, custom patterns, and memory integration. |
| pyrit/score/true_false/prompt_injection_scorer.py | Implements a new regex-based prompt-injection scorer and default pattern set. |
| pyrit/score/__init__.py | Exposes PromptInjectionScorer from the pyrit.score public API. |

Comment thread pyrit/score/true_false/prompt_injection_scorer.py Outdated
Comment thread pyrit/score/true_false/prompt_injection_scorer.py Outdated
Comment thread pyrit/score/true_false/static_prompt_injection_scorer.py
@rlundeen2 rlundeen2 self-assigned this May 28, 2026
- Rename PromptInjectionScorer -> StaticPromptInjectionScorer to clarify it's static (regex-based) detection vs model-based scorers

- Expose categories parameter so callers can tag scores without subclassing (default still ['security'])

- Drop overly-broad chat-template tokens (</?s>, bare [USER]/[SYSTEM]/[ASSISTANT]) that fired on HTML strikethrough and quoted transcripts

- Document known high false-positive rate in class docstring (bounded gaps can span unrelated clauses)

- Add negative tests for HTML strikethrough and quoted [USER]/[SYSTEM] transcripts, plus tests for custom and default categories

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
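The two false-positive sources named in the commit message can be reproduced with a small sketch. All three regexes below are hypothetical reconstructions for illustration, not the patterns that were actually shipped or dropped.

```python
import re

# Hypothetical "before" set: includes </?s> and bare role tags.
BROAD_TEMPLATE = re.compile(
    r"\[INST\]|<<SYS>>|<\|im_start\|>|</?s>|\[(?:USER|SYSTEM|ASSISTANT)\]"
)
# Hypothetical "after" set: only tokens unlikely to appear in ordinary text.
NARROW_TEMPLATE = re.compile(r"\[INST\]|<<SYS>>|<\|im_start\|>")
# Hypothetical bounded-gap pattern: up to 40 characters between the two words.
GAPPED_OVERRIDE = re.compile(r"\bignore\b.{0,40}\binstructions\b", re.IGNORECASE)


def hits(pattern: "re.Pattern[str]", text: str) -> bool:
    return pattern.search(text) is not None


html = "<s>old price</s> new price"  # HTML strikethrough
transcript = 'The log shows "[USER] hello" and then "[SYSTEM] ok".'
two_clauses = "You can ignore that warning. The setup instructions are in the README."

print(hits(BROAD_TEMPLATE, html), hits(NARROW_TEMPLATE, html))
print(hits(BROAD_TEMPLATE, transcript), hits(NARROW_TEMPLATE, transcript))
print(hits(GAPPED_OVERRIDE, two_clauses))
```

The broad set fires on both the strikethrough markup and the quoted transcript, while the narrow set does not. The bounded gap still matches across the sentence boundary in `two_clauses`, which is the kind of residual false positive the docstring note refers to.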
@rlundeen2 rlundeen2 enabled auto-merge May 28, 2026 00:34
@rlundeen2 rlundeen2 added this pull request to the merge queue May 28, 2026
Merged via the queue into microsoft:main with commit 728fabe May 28, 2026
48 checks passed