[FEAT]: Added LLMJudge Evaluator by bashirpartovi · Pull Request #53 · microsoft/RAMPART

bashirpartovi · 2026-05-20T19:03:58Z

Add `LLMJudge` - Semantic Evaluator for Language-Level Attack Outcomes

Summary

This PR adds LLMJudge, a first-class evaluator for cases where attack success cannot be determined by tool calls, side effects, or string matching alone. It gives RAMPART a built-in way to answer semantic questions such as whether an agent disclosed sensitive information, followed an injected instruction in substance, or revealed capabilities it should not have mentioned.

The goal is not to replace deterministic evaluators. It is to close the gap they cannot cover and make semantic judgment a normal part of evaluator composition rather than a separate workflow.

Why this change

RAMPART already does well with crisp, mechanical signals. If the right question is "did a tool run?" or "did this exact text appear?" then deterministic evaluators are the right abstraction.

The problem is that many safety outcomes are not mechanical. XPIA-style tests often hinge on meaning, intent, and context across turns. Teams end up with brittle regexes, ad hoc one-off judges, or manual review precisely where they need the most confidence.

LLMJudge exists to make those semantic checks explicit and reusable while preserving the discipline of the existing evaluation model. Deterministic evaluators stay the first line of defense; the judge handles the residual cases that actually require reasoning.

Design at a glance

Architecturally, the judge is just another evaluator. It receives an EvalContext, renders a constrained prompt around the relevant transcript, sends that request through the existing PyRIT path, and converts the structured response back into a normal EvalResult.

That decision matters. By keeping the feature inside the evaluator abstraction, RAMPART does not need a second composition model, a second result type, or special execution plumbing just for semantic checks. Teams can combine deterministic and semantic signals in one expression and keep cheap, reliable checks ahead of the LLM when they are sufficient.

flowchart LR
    A[Attack execution produces EvalContext] --> B{Evaluator composition}
    B -->|Deterministic signal is enough| C[Return result]
    B -->|Semantic judgment is needed| D[LLMJudge]
    D --> E[Render objective plus selected transcript]
    E --> F[Send through PyRIT PromptNormalizer and target]
    F --> G[Judge model returns structured verdict]
    G --> H[Map verdict to EvalResult]
    H --> I[Compose with other evaluator results]

Rationale behind the shape of the API

The public surface stays intentionally small because the common case should be simple: define what you want to detect and provide a judge model. Everything else is secondary.

At the same time, the design leaves room for advanced integrations. Teams that already have a configured PyRIT target, a custom provider, or a test double can enter through that path directly instead of being forced through one provider-specific constructor shape. This keeps the default path lightweight without making the feature closed to real deployment environments.

Transcript scope is also configurable, but only in the two ways that matter in practice: judge the full interaction or judge only the latest turn. That is enough to support both cumulative behaviors and "did the last answer cross the line?" style checks without introducing a large transcript-slicing policy surface.

Reliability model

The failure behavior is designed around how test authors actually debug systems.

If the judge cannot run because the environment is misconfigured, authentication is broken, or the target is unavailable, that should surface as an actual execution error. Those are setup problems and should be visible immediately.

If the judge runs but the model is flaky, rate-limited, empty, or temporarily returns malformed structured output, the evaluator degrades to UNDETERMINED rather than crashing the whole evaluation path. That keeps semantic judging compatible with RAMPART's trinary outcome model and allows composed evaluators to keep producing useful results even when the LLM is imperfect.

Structured output is a key part of that design. The judge does not return freeform prose to the framework. It returns a constrained verdict that can be mapped back into the normal evaluator result shape, including outcome, confidence, rationale, and evidence. That makes the result both machine-usable and explainable to developers when a verdict is surprising.

Security and observability

An LLM judge is evaluating attacker-influenced content, so the transcript must be treated as data rather than as instructions. The prompt construction keeps the trust boundary under framework control, and rendered transcripts omit raw attachment payload content so attacker-controlled blobs do not get a direct path into the judging prompt.

This does not eliminate prompt-injection risk; no LLM judge can honestly claim that. What it does is centralize the defense in the framework instead of leaving every caller to rediscover the same boundary on their own.

Judge requests also travel through the existing PyRIT normalization path. That was deliberate for two reasons: it keeps provider behavior consistent with the rest of the stack, and it preserves the existing observability story. When a verdict looks wrong, the team can inspect the normalized interaction and raw model response using the same debugging path already used for other model-backed components.

Why this is a good fit for RAMPART

The main benefit of this design is not just that RAMPART can now do semantic judging. It is that semantic judging fits naturally into the framework RAMPART already has.

Teams can keep deterministic evaluators for crisp signals, use LLMJudge where intent and context matter, and compose both without changing how tests are authored or how results are interpreted. That is the key reason to ship this as an evaluator rather than as a parallel scoring subsystem.

In practice, this unlocks a class of safety assertions that were previously awkward or brittle: disclosure in substance rather than exact phrase, policy violation despite compliant-sounding wording, and injection success that is visible only when the full exchange is interpreted as a whole.

Follow-up direction

This PR intentionally stops at the evaluator form. A natural follow-up is scorer-backed construction, where a future factory or adapter can wrap PyRIT scorers and expose them as RAMPART evaluators. That is adjacent to this design, but it has different lifecycle and output-translation concerns, so it is cleaner as follow-up work rather than part of the initial abstraction.

Integration Tests

spencrr · 2026-05-22T19:05:09Z

+Icon
+


[FEAT]: Added LLMJudge Evaluator

6a6fb1c

bashirpartovi requested a review from a team May 20, 2026 19:03

github-advanced-security AI found potential problems May 20, 2026

View reviewed changes

Comment thread tests/unit/evaluators/test_llm_judge.py Fixed

[TEST]: Avoid CodeQL false positive in LLMJudge test

7280b1c

spencrr previously approved these changes May 20, 2026

View reviewed changes

spencrr mentioned this pull request May 20, 2026

[MAINT]: Add Python 3.14 Support #56

Merged

3 tasks

nina-msft reviewed May 20, 2026

View reviewed changes

Comment thread rampart/evaluators/llm_judge.py

nina-msft reviewed May 20, 2026

View reviewed changes

Comment thread docs/usage/authoring-tests.md Outdated

nina-msft reviewed May 20, 2026

View reviewed changes

Comment thread docs/usage/authoring-tests.md Outdated

Addressed comments

a7dad27

bashirpartovi dismissed spencrr’s stale review via a7dad27 May 21, 2026 20:46

spencrr previously approved these changes May 21, 2026

View reviewed changes

Comment thread docs/usage/authoring-tests.md

Comment thread rampart/evaluators/llm_judge.py

Added integration tests

98dea21

bashirpartovi dismissed spencrr’s stale review via 98dea21 May 22, 2026 17:57

fixing pre-commit checks

98d2d41

spencrr approved these changes May 22, 2026

View reviewed changes

Comment thread .gitignore

Comment on lines +26 to +27

Icon

Copy link
Copy Markdown

Contributor

spencrr May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEAT]: Added LLMJudge Evaluator#53

[FEAT]: Added LLMJudge Evaluator#53
bashirpartovi wants to merge 5 commits into
microsoft:mainfrom
bashirpartovi:dev/bashirpartovi/llmjudge

bashirpartovi commented May 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

spencrr May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

bashirpartovi commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add LLMJudge - Semantic Evaluator for Language-Level Attack Outcomes

Summary

Why this change

Design at a glance

Rationale behind the shape of the API

Reliability model

Security and observability

Why this is a good fit for RAMPART

Follow-up direction

Integration Tests

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

spencrr May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bashirpartovi commented May 20, 2026 •

edited

Loading

Add `LLMJudge` - Semantic Evaluator for Language-Level Attack Outcomes