Skip to content

[FEAT]: Added LLMJudge Evaluator#53

Open
bashirpartovi wants to merge 5 commits into
microsoft:mainfrom
bashirpartovi:dev/bashirpartovi/llmjudge
Open

[FEAT]: Added LLMJudge Evaluator#53
bashirpartovi wants to merge 5 commits into
microsoft:mainfrom
bashirpartovi:dev/bashirpartovi/llmjudge

Conversation

@bashirpartovi
Copy link
Copy Markdown
Contributor

@bashirpartovi bashirpartovi commented May 20, 2026

Add LLMJudge - Semantic Evaluator for Language-Level Attack Outcomes

Summary

This PR adds LLMJudge, a first-class evaluator for cases where attack success cannot be determined by tool calls, side effects, or string matching alone. It gives RAMPART a built-in way to answer semantic questions such as whether an agent disclosed sensitive information, followed an injected instruction in substance, or revealed capabilities it should not have mentioned.

The goal is not to replace deterministic evaluators. It is to close the gap they cannot cover and make semantic judgment a normal part of evaluator composition rather than a separate workflow.

Why this change

RAMPART already does well with crisp, mechanical signals. If the right question is "did a tool run?" or "did this exact text appear?" then deterministic evaluators are the right abstraction.

The problem is that many safety outcomes are not mechanical. XPIA-style tests often hinge on meaning, intent, and context across turns. Teams end up with brittle regexes, ad hoc one-off judges, or manual review precisely where they need the most confidence.

LLMJudge exists to make those semantic checks explicit and reusable while preserving the discipline of the existing evaluation model. Deterministic evaluators stay the first line of defense; the judge handles the residual cases that actually require reasoning.

Design at a glance

Architecturally, the judge is just another evaluator. It receives an EvalContext, renders a constrained prompt around the relevant transcript, sends that request through the existing PyRIT path, and converts the structured response back into a normal EvalResult.

That decision matters. By keeping the feature inside the evaluator abstraction, RAMPART does not need a second composition model, a second result type, or special execution plumbing just for semantic checks. Teams can combine deterministic and semantic signals in one expression and keep cheap, reliable checks ahead of the LLM when they are sufficient.

flowchart LR
    A[Attack execution produces EvalContext] --> B{Evaluator composition}
    B -->|Deterministic signal is enough| C[Return result]
    B -->|Semantic judgment is needed| D[LLMJudge]
    D --> E[Render objective plus selected transcript]
    E --> F[Send through PyRIT PromptNormalizer and target]
    F --> G[Judge model returns structured verdict]
    G --> H[Map verdict to EvalResult]
    H --> I[Compose with other evaluator results]
Loading

Rationale behind the shape of the API

The public surface stays intentionally small because the common case should be simple: define what you want to detect and provide a judge model. Everything else is secondary.

At the same time, the design leaves room for advanced integrations. Teams that already have a configured PyRIT target, a custom provider, or a test double can enter through that path directly instead of being forced through one provider-specific constructor shape. This keeps the default path lightweight without making the feature closed to real deployment environments.

Transcript scope is also configurable, but only in the two ways that matter in practice: judge the full interaction or judge only the latest turn. That is enough to support both cumulative behaviors and "did the last answer cross the line?" style checks without introducing a large transcript-slicing policy surface.

Reliability model

The failure behavior is designed around how test authors actually debug systems.

If the judge cannot run because the environment is misconfigured, authentication is broken, or the target is unavailable, that should surface as an actual execution error. Those are setup problems and should be visible immediately.

If the judge runs but the model is flaky, rate-limited, empty, or temporarily returns malformed structured output, the evaluator degrades to UNDETERMINED rather than crashing the whole evaluation path. That keeps semantic judging compatible with RAMPART's trinary outcome model and allows composed evaluators to keep producing useful results even when the LLM is imperfect.

Structured output is a key part of that design. The judge does not return freeform prose to the framework. It returns a constrained verdict that can be mapped back into the normal evaluator result shape, including outcome, confidence, rationale, and evidence. That makes the result both machine-usable and explainable to developers when a verdict is surprising.

Security and observability

An LLM judge is evaluating attacker-influenced content, so the transcript must be treated as data rather than as instructions. The prompt construction keeps the trust boundary under framework control, and rendered transcripts omit raw attachment payload content so attacker-controlled blobs do not get a direct path into the judging prompt.

This does not eliminate prompt-injection risk; no LLM judge can honestly claim that. What it does is centralize the defense in the framework instead of leaving every caller to rediscover the same boundary on their own.

Judge requests also travel through the existing PyRIT normalization path. That was deliberate for two reasons: it keeps provider behavior consistent with the rest of the stack, and it preserves the existing observability story. When a verdict looks wrong, the team can inspect the normalized interaction and raw model response using the same debugging path already used for other model-backed components.

Why this is a good fit for RAMPART

The main benefit of this design is not just that RAMPART can now do semantic judging. It is that semantic judging fits naturally into the framework RAMPART already has.

Teams can keep deterministic evaluators for crisp signals, use LLMJudge where intent and context matter, and compose both without changing how tests are authored or how results are interpreted. That is the key reason to ship this as an evaluator rather than as a parallel scoring subsystem.

In practice, this unlocks a class of safety assertions that were previously awkward or brittle: disclosure in substance rather than exact phrase, policy violation despite compliant-sounding wording, and injection success that is visible only when the full exchange is interpreted as a whole.

Follow-up direction

This PR intentionally stops at the evaluator form. A natural follow-up is scorer-backed construction, where a future factory or adapter can wrap PyRIT scorers and expose them as RAMPART evaluators. That is adjacent to this design, but it has different lifecycle and output-translation concerns, so it is cleaner as follow-up work rather than part of the initial abstraction.

Integration Tests

image

@bashirpartovi bashirpartovi requested a review from a team May 20, 2026 19:03
Comment thread tests/unit/evaluators/test_llm_judge.py Fixed
spencrr
spencrr previously approved these changes May 20, 2026
Comment thread docs/usage/authoring-tests.md
Comment thread docs/usage/authoring-tests.md Outdated
Comment thread rampart/common/templates.py
Comment thread rampart/core/errors.py
Comment thread rampart/common/templates.py Outdated
Comment thread rampart/evaluators/prompts/llm_judge.yaml
Comment thread rampart/evaluators/llm_judge.py
Comment thread rampart/evaluators/llm_judge.py
Comment thread rampart/evaluators/llm_judge.py
Comment thread rampart/evaluators/llm_judge.py
@spencrr spencrr mentioned this pull request May 20, 2026
3 tasks
Comment thread rampart/evaluators/llm_judge.py
Comment thread docs/usage/authoring-tests.md Outdated
Comment thread docs/usage/authoring-tests.md Outdated
spencrr
spencrr previously approved these changes May 21, 2026
Comment thread docs/usage/authoring-tests.md
Comment thread rampart/evaluators/llm_judge.py
Comment thread .gitignore
Comment on lines +26 to +27
Icon

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants