[FEAT]: Added LLMJudge Evaluator#53
Open
bashirpartovi wants to merge 5 commits into
Open
Conversation
spencrr
previously approved these changes
May 20, 2026
3 tasks
nina-msft
reviewed
May 20, 2026
nina-msft
reviewed
May 20, 2026
nina-msft
reviewed
May 20, 2026
spencrr
previously approved these changes
May 21, 2026
spencrr
approved these changes
May 22, 2026
Comment on lines
+26
to
+27
| Icon | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add
LLMJudge- Semantic Evaluator for Language-Level Attack OutcomesSummary
This PR adds
LLMJudge, a first-class evaluator for cases where attack success cannot be determined by tool calls, side effects, or string matching alone. It gives RAMPART a built-in way to answer semantic questions such as whether an agent disclosed sensitive information, followed an injected instruction in substance, or revealed capabilities it should not have mentioned.The goal is not to replace deterministic evaluators. It is to close the gap they cannot cover and make semantic judgment a normal part of evaluator composition rather than a separate workflow.
Why this change
RAMPART already does well with crisp, mechanical signals. If the right question is "did a tool run?" or "did this exact text appear?" then deterministic evaluators are the right abstraction.
The problem is that many safety outcomes are not mechanical. XPIA-style tests often hinge on meaning, intent, and context across turns. Teams end up with brittle regexes, ad hoc one-off judges, or manual review precisely where they need the most confidence.
LLMJudgeexists to make those semantic checks explicit and reusable while preserving the discipline of the existing evaluation model. Deterministic evaluators stay the first line of defense; the judge handles the residual cases that actually require reasoning.Design at a glance
Architecturally, the judge is just another evaluator. It receives an
EvalContext, renders a constrained prompt around the relevant transcript, sends that request through the existing PyRIT path, and converts the structured response back into a normalEvalResult.That decision matters. By keeping the feature inside the evaluator abstraction, RAMPART does not need a second composition model, a second result type, or special execution plumbing just for semantic checks. Teams can combine deterministic and semantic signals in one expression and keep cheap, reliable checks ahead of the LLM when they are sufficient.
flowchart LR A[Attack execution produces EvalContext] --> B{Evaluator composition} B -->|Deterministic signal is enough| C[Return result] B -->|Semantic judgment is needed| D[LLMJudge] D --> E[Render objective plus selected transcript] E --> F[Send through PyRIT PromptNormalizer and target] F --> G[Judge model returns structured verdict] G --> H[Map verdict to EvalResult] H --> I[Compose with other evaluator results]Rationale behind the shape of the API
The public surface stays intentionally small because the common case should be simple: define what you want to detect and provide a judge model. Everything else is secondary.
At the same time, the design leaves room for advanced integrations. Teams that already have a configured PyRIT target, a custom provider, or a test double can enter through that path directly instead of being forced through one provider-specific constructor shape. This keeps the default path lightweight without making the feature closed to real deployment environments.
Transcript scope is also configurable, but only in the two ways that matter in practice: judge the full interaction or judge only the latest turn. That is enough to support both cumulative behaviors and "did the last answer cross the line?" style checks without introducing a large transcript-slicing policy surface.
Reliability model
The failure behavior is designed around how test authors actually debug systems.
If the judge cannot run because the environment is misconfigured, authentication is broken, or the target is unavailable, that should surface as an actual execution error. Those are setup problems and should be visible immediately.
If the judge runs but the model is flaky, rate-limited, empty, or temporarily returns malformed structured output, the evaluator degrades to
UNDETERMINEDrather than crashing the whole evaluation path. That keeps semantic judging compatible with RAMPART's trinary outcome model and allows composed evaluators to keep producing useful results even when the LLM is imperfect.Structured output is a key part of that design. The judge does not return freeform prose to the framework. It returns a constrained verdict that can be mapped back into the normal evaluator result shape, including outcome, confidence, rationale, and evidence. That makes the result both machine-usable and explainable to developers when a verdict is surprising.
Security and observability
An LLM judge is evaluating attacker-influenced content, so the transcript must be treated as data rather than as instructions. The prompt construction keeps the trust boundary under framework control, and rendered transcripts omit raw attachment payload content so attacker-controlled blobs do not get a direct path into the judging prompt.
This does not eliminate prompt-injection risk; no LLM judge can honestly claim that. What it does is centralize the defense in the framework instead of leaving every caller to rediscover the same boundary on their own.
Judge requests also travel through the existing PyRIT normalization path. That was deliberate for two reasons: it keeps provider behavior consistent with the rest of the stack, and it preserves the existing observability story. When a verdict looks wrong, the team can inspect the normalized interaction and raw model response using the same debugging path already used for other model-backed components.
Why this is a good fit for RAMPART
The main benefit of this design is not just that RAMPART can now do semantic judging. It is that semantic judging fits naturally into the framework RAMPART already has.
Teams can keep deterministic evaluators for crisp signals, use
LLMJudgewhere intent and context matter, and compose both without changing how tests are authored or how results are interpreted. That is the key reason to ship this as an evaluator rather than as a parallel scoring subsystem.In practice, this unlocks a class of safety assertions that were previously awkward or brittle: disclosure in substance rather than exact phrase, policy violation despite compliant-sounding wording, and injection success that is visible only when the full exchange is interpreted as a whole.
Follow-up direction
This PR intentionally stops at the evaluator form. A natural follow-up is scorer-backed construction, where a future factory or adapter can wrap PyRIT scorers and expose them as RAMPART evaluators. That is adjacent to this design, but it has different lifecycle and output-translation concerns, so it is cleaner as follow-up work rather than part of the initial abstraction.
Integration Tests