| # | Task | Priority | Description | % Done | Reviewed | Notes |
|---|---|---|---|---|---|---|
| 0 | Pre-flight checks | P0 | Verify packages, MEAI IEvaluator availability, ScoreNormalizer | 100% | ✅ | Baseline builds clean, MEAI types available via Core |
| 1 | Package reference | P0 | Add Microsoft.Extensions.AI.Evaluation.Quality to MAF .csproj | 100% | ✅ | Added explicitly for clarity |
| 2 | Directory structure | P0 | Create src/AgentEval.MAF/Evaluators/ | 100% | ✅ | 6 source files + test dir |
| 3 | ConversationExtractor | P0 | Extract input/output/tools from ChatMessage[] | 100% | ✅ | 129 lines, 8 tests |
| 4 | AdditionalContextHelper | P0 | Extract RAG context, ground truth, expected tools from additionalContext | 100% | ✅ | 106 lines, 3 context subtypes, 10 direct tests |
| 5 | ResultConverter | P0 | MetricResult ↔ MEAI EvaluationResult, score normalization | 100% | ✅ | 66 lines, 10 tests (incl. Theory) |
| 6 | AgentEvalMetricAdapter | P0 | Single IMetric → MEAI IEvaluator adapter | 100% | ✅ | 69 lines, 8 tests (incl. ExpectedTools) |
| 7 | AgentEvalEvaluator | P0 | Composite N metrics → single MEAI IEvaluator | 100% | ✅ | 92 lines, 6 tests |
| 8 | AgentEvalEvaluators | P0 | Static factory: Quality(), RAG(), Safety(), Agentic(), Advanced() | 100% | ✅ | 125 lines, 5 tests |
| 9 | Unit tests | P0 | Tests for all 6 new classes | 100% | ✅ | 55 unit tests across 6 test files |
| 10 | Integration smoke test | P0 | End-to-end light path test without live LLM | 100% | ✅ | 4 integration tests, all passing |
| 11 | Full build + test | P0 | dotnet build + dotnet test — zero errors, zero regressions | 100% | ✅ | 0 warnings, 0 errors, 2578/2578 tests pass |
| 12 | Final review | P0 | Review all files, verify coverage, check for gaps | 100% | ✅ | 587 lines source, ~830 lines tests, doc aligned |
Six new production classes were added in src/AgentEval.MAF/Evaluators/ (587 lines total), enabling any AgentEval IMetric to be used as an MEAI IEvaluator inside MAF's evaluation orchestration:
| Class | Purpose | Key Responsibility |
|---|---|---|
| `ConversationExtractor` | Data extraction | Extracts input text (last-turn split, matching ADR-0020 default) and tool usage (`FunctionCallContent`/`FunctionResultContent` pairs) from `IEnumerable<ChatMessage>` + `ChatResponse` |
| `AdditionalContextHelper` | Context enrichment | Extracts RAG context, ground truth, and expected tool names from MEAI's `additionalContext` parameter via 3 custom subtypes (`AgentEvalRAGContext`, `AgentEvalGroundTruthContext`, `AgentEvalExpectedToolsContext`) |
| `ResultConverter` | Score translation | Converts AgentEval `MetricResult` (0–100 scale) → MEAI `NumericMetric` (1–5 scale) with `EvaluationMetricInterpretation` preserving the original score, rating, and pass/fail status |
| `AgentEvalMetricAdapter` | Single-metric adapter | Wraps one AgentEval `IMetric` as an MEAI `IEvaluator` — the core bridge class |
| `AgentEvalEvaluator` | Composite adapter | Bundles N `IMetric` instances into a single MEAI `IEvaluator` with graceful per-metric error handling |
| `AgentEvalEvaluators` | Public API factory | Static factory with preset bundles (`.Quality()`, `.RAG()`, `.Safety()`, `.Agentic()`, `.Advanced()`) and individual metric accessors — the developer-facing entry point |
Seven test files in tests/AgentEval.Tests/MAF/Evaluators/ (59 tests, ~830 lines) covering every public method, edge case, and the full end-to-end flow.
Only one file was modified: src/AgentEval.MAF/AgentEval.MAF.csproj — added an explicit <PackageReference Include="Microsoft.Extensions.AI.Evaluation.Quality" /> (the package was already transitively available via AgentEval.Core, but the explicit reference makes the dependency clear and intentional).
Zero changes to existing source files. All 7 existing MAF files (MAFAgentAdapter, MAFEvaluationHarness, MAFWorkflowEventBridge, MAFWorkflowAdapter, MAFGraphExtractor, MAFIdentifiableAgentAdapter, WorkflowEvaluationHarness) are untouched.
The actual MEAI v10.3.0 IEvaluator interface differed from what was initially assumed in the plan. All adaptations are documented in Section 17, Step 0.2. Key differences:
- Return type is `ValueTask<EvaluationResult>` (not `Task`)
- Requires an `IReadOnlyCollection<string> EvaluationMetricNames` property
- Uses `IEnumerable<ChatMessage>` (not `IList`)
- `ChatResponse` has `.Text` (not `.Message.Text`)
- The `EvaluationContext` base class requires constructor args — custom subtypes use constructors, not `required init` properties
- `NumericMetric` uses an `Interpretation` property with `EvaluationMetricInterpretation` — no `AddDiagnostic` method
Once MAF ships the AgentEvaluationExtensions from ADR-0020 (the agent.EvaluateAsync() orchestration layer), developers will be able to use AgentEval's 19 metrics directly inside MAF's evaluation pipeline with zero friction.
```csharp
using AgentEval.MAF.Evaluators;

// Create a MAF agent as usual
var agent = new ChatClientAgent(chatClient, new ChatClientAgentOptions
{
    Name = "TravelAgent",
    ChatOptions = new ChatOptions
    {
        Instructions = "You are a helpful travel assistant.",
        Tools = [AIFunctionFactory.Create(SearchFlights), AIFunctionFactory.Create(BookHotel)]
    }
});

// Evaluate with AgentEval quality metrics — one line
AgentEvaluationResults results = await agent.EvaluateAsync(
    queries: ["What's the weather in Seattle?", "Plan a weekend trip to Portland"],
    evaluator: AgentEvalEvaluators.Quality(judgeClient));

results.AssertAllPassed(); // throws if any metric fails

// Each item in results contains 4 AgentEval metrics:
// - llm_faithfulness (1-5 MEAI scale, with original 0-100 in interpretation)
// - llm_relevance
// - llm_coherence
// - llm_fluency
```

```csharp
using AgentEval.MAF.Evaluators;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Mix evaluators from three different providers in one call
IReadOnlyList<AgentEvaluationResults> results = await agent.EvaluateAsync(
    queries: ["Search for flights to Paris", "What restaurants are near the Eiffel Tower?"],
    evaluators: [
        // MEAI built-in (ships with Microsoft.Extensions.AI.Evaluation)
        new RelevanceEvaluator(),
        // AgentEval — safety bundle (toxicity, bias, misinformation)
        AgentEvalEvaluators.Safety(judgeClient),
        // AgentEval — agentic metrics with expected tools
        AgentEvalEvaluators.Agentic(expectedTools: ["SearchFlights", "SearchRestaurants"]),
        // Azure AI Foundry (cloud-based evaluation with dashboard)
        new FoundryEvals(projectClient, "gpt-4o"),
    ]);

// One result per evaluator — check each
foreach (var r in results)
    r.AssertAllPassed();
```

```csharp
using AgentEval.MAF.Evaluators;

// Pass ground truth and RAG context via additionalContext
AgentEvaluationResults results = await agent.EvaluateAsync(
    queries: ["What is the capital of France?"],
    evaluator: AgentEvalEvaluators.RAG(judgeClient),
    expectedOutput: ["Paris"],
    additionalContext: [
        new AgentEvalRAGContext("France is a country in Western Europe. Its capital is Paris."),
        new AgentEvalGroundTruthContext("Paris"),
    ]);

results.AssertAllPassed();

// RAG bundle runs 5 metrics:
// - llm_faithfulness: is the answer grounded in the provided context?
// - llm_relevance: does the answer address the question?
// - llm_context_precision: is the context relevant to the question?
// - llm_context_recall: does the context contain the answer?
// - llm_answer_correctness: does the answer match the ground truth?
```

For capabilities that require live execution data (not just conversation transcripts), the existing deep path via MAFEvaluationHarness and WorkflowEvaluationHarness remains the path:

```csharp
// Deep path — streaming evaluation with TTFT, tool timelines, cost tracking
var harness = new MAFEvaluationHarness(evaluatorClient);
var result = await harness.RunEvaluationStreamingAsync(
    new MAFAgentAdapter(agent), testCase,
    new StreamingOptions { OnFirstToken = ttft => Log($"TTFT: {ttft.TotalMs}ms") });

// result.Performance.TimeToFirstToken — not available through IEvaluator
// result.Timeline.ToAsciiDiagram() — not available through IEvaluator
// result.Performance.EstimatedCost — not available through IEvaluator
```

Both paths share the same metric implementations. The light path is for quick, integrated evaluation. The deep path is for comprehensive analysis and debugging.
- Date: 2026-03-25 (updated)
- Context: PR microsoft/agent-framework#4731 — "ADR-0020: Foundry Evals Integration" (merged 2026-03-20)
- Author: bentho (Ben Thompson)
- AgentEval contact: joslat
MAF's ADR-0020 introduces a provider-agnostic evaluation protocol with Azure AI Foundry as the first cloud provider. The .NET side reuses MEAI's IEvaluator as the core contract and adds a thin orchestration layer (AgentEvaluationExtensions) that runs agents, calls IEvaluator per item, and aggregates results.
The critical insight: AgentEval already has ~2,500 lines of deep MAF integration (AgentEval.MAF) that provides capabilities MAF's thin orchestration fundamentally cannot replicate: streaming with TTFT, tool call timelines, ChatProtocol/TurnToken handling, per-executor workflow graph analysis, session management, stochastic evaluation, and model comparison. Simply implementing IEvaluator and plugging into MAF's orchestration would abandon our deepest differentiator.
The architecture must be dual-path:
- Light path (MAF-native): AgentEval metrics as MEAI `IEvaluator` — for MAF users who want advanced metrics in their simple pipeline
- Deep path (AgentEval-native): Full MAF integration via existing harnesses — for users who need comprehensive evaluation with rich observability
- Shared result layer: Bidirectional conversion so results from either path are interoperable
| Concept | Type | Package |
|---|---|---|
| Evaluator contract | MEAI's `IEvaluator` (no new interface) | `Microsoft.Extensions.AI.Evaluation` |
| Orchestration | `AgentEvaluationExtensions` (extension methods on `AIAgent`, `Run`) | `Microsoft.Agents.AI` |
| Aggregate result | `AgentEvaluationResults` (passed/failed/total + sub-results for workflows) | `Microsoft.Agents.AI` |
| Data carrier | `EvalItem` (conversation + tools + expected output + split strategy) | `Microsoft.Agents.AI` |
| Local checks | `LocalEvaluator` + `FunctionEvaluator.Create()` | `Microsoft.Agents.AI` |
| Cloud provider | `FoundryEvals` (batches to Azure AI Foundry) | `Microsoft.Agents.AI.AzureAI` |
- No new evaluator interface in .NET — reuses MEAI's `IEvaluator` (`EvaluateAsync(messages, chatResponse, chatConfig)`)
- Conversation is the single source of truth — `EvalItem.conversation` contains full `ChatMessage` history; `query`/`response` are derived via split strategies
- Multiple evaluators compose naturally — pass `IEnumerable<IEvaluator>`, get one `AgentEvaluationResults` per evaluator
- Conversation split strategies — last-turn (default), full-conversation, per-turn, or custom `IConversationSplitter`
- Workflow evaluation — `run.EvaluateAsync()` extracts per-agent data from workflow events, provides `SubResults` breakdown
```
agent.EvaluateAsync(queries, evaluator)
│
├── For each query:
│     1. Runs agent → gets ChatResponse
│     2. Builds (messages, response) pair
│     3. Calls evaluator.EvaluateAsync(messages, response, chatConfig)
│     4. Collects EvaluationResult
│
└── Aggregates into AgentEvaluationResults { Passed, Failed, Total, AssertAllPassed() }
```
This is the critical gap that drives our architecture:
| Capability | MAF ADR-0020 | AgentEval.MAF (existing) |
|---|---|---|
| Run agent, collect response | Yes | Yes |
| Call evaluator per item | Yes | Yes |
| Streaming with TTFT tracking | No | MAFEvaluationHarness + MAFAgentAdapter |
| Tool call timeline (start/end/duration per tool) | No | ToolCallTimeline + streaming chunk tracking |
| Session management (multi-turn/memory) | No | ISessionResettableAgent + AgentSession |
| ChatProtocol detection + TurnToken | No | MAFWorkflowEventBridge |
| Per-executor workflow breakdown with tool calls | Partial (LINQ walk) | MAFWorkflowAdapter + ExecutorStep |
| Workflow graph extraction | No | MAFGraphExtractor + WorkflowGraphSnapshot |
| Edge traversal tracking | No | EdgeTraversedEvent + EdgeExecution |
| Parallel branch tracking | No | ParallelBranch records |
| Routing decision capture | No | RoutingDecision records |
| Workflow assertion DSL | No | WorkflowEvaluationHarness (7 built-in assertions) |
| Stochastic evaluation (N runs + P90) | num_repetitions only | StochasticRunner with full statistics |
| Model comparison with rankings | No | ModelComparer + IModelIdentifiable |
| Calibrated multi-judge consensus | No | CalibratedJudge |
| Cost estimation per evaluation | No | ModelPricing |
| Failure reports with root cause analysis | No | FailureReport with suggestions |
| Export to JUnit/SARIF/PDF | Portal only | Full export pipeline |
This is ~2,500 lines of production code that represents AgentEval's deepest MAF differentiation. We cannot afford to flatten this into a simple IEvaluator adapter.
MAFAgentAdapter (130 lines) — Wraps AIAgent as IStreamableAgent + ISessionResettableAgent:
- Session management: creates/resets `AgentSession` per invocation or explicitly
- Streaming: extracts `TextContent`, `FunctionCallContent`, `FunctionResultContent`, `UsageContent` from response chunks
- Token usage: captures from `AgentResponse.Usage` (`InputTokenCount`, `OutputTokenCount`)
MAFIdentifiableAgentAdapter (147 lines) — Adds IModelIdentifiable for model comparison:
- `ModelId` (e.g., "gpt-4o-2024-08-06") and `ModelDisplayName` (e.g., "GPT-4o")
- Used by `ModelComparer` for multi-model evaluation
MAFWorkflowEventBridge (314 lines) — Translates MAF's class-based events to AgentEval records:
- ChatProtocol detection: `workflow.DescribeProtocolAsync()` → `ChatProtocolExtensions.IsChatProtocol()`
- TurnToken handling: Manually sends `TurnToken(emitEvents: true)` for ChatProtocol-based executors (strings are silently dropped without this)
- Event translation: `ExecutorInvokedEvent` → edge tracking, `AgentResponseUpdateEvent` → tool call buffering + text accumulation, `ExecutorCompletedEvent` → output flush, `WorkflowOutputEvent` → workflow completion
- Tool call buffering: `FunctionCallContent` buffered until `FunctionResultContent` arrives → complete `ExecutorToolCallEvent` with duration
MAFWorkflowAdapter (649 lines) — Captures detailed per-executor execution data:
- `ExecutorStep` records: executorId, output, startOffset, duration, stepIndex, toolCalls, incoming/outgoing edges, parallelBranchId
- `EdgeExecution` records: source/target, edgeType, conditionResult, matchedSwitchLabel, routingReason
- `WorkflowGraphSnapshot`: nodes, edges, traversedEdges, parallelBranches, entry/exit node IDs
- `RoutingDecision` records: deciderExecutorId, possibleEdgeIds, selectedEdgeId, evaluatedValue
- Factory methods: `FromMAFWorkflow()`, `FromSteps()`, `WithGraph()`, `FromConditionalSteps()`
MAFGraphExtractor (173 lines) — Static graph extraction from Workflow objects:
- Uses `Workflow.ReflectEdges()` (public API)
- Translates `EdgeKind` (Direct, FanOut, FanIn) → `EdgeType` (Sequential, Conditional, ParallelFanOut, ParallelFanIn)
- Full MAF ID → clean name mapping (`"Planner_abc123"` → `"Planner"`)
- Exit node detection (executors with no outgoing edges)
MAFEvaluationHarness (743 lines) — General agent evaluation:
- Single test, streaming test, batch test, suite execution
- Performance metrics: TTFT, token counts (actual or estimated), cost estimation
- Tool call timeline extraction from `RawMessages`
- AI-powered evaluation via `IEvaluator` (AgentEval's)
- `FailureReport` with categorized reasons, severity levels, and suggestions
WorkflowEvaluationHarness (435 lines) — Workflow-specific testing:
- 7 built-in assertions: executor order (strict/loose), expected executors, output contains, max duration, no errors, expected tools, per-executor tools
- `WorkflowTestResult` with `AssertionResults` list
- Suite execution with `ContinueOnFailure` option
From Microsoft.Agents.AI: AIAgent, AgentSession, ChatClientAgent, ChatClientAgentOptions, AIFunctionFactory
From Microsoft.Agents.AI.Workflows: Workflow, WorkflowBuilder, InProcessExecution, StreamingRun, TurnToken, ChatProtocolExtensions, EdgeKind, ExecutorInvokedEvent, ExecutorCompletedEvent, ExecutorFailedEvent, AgentResponseUpdateEvent, WorkflowOutputEvent, WorkflowErrorEvent, SuperStepStartedEvent
From Microsoft.Extensions.AI: ChatMessage, ChatRole, FunctionCallContent, FunctionResultContent, TextContent, UsageContent, ChatOptions
MAF's agent.EvaluateAsync() runs the agent, captures the ChatResponse, and passes (messages, response) to IEvaluator. That's it. By the time the evaluator receives the data, the execution is over and the rich observability data is lost:
- No streaming chunks → no TTFT
- No per-tool start/end timestamps → no tool timeline
- No MAF event stream → no workflow graph traversal data
- No session management → no multi-turn memory evaluation
- No stochastic repetition → no statistical analysis
An IEvaluator implementation receives a conversation transcript. AgentEval's deep path captures the entire execution lifecycle.
Think of it as the difference between:
- Post-mortem evaluation (IEvaluator): "Here's a conversation log. Score it." — This is what MAF's orchestration provides.
- Live evaluation (AgentEval harness): "I watched every streaming chunk, every tool call start/end, every workflow edge traversal, every executor transition in real-time. Here's a comprehensive analysis." — This is what AgentEval.MAF provides.
Both are valuable. Different users need different depths. We must support both.
"AgentEval extends this with more advanced scenarios (tool-level evals, RAG grounding, cost, policies, red teaming, upcoming memory evaluation, etc.)"
This is correct — but understates it. The real extension isn't just "more metrics." It's a fundamentally richer evaluation execution model that captures data MAF's thin orchestration layer doesn't.
"Implement `AgentEvalEvaluator : IEvaluator`"
This is necessary but not sufficient. On its own it flattens AgentEval into a metric scorer and abandons the deep integration. The correct framing:
AgentEval provides two integration surfaces: (1) Advanced metrics as MEAI `IEvaluator` — plug into any MAF evaluation pipeline, and (2) A comprehensive evaluation engine for MAF agents with deep observability that MAF's native orchestration cannot provide.
```
┌─ DEVELOPER CODE ─────────────────────────────────────────────────────
│
│  ┌─ Light Path (MAF-native) ──────────────────────────────────────────
│  │
│  │   // Quick quality check — uses MAF orchestration
│  │   agent.EvaluateAsync(queries, AgentEvalEvaluators.Quality(client));
│  │
│  │   // Mix with Foundry — MAF handles everything
│  │   agent.EvaluateAsync(queries, [agentEvalMetrics, foundryEvals]);
│  │
│  └────────────────────────────────────────────────────────────────────
│
│  ┌─ Deep Path (AgentEval-native) ─────────────────────────────────────
│  │
│  │   // Full evaluation with streaming, tool timelines, cost tracking
│  │   var harness = new MAFEvaluationHarness(evaluatorClient);
│  │   var result = await harness.RunEvaluationStreamingAsync(
│  │       agent.AsEvaluableAgent(), testCase, streamingOpts);
│  │
│  │   // Workflow evaluation with graph analysis & assertions
│  │   var wfHarness = new WorkflowEvaluationHarness(evaluator, logger);
│  │   var wfResult = await wfHarness.RunWorkflowTestAsync(
│  │       MAFWorkflowAdapter.FromMAFWorkflow(workflow, ...),
│  │       testCase);
│  │
│  │   // Stochastic evaluation — 10 runs with statistics
│  │   var stochastic = await stochasticRunner.RunStochasticTestAsync(
│  │       agent.AsEvaluableAgent(), testCase,
│  │       new StochasticOptions { Repetitions = 10 });
│  │
│  │   // Model comparison
│  │   var comparison = await modelComparer.CompareModelsAsync(
│  │       [gpt4oFactory, gpt4MiniFactory], testCase);
│  │
│  └────────────────────────────────────────────────────────────────────
│
│  ┌─ Shared Result Layer ──────────────────────────────────────────────
│  │
│  │   // Deep path results → MAF format (for .AssertAllPassed())
│  │   var mafResults = result.ToAgentEvaluationResults();
│  │
│  │   // Either path → AgentEval export pipeline
│  │   await exporter.ExportAsync(result, "junit", stream);
│  │   await exporter.ExportAsync(result, "sarif", stream);
│  │   await exporter.ExportAsync(result, "pdf", stream);
│  │
│  └────────────────────────────────────────────────────────────────────
│
└──────────────────────────────────────────────────────────────────────
```
| Aspect | Light Path (MAF-native) | Deep Path (AgentEval-native) |
|---|---|---|
| Entry point | `agent.EvaluateAsync()` | `MAFEvaluationHarness` / `WorkflowEvaluationHarness` |
| Orchestration | MAF's `AgentEvaluationExtensions` | AgentEval's harnesses |
| Agent wrapping | None (uses `AIAgent` directly) | `MAFAgentAdapter` / `MAFWorkflowAdapter` |
| Streaming data | No | Yes (TTFT, per-chunk tool tracking) |
| Tool timeline | No | Yes (`ToolCallTimeline` with ASCII diagrams) |
| Session management | No | Yes (`ISessionResettableAgent`) |
| ChatProtocol/TurnToken | No (MAF handles internally) | Yes (explicit in `MAFWorkflowEventBridge`) |
| Workflow graph | No | Yes (`WorkflowGraphSnapshot`, edge traversal) |
| Workflow assertions | No | Yes (7 built-in + custom) |
| Per-executor tools | Partial | Full (per `ExecutorStep`) |
| Stochastic runs | `num_repetitions` (basic) | `StochasticRunner` (P90, std dev, success rate) |
| Model comparison | No | Yes (`ModelComparer` + rankings) |
| Calibrated judges | No | Yes (`CalibratedJudge`) |
| Cost estimation | No | Yes (`ModelPricing`) |
| Failure diagnostics | Basic pass/fail | `FailureReport` with reasons + suggestions |
| Result format | `AgentEvaluationResults` | `TestResult` / `WorkflowTestResult` (convertible) |
| Export | Foundry portal | JUnit, SARIF, Markdown, JSON, PDF |
| Best for | Quick checks, CI gates | Deep evaluation, debugging, perf analysis |
Both paths share the same metrics (IMetric implementations) and the same evaluation intelligence. The difference is what data the metrics receive:
- Light path: Metrics receive conversation text only (extracted from `ChatMessage[]`)
- Deep path: Metrics receive `EvaluationContext` with tool usage, performance metrics, timeline, ground truth, context — the full picture
For metrics that only need text (quality metrics), both paths produce equivalent results. For metrics that need richer data (tool success, cost, latency), the deep path is required.
Goal: Any MAF user can use AgentEval metrics via agent.EvaluateAsync().
File: src/AgentEval.MAF/Evaluators/AgentEvalMetricAdapter.cs
Wraps a single AgentEval.Core.IMetric as Microsoft.Extensions.AI.Evaluation.IEvaluator. Extracts input/output from ChatMessage[] + ChatResponse, builds an EvaluationContext, runs the metric, and converts MetricResult → MEAI EvaluationResult.
Score mapping: MetricResult.Score (0-100) → NumericMetric (1-5 scale for MEAI compat, raw 0-100 in interpretation).
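For illustration, a minimal sketch of such a rescaling. The exact formula, rounding, and clamping used by ResultConverter are not specified in this document, so treat `ToMeaiScale` as an assumption, not the shipped implementation:

```csharp
using System;

// Hypothetical linear rescaling from AgentEval's 0-100 score to MEAI's
// 1-5 NumericMetric scale: 0 → 1, 50 → 3, 100 → 5.
static double ToMeaiScale(double score100) =>
    1.0 + Math.Clamp(score100, 0.0, 100.0) / 100.0 * 4.0;

Console.WriteLine(ToMeaiScale(0));    // 1
Console.WriteLine(ToMeaiScale(75));   // 4
Console.WriteLine(ToMeaiScale(100)); // 5
```

Because the raw 0-100 score also travels in the interpretation, the mapping is lossless for consumers who need the original value.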
File: src/AgentEval.MAF/Evaluators/AgentEvalEvaluator.cs
Composite evaluator bundling multiple AgentEval metrics. Returns one EvaluationResult containing all metric scores as named metrics.
File: src/AgentEval.MAF/Evaluators/AgentEvalEvaluators.cs
```csharp
public static class AgentEvalEvaluators
{
    // Preset bundles
    public static AgentEvalEvaluator Quality(IChatClient judgeClient);
    public static AgentEvalEvaluator RAG(IChatClient judgeClient);
    public static AgentEvalEvaluator Agentic();
    public static AgentEvalEvaluator Safety(IChatClient judgeClient);
    public static AgentEvalEvaluator Memory(IChatClient judgeClient);
    public static AgentEvalEvaluator Advanced(IChatClient judgeClient); // all metrics

    // Individual metrics (for mixing with MEAI/Foundry evaluators)
    public static IEvaluator Faithfulness(IChatClient judgeClient);
    public static IEvaluator ToolSuccess();
    // ... one per metric

    // Custom
    public static AgentEvalEvaluator Custom(params IMetric[] metrics);
}
```

- `ConversationExtractor` — Extracts input/output from `ChatMessage[]` using split strategies (last-turn, full, per-turn)
- `EvalItemToContextConverter` — Converts MEAI evaluation data → AgentEval `EvaluationContext`
- `ResultConverter` — Bidirectional conversion: `MetricResult` ↔ `EvaluationResult`
Goal: Existing deep integration produces results consumable by both AgentEval and MAF ecosystems.
File: src/AgentEval.MAF/Evaluators/AgentEvalResultsAdapter.cs
```csharp
public static class AgentEvalResultsAdapter
{
    // Deep path → MAF format
    public static AgentEvaluationResults ToAgentEvaluationResults(TestResult result);
    public static AgentEvaluationResults ToAgentEvaluationResults(TestSummary summary);
    public static AgentEvaluationResults ToAgentEvaluationResults(WorkflowTestResult result);
    public static AgentEvaluationResults ToAgentEvaluationResults(StochasticResult result);

    // MAF format → Deep path (for feeding MAF results into AgentEval export)
    public static TestSummary ToTestSummary(AgentEvaluationResults results);
}
```

This allows:
```csharp
// Deep evaluation → MAF assertion pattern
var result = await harness.RunEvaluationStreamingAsync(agent, testCase);
result.ToAgentEvaluationResults().AssertAllPassed();

// Deep evaluation → SARIF export
await sarifExporter.ExportAsync(result, stream);
```

When converting workflow results, map per-executor steps to SubResults:
```csharp
// WorkflowTestResult with steps [Planner, Researcher, Writer]
// → AgentEvaluationResults with SubResults:
//   "Planner":    { Passed: 1, Total: 1 }
//   "Researcher": { Passed: 1, Total: 1 }
//   "Writer":     { Passed: 1, Total: 1 }
```

This aligns with ADR-0020's sub_results pattern for workflow evaluation.
Goal: Enhance the deep path with ADR-0020 concepts that enrich evaluation.
File: src/AgentEval.MAF/Evaluators/ConversationSplitters.cs
Add split strategy support to MAFEvaluationHarness for multi-turn evaluation:
- Last-turn: evaluate only the final exchange (current default behavior)
- Full: evaluate entire conversation trajectory
- Per-turn: evaluate each turn independently → returns multiple `TestResult`s
Aligns with ADR-0020's ConversationSplit.LAST_TURN / FULL / PER_TURN.
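As a rough sketch of the last-turn strategy, the following uses a simplified `Msg` record as a stand-in for the real `ChatMessage` type (the actual splitter API is not shown in this document):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Last-turn split sketch: input = the last user message,
// output = the assistant text that follows it.
var convo = new List<Msg>
{
    new("user", "Hi"),
    new("assistant", "Hello! How can I help?"),
    new("user", "What's the capital of France?"),
    new("assistant", "Paris."),
};

var (input, output) = SplitLastTurn(convo);
Console.WriteLine(input);   // What's the capital of France?
Console.WriteLine(output);  // Paris.

static (string Input, string Output) SplitLastTurn(IReadOnlyList<Msg> conversation)
{
    // Find the index of the last user turn.
    int lastUser = -1;
    for (int i = 0; i < conversation.Count; i++)
        if (conversation[i].Role == "user") lastUser = i;

    string input = lastUser >= 0 ? conversation[lastUser].Text : "";
    string output = string.Join("\n", conversation
        .Skip(lastUser + 1)
        .Where(m => m.Role == "assistant")
        .Select(m => m.Text));
    return (input, output);
}

record Msg(string Role, string Text);
```

A full-conversation or per-turn strategy would differ only in which slices of the same list are handed to the metrics.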
File: src/AgentEval.MAF/Evaluators/ToolCallAssertions.cs
Add ADR-0020-compatible expected tool call validation to the deep path:
```csharp
var testCase = new TestCase {
    ExpectedToolCalls = [
        new ExpectedToolCall("SearchFlights", new { destination = "Paris" }),
        new ExpectedToolCall("BookHotel")
    ]
};
```

Leverages existing ToolUsageExtractor + WorkflowEvaluationHarness assertions.
Ensure EvaluationContext carries GroundTruth through both paths:
- Light path: via MEAI's `additionalContext` parameter (custom `GroundTruthContext` subtype)
- Deep path: via `TestCase.ExpectedOutputContains` (existing) + new `TestCase.GroundTruth` property
Goal: Expose AgentEval's advanced capabilities (stochastic, comparison, red team) as convenient extension methods on MAF types.
File: src/AgentEval.MAF/MAF/MAFAgentExtensions.cs
```csharp
public static class MAFAgentExtensions
{
    // Stochastic evaluation with deep path (streaming + tool timelines + statistics)
    public static Task<StochasticResult> EvaluateStochasticAsync(
        this AIAgent agent,
        TestCase testCase,
        StochasticOptions? options = null,
        CancellationToken ct = default);

    // Model comparison via deep path
    public static Task<ModelComparisonResult> CompareAsync(
        this IEnumerable<(AIAgent Agent, string ModelId, string DisplayName)> agents,
        TestCase testCase,
        ModelComparisonOptions? options = null,
        CancellationToken ct = default);

    // Red team via deep path
    public static Task<RedTeamResult> RedTeamAsync(
        this AIAgent agent,
        RedTeamOptions? options = null,
        CancellationToken ct = default);

    // Memory evaluation via deep path
    public static Task<MemoryBenchmarkResult> EvaluateMemoryAsync(
        this AIAgent agent,
        MemoryBenchmarkOptions? options = null,
        CancellationToken ct = default);
}
```

These methods use the deep path internally (wrapping AIAgent with MAFAgentAdapter) while providing a MAF-native API surface.
File: src/AgentEval.MAF/DependencyInjection/MAFEvaluatorServiceExtensions.cs
```csharp
public static class MAFEvaluatorServiceExtensions
{
    // Register AgentEval metrics as MEAI IEvaluator (light path)
    public static IServiceCollection AddAgentEvalAsMAFEvaluator(
        this IServiceCollection services,
        Action<AgentEvalMAFOptions>? configure = null);

    // Register full AgentEval pipeline for MAF (deep path)
    public static IServiceCollection AddAgentEvalForMAF(
        this IServiceCollection services,
        Action<AgentEvalMAFOptions>? configure = null);
}
```

File: src/AgentEval.MAF/Evaluators/AgentEvalContexts.cs
Custom MEAI EvaluationContext subtypes for passing rich data through the light path:
```csharp
public class GroundTruthContext : Microsoft.Extensions.AI.Evaluation.EvaluationContext
{
    public string GroundTruth { get; init; }
}

public class ToolExpectationContext : Microsoft.Extensions.AI.Evaluation.EvaluationContext
{
    public IReadOnlyList<ExpectedToolCall> ExpectedToolCalls { get; init; }
}

public class RAGContext : Microsoft.Extensions.AI.Evaluation.EvaluationContext
{
    public string RetrievedContext { get; init; }
}
```

| File | Purpose | Phase |
|---|---|---|
| `src/AgentEval.MAF/Evaluators/AgentEvalMetricAdapter.cs` | Single-metric MEAI IEvaluator adapter | 1 |
| `src/AgentEval.MAF/Evaluators/AgentEvalEvaluator.cs` | Composite multi-metric MEAI IEvaluator | 1 |
| `src/AgentEval.MAF/Evaluators/AgentEvalEvaluators.cs` | Static factory with preset bundles | 1 |
| `src/AgentEval.MAF/Evaluators/ConversationExtractor.cs` | ChatMessage → input/output extraction | 1 |
| `src/AgentEval.MAF/Evaluators/EvalItemToContextConverter.cs` | MEAI data → AgentEval EvaluationContext | 1 |
| `src/AgentEval.MAF/Evaluators/ResultConverter.cs` | Bidirectional result conversion | 1 |
| `src/AgentEval.MAF/Evaluators/AgentEvalResultsAdapter.cs` | TestResult ↔ AgentEvaluationResults | 2 |
| `src/AgentEval.MAF/Evaluators/ConversationSplitters.cs` | Split strategy implementations | 3 |
| `src/AgentEval.MAF/Evaluators/ToolCallAssertions.cs` | Expected tool call validation | 3 |
| `src/AgentEval.MAF/Evaluators/AgentEvalContexts.cs` | Custom MEAI EvaluationContext subtypes | 6 |
| `src/AgentEval.MAF/MAF/MAFAgentExtensions.cs` | Extension methods for advanced capabilities | 4 |
| `src/AgentEval.MAF/DependencyInjection/MAFEvaluatorServiceExtensions.cs` | DI registration | 5 |
| `tests/AgentEval.MAF.Tests/Evaluators/` | Unit tests for all new classes | All |
| `samples/AgentEval.MAFIntegration/` | End-to-end sample showing both paths | All |
| File | Change |
|---|---|
| `src/AgentEval.MAF/AgentEval.MAF.csproj` | Add Microsoft.Extensions.AI.Evaluation package reference |
| `src/AgentEval.Core/Adapters/MicrosoftEvaluatorAdapter.cs` | Extract shared score conversion to ScoreNormalizer |
| File | Why |
|---|---|
| `MAFAgentAdapter.cs` | Core deep path — session management, streaming |
| `MAFIdentifiableAgentAdapter.cs` | Core deep path — model comparison |
| `MAFWorkflowEventBridge.cs` | Core deep path — ChatProtocol/TurnToken/event translation |
| `MAFWorkflowAdapter.cs` | Core deep path — per-executor/graph/routing capture |
| `MAFGraphExtractor.cs` | Core deep path — static graph extraction |
| `MAFEvaluationHarness.cs` | Core deep path — general evaluation with rich metrics |
| `WorkflowEvaluationHarness.cs` | Core deep path — workflow assertions |
// Just swap in AgentEval metrics where you'd use MEAI evaluators
var results = await agent.EvaluateAsync(
queries: ["What's the weather in Seattle?", "Plan a trip to Portland"],
evaluator: AgentEvalEvaluators.Quality(judgeClient));
results.AssertAllPassed();
// Mix AgentEval + MEAI + Foundry
var results = await agent.EvaluateAsync(
queries,
evaluators: [
new RelevanceEvaluator(), // MEAI native
AgentEvalEvaluators.Agentic(), // AgentEval tools metrics
AgentEvalEvaluators.Faithfulness(judgeClient), // AgentEval individual
new FoundryEvals(projectClient, "gpt-4o"), // Foundry cloud
]);// Agent evaluation with streaming, tool timelines, cost tracking
var harness = new MAFEvaluationHarness(evaluatorClient);
var agent = new MAFAgentAdapter(myAgent);
var result = await harness.RunEvaluationStreamingAsync(agent, testCase,
new StreamingOptions
{
OnFirstToken = ttft => Console.WriteLine($"TTFT: {ttft.TotalMilliseconds}ms"),
OnToolStart = tool => Console.WriteLine($"Tool: {tool.Name}"),
});
// Rich performance data
Console.WriteLine($"Tokens: {result.Performance.TotalTokens}");
Console.WriteLine($"Cost: ${result.Performance.EstimatedCost:F4}");
Console.WriteLine($"Timeline:\n{result.Timeline.ToAsciiDiagram()}");
// Convert to MAF format if needed
result.ToAgentEvaluationResults().AssertAllPassed();
// Export to CI formats
await new JUnitExporter().ExportAsync(result, stream);
await new SarifExporter().ExportAsync(result, stream);

var adapter = MAFWorkflowAdapter.FromMAFWorkflow(
workflow, "TripPlanner",
executorIds: ["Planner", "Researcher", "Writer"]);
var wfHarness = new WorkflowEvaluationHarness(evaluator, logger);
var result = await wfHarness.RunWorkflowTestAsync(adapter, new WorkflowTestCase
{
Name = "Trip Planning E2E",
Input = "Plan a 3-day trip to Paris",
ExpectedExecutors = ["Planner", "Researcher", "Writer"],
StrictExecutorOrder = true,
ExpectedTools = ["SearchFlights", "SearchHotels"],
PerExecutorExpectedTools = new() {
["Researcher"] = ["SearchFlights", "SearchHotels"]
},
MaxDuration = TimeSpan.FromSeconds(30)
});
// Graph structure available
var graph = result.ExecutionResult.Graph;
Console.WriteLine($"Nodes: {graph.Nodes.Count}, Edges: {graph.Edges.Count}");
Console.WriteLine($"Traversed: {graph.TraversedEdges.Count} edges");

// Stochastic: 10 runs with statistical analysis
var stochastic = await agent.EvaluateStochasticAsync(
testCase,
new StochasticOptions { Repetitions = 10 });
Console.WriteLine($"Mean: {stochastic.MeanScore:F1}, P90: {stochastic.P90Score:F1}");
// Compare models
var comparison = await new[] {
(gpt4o, "gpt-4o", "GPT-4o"),
(gpt4Mini, "gpt-4o-mini", "GPT-4o Mini")
}.CompareAsync(testCase);
Console.WriteLine($"Winner: {comparison.Winner.ModelDisplayName}");

"Implement AgentEvalEvaluator : IEvaluator → so AgentEval can act as a more advanced evaluation engine behind the interface"
We're building two integration surfaces that share the same evaluation intelligence:

1. AgentEval metrics as MEAI IEvaluator — so agent.EvaluateAsync(queries, AgentEvalEvaluators.Quality(client)) just works. Your users get 19 advanced metrics (faithfulness, tool success, RAG grounding, safety) with zero changes to their evaluation code.
2. A comprehensive evaluation engine for MAF agents — for users who need more than per-item scoring. This includes streaming evaluation with TTFT tracking, per-executor workflow analysis with graph structure and edge traversal, stochastic evaluation with statistical analysis, model comparison, calibrated multi-judge consensus, red teaming, memory evaluation, and rich export (JUnit, SARIF, PDF).

Both paths produce interoperable results — deep path results convert to AgentEvaluationResults for .AssertAllPassed(), and MAF results feed into AgentEval's export pipeline.

We're not duplicating orchestration. MAF's AgentEvaluationExtensions runs agents and calls evaluators. AgentEval provides the evaluation intelligence behind IEvaluator for the light path, and a richer execution model for users who need it.
| Phase | Priority | Effort | Dependency |
|---|---|---|---|
| 1. Core MEAI Adapter (Light Path) | P0 | 3-4 days | MAF ADR-0020 merged (done) |
| 2. Result Interop (Bridge) | P0 | 2-3 days | Phase 1 |
| 3. Deep Path ADR-0020 Enhancements | P1 | 2-3 days | Phase 1 |
| 4. Advanced Extension Methods | P1 | 2-3 days | Phase 2 |
| 5. DI Integration | P1 | 1 day | Phase 2 |
| 6. MEAI Additional Context | P2 | 1-2 days | Phase 1 |
| Total | | ~2-3 weeks | |
- Unit tests: each adapter/converter with mocked inputs
- Light path integration: AIAgent.EvaluateAsync() with AgentEvalEvaluator containing 3+ metrics
- Deep path integration: MAFEvaluationHarness streaming evaluation → export to JUnit
- Interop test: convert deep path TestResult → AgentEvaluationResults.AssertAllPassed()
- Mixed evaluators: AgentEvalEvaluators.Agentic() + RelevanceEvaluator() + FoundryEvals in the same call
- Workflow test: WorkflowEvaluationHarness with per-executor breakdown → convert to MAF SubResults
- AgentEvaluationResults — shipped in current RC? Needed for the result adapter. If not yet, we target GA.
- IConversationSplitter — defined or conceptual? We should align implementations. If MAF hasn't defined it yet, we can propose it.
- additionalContext schema — is there a defined protocol for passing ground truth and expected tool calls through MEAI's IEnumerable<EvaluationContext>? We should co-define one.
- Workflow event access — MAF's run.EvaluateAsync() walks OutgoingEvents. Does AgentEval need to consume these events too, or should we continue using InProcessExecution.RunStreamingAsync() + WatchStreamAsync(), which gives us richer data?
- Upstream contributions — our ConversationExtractor and ToolCallAssertions could benefit MAF's LocalEvaluator. Worth discussing a PR.
Understanding the current data flow is critical before building the light path.
MicrosoftEvaluatorAdapter wraps any MEAI IEvaluator (e.g., RelevanceEvaluator) as an AgentEval IMetric, so it can run inside AgentEval's harnesses alongside native metrics.
MEAI IEvaluator (e.g., RelevanceEvaluator, CoherenceEvaluator)
│
│ MicrosoftEvaluatorAdapter wraps it as AgentEval IMetric
│ Score: 1-5 → ScoreNormalizer.FromOneToFive() → 0-100
│
▼
AgentEval IMetric.EvaluateAsync(EvaluationContext) → MetricResult
│
▼
MAFEvaluationHarness runs it alongside other AgentEval metrics
Key file: src/AgentEval.Core/Adapters/MicrosoftEvaluatorAdapter.cs (228 lines)
This is the light path — the reverse bridge:
AgentEval IMetric (e.g., FaithfulnessMetric, ToolSuccessMetric)
│
│ NEW: AgentEvalMetricAdapter wraps it as MEAI IEvaluator
│ Score: 0-100 → ScoreNormalizer.ToOneToFive() → 1-5
│
▼
MEAI IEvaluator.EvaluateAsync(messages, response, chatConfig) → EvaluationResult
│
▼
MAF's agent.EvaluateAsync() calls it alongside MEAI/Foundry evaluators
The reverse conversion ScoreNormalizer.ToOneToFive() already exists at src/AgentEval.Core/Core/ScoreNormalizer.cs:33. The light path is mirroring what we already built, in the opposite direction.
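Both directions are simple affine maps between the two scales. A minimal self-contained sketch — ToOneToFive matches the implementation quoted later in this plan (step 0.4); the FromOneToFive body is an assumed exact inverse, since only its usage (not its source) appears in this document:

```csharp
using System;

// AgentEval 0-100 → MEAI 1-5 (matches the ScoreNormalizer.ToOneToFive
// implementation quoted in step 0.4 of this plan).
double ToOneToFive(double score) => (Math.Clamp(score, 0.0, 100.0) / 25.0) + 1.0;

// MEAI 1-5 → AgentEval 0-100 (assumed exact inverse; the real
// FromOneToFive may differ in clamping details).
double FromOneToFive(double score) => (Math.Clamp(score, 1.0, 5.0) - 1.0) * 25.0;

Console.WriteLine(ToOneToFive(75.0));  // 4
Console.WriteLine(FromOneToFive(4.0)); // 75
```

The light path applies the first map when emitting NumericMetric values; the existing MicrosoftEvaluatorAdapter applies the second when ingesting MEAI scores.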
AIAgent (MAF)
│
│ MAFAgentAdapter wraps it as IStreamableAgent
│ - Session management (AgentSession)
│ - Streaming chunks: TextContent, FunctionCallContent, FunctionResultContent, UsageContent
│ - Token usage from AgentResponse.Usage
│
▼
MAFEvaluationHarness.RunEvaluationStreamingAsync()
│
├── Captures TTFT from first streaming chunk
├── Tracks tool calls with start/end timestamps per tool
├── Extracts token usage (actual or estimated)
├── Calculates cost via ModelPricing
├── Builds ToolCallTimeline with ASCII diagrams
├── Runs AgentEval IMetrics on rich EvaluationContext
├── Generates FailureReport with categorized reasons & suggestions
│
▼
TestResult { Score, Performance, ToolUsage, Timeline, Failure, CriteriaResults }
│
├── Export: JUnit XML, SARIF, Markdown, JSON, PDF
└── Assertions: Fluent API for tools, performance, response
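The "Captures TTFT from first streaming chunk" step above boils down to timing the first yielded chunk. A minimal standalone sketch of the idea — the delegate shape is illustrative, not the harness's actual StreamingOptions API:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Minimal sketch of TTFT capture: start a stopwatch before consuming the
// stream and record elapsed time at the first chunk only. The real harness
// does this inside RunEvaluationStreamingAsync; this is an illustration.
static async Task<TimeSpan?> MeasureTtftAsync(
    IAsyncEnumerable<string> chunks, Action<TimeSpan> onFirstToken)
{
    var sw = Stopwatch.StartNew();
    TimeSpan? ttft = null;
    await foreach (var chunk in chunks)
    {
        if (ttft is null)
        {
            ttft = sw.Elapsed;       // time to first token
            onFirstToken(ttft.Value);
        }
        // ...accumulate text, tool calls, usage from remaining chunks...
    }
    return ttft;
}

// Simulated agent stream standing in for a real model response.
static async IAsyncEnumerable<string> FakeStream()
{
    await Task.Delay(10);
    yield return "Hello";
    yield return " world";
}

var ttft = await MeasureTtftAsync(FakeStream(), t => Console.WriteLine($"TTFT: {t.TotalMilliseconds:F0}ms"));
```

This is also why TTFT is impossible in the light path: a post-mortem IEvaluator only ever sees the completed transcript, never the stream.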
For workflows, the deep path is even richer:
Workflow (MAF)
│
│ MAFWorkflowAdapter.FromMAFWorkflow()
│ - Uses Workflow.ReflectEdges() for static graph
│ - Builds fullId → cleanName mapping
│
▼
MAFWorkflowEventBridge.StreamAsAgentEvalEvents()
│
├── Detects ChatProtocol vs function-based executors
├── Sends ChatMessage input (not string — strings are silently dropped!)
├── Sends TurnToken(emitEvents: true) for ChatProtocol executors
├── Translates MAF events → AgentEval records:
│ ExecutorInvokedEvent → edge tracking + output flush
│ AgentResponseUpdateEvent → tool call buffering + text accumulation
│ ExecutorCompletedEvent → ExecutorOutputEvent
│ WorkflowOutputEvent → WorkflowCompleteEvent
│
▼
MAFWorkflowAdapter.ExecuteWorkflowAsync()
│
├── Per-executor: ExecutorStep { output, timing, toolCalls, edges, branchId }
├── Edge traversal: EdgeExecution { source, target, type, condition, routing }
├── Graph: WorkflowGraphSnapshot { nodes, edges, traversedEdges, branches }
├── Routing: RoutingDecision { decider, possibleEdges, selected, reason }
│
▼
WorkflowEvaluationHarness.RunWorkflowTestAsync()
│
├── Assertion: Executor order (strict or loose)
├── Assertion: Expected executors present
├── Assertion: Output contains expected text
├── Assertion: Max duration
├── Assertion: No errors
├── Assertion: Expected tools (across all executors)
├── Assertion: Per-executor expected tools
│
▼
WorkflowTestResult { ExecutionResult, AssertionResults, Duration }
AgentEval has 19 metrics across 6 categories. Each implements IMetric.EvaluateAsync(EvaluationContext) → MetricResult. The light path can surface all of them, but with varying data richness:
| Category | Metric Class | Name | Needs from Context | Light Path Compat |
|---|---|---|---|---|
| RAG/Quality | FaithfulnessMetric | llm_faithfulness | Input, Output, Context | Partial ¹ |
| | RelevanceMetric | llm_relevance | Input, Output | Full |
| | ContextPrecisionMetric | llm_context_precision | Input, Output, Context, GroundTruth | Partial ¹ |
| | ContextRecallMetric | llm_context_recall | Input, Output, Context, GroundTruth | Partial ¹ |
| | AnswerCorrectnessMetric | llm_answer_correctness | Input, Output, GroundTruth | Partial ¹ |
| Quality | GroundednessMetric | llm_groundedness | Input, Output, Context | Partial ¹ |
| | CoherenceMetric | llm_coherence | Input, Output | Full |
| | FluencyMetric | llm_fluency | Input, Output | Full |
| Agentic | ToolSelectionMetric | code_tool_selection | Input, Output, ToolUsage | Partial ² |
| | ToolArgumentsMetric | code_tool_arguments | Input, Output, ToolUsage | Partial ² |
| | ToolSuccessMetric | code_tool_success | ToolUsage | Partial ² |
| | TaskCompletionMetric | llm_task_completion | Input, Output | Full |
| | ToolEfficiencyMetric | code_tool_efficiency | ToolUsage + Performance | No ³ |
| Safety | ToxicityMetric | llm_toxicity | Input, Output | Full |
| | BiasMetric | llm_bias | Input, Output | Full |
| | MisinformationMetric | llm_misinformation | Input, Output | Full |
| Retrieval | RecallAtKMetric | code_recall_at_k | Context, GroundTruth | Partial ¹ |
| | MRRMetric | code_mrr | Context, GroundTruth | Partial ¹ |
| Conversation | ConversationCompletenessMetric | llm_conversation_completeness | Input, Output | Full |
Legend:

- Full — works perfectly through the light path. Only needs Input + Output from ChatMessage[]/ChatResponse.
- Partial ¹ — works IF the caller passes Context/GroundTruth through MEAI's additionalContext parameter. Requires custom EvaluationContext subtypes (Phase 6).
- Partial ² — can extract tool call names/arguments from FunctionCallContent/FunctionResultContent in the conversation messages, but has no timing data: no per-tool duration, no start/end timestamps. ToolUsageReport will have call names but Duration will be zero.
- No ³ — ToolEfficiencyMetric requires PerformanceMetrics (token counts, cost, total duration), which MAF's orchestration does not provide. Deep path only.
- 9 metrics work fully through the light path (quality, safety, conversation, task completion)
- 9 metrics work partially (RAG needs additionalContext; agentic gets tool names but no timing)
- 1 metric requires deep path (tool efficiency needs performance data)
This is a strong story: "9 of our 19 metrics work out of the box. Another 9 work with data enrichment. And our deep path gives you the full picture."
File: src/AgentEval.MAF/Evaluators/AgentEvalMetricAdapter.cs
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using AgentEval.Core;
using AgentEvalEvaluationContext = AgentEval.Core.EvaluationContext;
using MEAIEvaluationContext = Microsoft.Extensions.AI.Evaluation.EvaluationContext;
using MEAIIEvaluator = Microsoft.Extensions.AI.Evaluation.IEvaluator;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Wraps a single AgentEval <see cref="IMetric"/> as an MEAI <see cref="MEAIIEvaluator"/>,
/// enabling AgentEval metrics to be used inside MAF's <c>agent.EvaluateAsync()</c>
/// orchestration alongside MEAI and Foundry evaluators.
/// </summary>
/// <remarks>
/// This is the "light path" — post-mortem evaluation of a conversation transcript.
/// For live evaluation with streaming, tool timelines, and workflow graph analysis,
/// use <see cref="MAFEvaluationHarness"/> or <see cref="WorkflowEvaluationHarness"/> (deep path).
/// </remarks>
public class AgentEvalMetricAdapter : MEAIIEvaluator
{
private readonly IMetric _metric;
public AgentEvalMetricAdapter(IMetric metric)
{
_metric = metric ?? throw new ArgumentNullException(nameof(metric));
}
public async Task<EvaluationResult> EvaluateAsync(
IList<ChatMessage> messages,
ChatResponse response,
ChatConfiguration? chatConfiguration = null,
IEnumerable<MEAIEvaluationContext>? additionalContext = null,
CancellationToken cancellationToken = default)
{
// 1. Extract input/output from conversation (last-turn split)
var input = ConversationExtractor.ExtractLastUserMessage(messages);
var output = response.Message?.Text ?? "";
// 2. Build AgentEval EvaluationContext with available data
var context = new AgentEvalEvaluationContext
{
Input = input,
Output = output,
Context = AdditionalContextHelper.Extract<string>(additionalContext, "RAGContext"),
GroundTruth = AdditionalContextHelper.Extract<string>(additionalContext, "GroundTruth"),
ToolUsage = ConversationExtractor.ExtractToolUsage(messages, response),
};
// 3. Run the AgentEval metric
var metricResult = await _metric.EvaluateAsync(context, cancellationToken);
// 4. Convert to MEAI EvaluationResult
return ResultConverter.ToMEAI(metricResult);
}
}

File: src/AgentEval.MAF/Evaluators/AgentEvalEvaluator.cs
/// <summary>
/// Composite evaluator that bundles multiple AgentEval metrics into a single
/// MEAI <see cref="MEAIIEvaluator"/>. Each metric produces a named metric
/// in the returned <see cref="EvaluationResult"/>.
/// </summary>
public class AgentEvalEvaluator : MEAIIEvaluator
{
private readonly IReadOnlyList<IMetric> _metrics;
public AgentEvalEvaluator(IEnumerable<IMetric> metrics)
{
_metrics = (metrics ?? throw new ArgumentNullException(nameof(metrics))).ToList();
}
public async Task<EvaluationResult> EvaluateAsync(
IList<ChatMessage> messages,
ChatResponse response,
ChatConfiguration? chatConfiguration = null,
IEnumerable<MEAIEvaluationContext>? additionalContext = null,
CancellationToken cancellationToken = default)
{
var input = ConversationExtractor.ExtractLastUserMessage(messages);
var output = response.Message?.Text ?? "";
var context = new AgentEvalEvaluationContext
{
Input = input,
Output = output,
Context = AdditionalContextHelper.Extract<string>(additionalContext, "RAGContext"),
GroundTruth = AdditionalContextHelper.Extract<string>(additionalContext, "GroundTruth"),
ToolUsage = ConversationExtractor.ExtractToolUsage(messages, response),
};
// Run all metrics and collect results
var result = new EvaluationResult();
foreach (var metric in _metrics)
{
var metricResult = await metric.EvaluateAsync(context, cancellationToken);
ResultConverter.AddToEvaluationResult(result, metricResult);
}
return result;
}
}

File: src/AgentEval.MAF/Evaluators/AgentEvalEvaluators.cs
using AgentEval.Metrics.Agentic;
using AgentEval.Metrics.RAG;
using AgentEval.Metrics.Safety;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Factory for creating AgentEval evaluators that implement MEAI's IEvaluator.
/// Use these with MAF's <c>agent.EvaluateAsync()</c> extension methods.
/// </summary>
/// <remarks>
/// Mirrors the discoverability pattern of <c>FoundryEvals.RELEVANCE</c>.
/// </remarks>
public static class AgentEvalEvaluators
{
// ── Preset Bundles ──────────────────────────────────────────────
/// <summary>Quality metrics: faithfulness, relevance, coherence, fluency.</summary>
public static AgentEvalEvaluator Quality(IChatClient judgeClient) => new([
new FaithfulnessMetric(judgeClient),
new RelevanceMetric(judgeClient),
new CoherenceMetric(judgeClient),
new FluencyMetric(judgeClient)]);
/// <summary>RAG metrics: faithfulness, relevance, context precision/recall, answer correctness.</summary>
public static AgentEvalEvaluator RAG(IChatClient judgeClient) => new([
new FaithfulnessMetric(judgeClient),
new RelevanceMetric(judgeClient),
new ContextPrecisionMetric(judgeClient),
new ContextRecallMetric(judgeClient),
new AnswerCorrectnessMetric(judgeClient)]);
/// <summary>Agentic metrics: tool selection, success, arguments (no timing data in light path).</summary>
public static AgentEvalEvaluator Agentic() => new([
new ToolSuccessMetric(),
new ToolSelectionMetric([])]); // caller overrides expected tools
/// <summary>Agentic metrics with expected tools specified.</summary>
public static AgentEvalEvaluator Agentic(IEnumerable<string> expectedTools) => new([
new ToolSuccessMetric(),
new ToolSelectionMetric(expectedTools),
new ToolArgumentsMetric(expectedTools)]);
/// <summary>Safety metrics: toxicity, bias, misinformation.</summary>
public static AgentEvalEvaluator Safety(IChatClient judgeClient) => new([
new ToxicityMetric(judgeClient),
new BiasMetric(judgeClient),
new MisinformationMetric(judgeClient)]);
/// <summary>All available metrics (quality + RAG + agentic + safety).</summary>
public static AgentEvalEvaluator Advanced(IChatClient judgeClient) => new([
new FaithfulnessMetric(judgeClient),
new RelevanceMetric(judgeClient),
new CoherenceMetric(judgeClient),
new FluencyMetric(judgeClient),
new GroundednessMetric(judgeClient),
new ContextPrecisionMetric(judgeClient),
new ContextRecallMetric(judgeClient),
new AnswerCorrectnessMetric(judgeClient),
new ToolSuccessMetric(),
new TaskCompletionMetric(judgeClient),
new ToxicityMetric(judgeClient),
new BiasMetric(judgeClient),
new MisinformationMetric(judgeClient)]);
// ── Individual Metrics as MEAI IEvaluator ───────────────────────
public static MEAIIEvaluator Faithfulness(IChatClient c) => new AgentEvalMetricAdapter(new FaithfulnessMetric(c));
public static MEAIIEvaluator Relevance(IChatClient c) => new AgentEvalMetricAdapter(new RelevanceMetric(c));
public static MEAIIEvaluator Coherence(IChatClient c) => new AgentEvalMetricAdapter(new CoherenceMetric(c));
public static MEAIIEvaluator Fluency(IChatClient c) => new AgentEvalMetricAdapter(new FluencyMetric(c));
public static MEAIIEvaluator Groundedness(IChatClient c) => new AgentEvalMetricAdapter(new GroundednessMetric(c));
public static MEAIIEvaluator ToolSuccess() => new AgentEvalMetricAdapter(new ToolSuccessMetric());
public static MEAIIEvaluator TaskCompletion(IChatClient c) => new AgentEvalMetricAdapter(new TaskCompletionMetric(c));
public static MEAIIEvaluator Toxicity(IChatClient c) => new AgentEvalMetricAdapter(new ToxicityMetric(c));
public static MEAIIEvaluator Bias(IChatClient c) => new AgentEvalMetricAdapter(new BiasMetric(c));
public static MEAIIEvaluator Misinformation(IChatClient c) => new AgentEvalMetricAdapter(new MisinformationMetric(c));
// ── Custom Composition ──────────────────────────────────────────
/// <summary>Create a custom evaluator from specific AgentEval metrics.</summary>
public static AgentEvalEvaluator Custom(params IMetric[] metrics) => new(metrics);
}

File: src/AgentEval.MAF/Evaluators/ConversationExtractor.cs
using Microsoft.Extensions.AI;
using AgentEval.Models;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Extracts AgentEval-compatible data from MEAI conversation types.
/// </summary>
public static class ConversationExtractor
{
/// <summary>Extracts the last user message as input (last-turn split).</summary>
public static string ExtractLastUserMessage(IList<ChatMessage> messages)
{
for (int i = messages.Count - 1; i >= 0; i--)
{
if (messages[i].Role == ChatRole.User)
return messages[i].Text ?? "";
}
return messages.FirstOrDefault()?.Text ?? "";
}
/// <summary>
/// Extracts tool usage from conversation messages.
/// Captures tool names, arguments, and results — but NOT timing (unavailable in light path).
/// </summary>
public static ToolUsageReport? ExtractToolUsage(
IList<ChatMessage> messages, ChatResponse response)
{
var allMessages = messages.Concat(response.Messages ?? []);
var report = new ToolUsageReport();
var pendingCalls = new Dictionary<string, ToolCallRecord>();
int order = 0;
foreach (var message in allMessages)
{
foreach (var content in message.Contents ?? [])
{
if (content is FunctionCallContent call)
{
var record = new ToolCallRecord
{
Name = call.Name,
CallId = call.CallId ?? Guid.NewGuid().ToString(),
Arguments = call.Arguments,
Order = ++order,
};
pendingCalls[record.CallId] = record;
}
else if (content is FunctionResultContent result)
{
var callId = result.CallId ?? "";
if (pendingCalls.TryGetValue(callId, out var pending))
{
pending.Result = result.Result;
pending.Exception = result.Exception;
report.AddCall(pending);
pendingCalls.Remove(callId);
}
}
}
}
// Add any unmatched calls
foreach (var pending in pendingCalls.Values.OrderBy(c => c.Order))
report.AddCall(pending);
return report.Count > 0 ? report : null;
}
}

File: src/AgentEval.MAF/Evaluators/ResultConverter.cs
using Microsoft.Extensions.AI.Evaluation;
using AgentEval.Core;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Converts between AgentEval and MEAI evaluation result types.
/// </summary>
public static class ResultConverter
{
/// <summary>Converts an AgentEval MetricResult to an MEAI EvaluationResult.</summary>
public static EvaluationResult ToMEAI(MetricResult metricResult)
{
var result = new EvaluationResult();
AddToEvaluationResult(result, metricResult);
return result;
}
/// <summary>Adds an AgentEval MetricResult as a named metric in an MEAI EvaluationResult.</summary>
public static void AddToEvaluationResult(EvaluationResult result, MetricResult metricResult)
{
// Convert 0-100 → 1-5 for MEAI NumericMetric
var meaiScore = ScoreNormalizer.ToOneToFive(metricResult.Score);
var numericMetric = new NumericMetric(metricResult.MetricName, meaiScore);
// Add interpretation with the original 0-100 score and explanation
var interpretation = $"AgentEval score: {metricResult.Score:F0}/100 " +
$"({ScoreNormalizer.Interpret(metricResult.Score)})";
if (!string.IsNullOrEmpty(metricResult.Explanation))
interpretation += $" — {metricResult.Explanation}";
numericMetric.AddDiagnostic(
MetricResultDiagnosticSeverity.Informational,
interpretation);
result.Metrics[metricResult.MetricName] = numericMetric;
}
}

File: src/AgentEval.MAF/Evaluators/AdditionalContextHelper.cs
using MEAIEvaluationContext = Microsoft.Extensions.AI.Evaluation.EvaluationContext;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Extracts typed data from MEAI's additionalContext parameter.
/// </summary>
public static class AdditionalContextHelper
{
/// <summary>Extracts a value by key from additional context.</summary>
public static T? Extract<T>(
IEnumerable<MEAIEvaluationContext>? additionalContext,
string key) where T : class
{
if (additionalContext == null) return null;
foreach (var ctx in additionalContext)
{
// Check custom context subtypes
if (key == "RAGContext" && ctx is AgentEvalContexts.RAGContext ragCtx)
return ragCtx.RetrievedContext as T;
if (key == "GroundTruth" && ctx is AgentEvalContexts.GroundTruthContext gtCtx)
return gtCtx.GroundTruth as T;
}
return null;
}
}

File: src/AgentEval.MAF/AgentEval.MAF.csproj
Add:
<PackageReference Include="Microsoft.Extensions.AI.Evaluation" />
<PackageReference Include="Microsoft.Extensions.AI.Evaluation.Quality" />

When AgentEval metrics run through the light path, they only receive what MAF's orchestration provides. This table shows exactly where data is lost:
| Data | Available in Light Path? | How | Deep Path Provides |
|---|---|---|---|
| Input text | Yes | Last ChatRole.User message | Same |
| Output text | Yes | ChatResponse.Message.Text | Same |
| RAG context | If provided | Caller passes RAGContext via additionalContext | TestCase.Context → EvaluationContext.Context |
| Ground truth | If provided | Caller passes GroundTruthContext via additionalContext | TestCase.GroundTruth → EvaluationContext.GroundTruth |
| Tool call names | Yes | Extracted from FunctionCallContent in messages | Same + ordering |
| Tool call arguments | Yes | From FunctionCallContent.Arguments | Same |
| Tool call results | Yes | From FunctionResultContent.Result | Same |
| Tool call timing | No | MAF's orchestration doesn't timestamp tool calls | Per-tool StartTime/EndTime/Duration |
| TTFT | No | No streaming in light path | StreamingOptions.OnFirstToken |
| Token counts | No | Not tracked by MAF orchestration | Actual from provider or estimated |
| Cost estimation | No | No token data → no cost | ModelPricing.EstimateCost() |
| Workflow graph | No | Not applicable | WorkflowGraphSnapshot |
| Per-executor breakdown | No | Not in single-evaluator path | ExecutorStep with per-executor data |
| Failure diagnostics | No | Just pass/fail scores | FailureReport with reasons + suggestions |
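The "If provided" rows rely on the caller supplying Phase 6 context subtypes through additionalContext. A self-contained sketch of the lookup pattern — tagged (Name, Content) pairs stand in here for the real EvaluationContext subtypes (RAGContext, GroundTruthContext), which the real helper pattern-matches by type rather than by name string:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the additionalContext extraction pattern. The tuple entries are
// stand-ins for the real Phase 6 EvaluationContext subtypes; the actual
// AdditionalContextHelper pattern-matches on subtype instead of a key string.
string? Extract(IEnumerable<(string Name, string Content)>? additionalContext, string key)
{
    if (additionalContext is null) return null;
    foreach (var (name, content) in additionalContext)
        if (name == key) return content;
    return null;
}

var ctx = new[]
{
    ("RAGContext", "Seattle is in Washington state."),
    ("GroundTruth", "Washington"),
};
Console.WriteLine(Extract(ctx, "GroundTruth"));                 // Washington
Console.WriteLine(Extract(ctx, "ExpectedTools") ?? "(absent)"); // (absent)
```

When a key is absent, the metric simply runs without that enrichment — which is exactly what degrades "Full" metrics to "Partial ¹" behavior.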
"9 of our 19 metrics work fully through the light path out of the box — quality, safety, conversation, task completion. Another 9 work with data enrichment via additionalContext. Only 1 (tool efficiency) strictly requires our deep path.

But the data gap matters: through IEvaluator, we get what tools were called but not how long they took. We get the response text but not time-to-first-token. We get pass/fail but not why it failed with root cause analysis.

That's why both paths exist. Light path for CI gates and quick checks. Deep path when you need to actually debug and optimize."
"We implemented the light path. AgentEval metrics now implement MEAI's IEvaluator. Demo:"

var results = await agent.EvaluateAsync(
    queries: ["What's the weather in Seattle?"],
    evaluator: AgentEvalEvaluators.Quality(judgeClient));
results.AssertAllPassed();

"19 metrics — faithfulness, tool success, toxicity, bias — alongside MEAI evaluators and Foundry. Zero new concepts."
Show the limitations table (Section 15). Emphasize:

- Agentic metrics are partially blind — IEvaluator gets tool calls from FunctionCallContent in messages, but no timing, no duration, no start/end timestamps
- No streaming data — TTFT is critical for UX evaluation, and impossible through post-mortem evaluation
- No workflow graph — run.EvaluateAsync() does per-agent LINQ extraction but doesn't capture edge traversal, routing decisions, parallel branches
- No stochastic analysis — num_repetitions runs N times but doesn't compute P90, std dev, success rate distributions
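The statistics the stochastic path reports are straightforward to compute over N repetition scores. A minimal sketch — the nearest-rank P90 here is an assumption, as the real implementation may interpolate between ranks:

```csharp
using System;
using System.Linq;

// Per-run scores from N repetitions of the same test case (sample data).
double[] scores = { 72, 85, 90, 78, 88, 95, 70, 82, 91, 86 };

double mean = scores.Average();
double stdDev = Math.Sqrt(scores.Average(s => Math.Pow(s - mean, 2)));

// Nearest-rank P90 (assumption; the real implementation may interpolate).
var sorted = scores.OrderBy(s => s).ToArray();
double p90 = sorted[(int)Math.Ceiling(0.90 * sorted.Length) - 1];

Console.WriteLine($"Mean: {mean:F1}, P90: {p90:F1}, StdDev: {stdDev:F1}");
// Mean: 83.7, P90: 91.0, StdDev: 7.8
```

This is the gap the deep path's EvaluateStochasticAsync fills: MAF's num_repetitions reruns the agent but leaves this aggregation to the caller.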
"The right model is two integration surfaces:

Light path — AgentEval as IEvaluator inside MAF. Already working. 9 metrics fully compatible, 9 more with data enrichment. Good for CI gates, quick quality checks, mixing with Foundry.

Deep path — AgentEval's evaluation engine wrapping MAF agents. For comprehensive evaluation: streaming + tool timelines + workflow graph analysis + stochastic + model comparison + red teaming + memory evaluation. Results convert to AgentEvaluationResults for interop.

No duplication — light path uses MAF's orchestration, deep path uses ours. Same metrics power both."
Concrete asks:

- Is AgentEvaluationResults public in the current RC? We need the type for result interop.
- Can we co-define the additionalContext protocol? AgentEval needs to pass GroundTruth, RAGContext, ExpectedToolCalls through MEAI's IEnumerable<EvaluationContext>. If we align on a schema, any evaluator can use it — not just AgentEval.
- IConversationSplitter — is it defined yet? We should align. If MAF hasn't implemented it, we can propose it.
- Would you accept a PR adding tool call timing to the orchestration layer? Even a lightweight Stopwatch around each tool call in AgentEvaluationExtensions would let evaluators get durations. This makes the light path useful for more agentic metrics without needing AgentEval's deep path.
- Upstream contributions — our ConversationExtractor and expected tool call validation could benefit MAF's LocalEvaluator and EvalChecks.
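The Stopwatch ask is tiny in practice. A hedged sketch of what the orchestration-side change could look like — the delegate and tuple shapes here are hypothetical illustrations, not MAF's actual signatures:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Illustration of the ask: wrap each tool invocation in a Stopwatch so the
// orchestration can hand evaluators a duration. Shapes are hypothetical.
static async Task<(object? Result, TimeSpan Duration)> TimeToolCallAsync(
    Func<Task<object?>> invokeTool)
{
    var sw = Stopwatch.StartNew();
    var result = await invokeTool();
    sw.Stop();
    return (result, sw.Elapsed);
}

var (result, duration) = await TimeToolCallAsync(async () =>
{
    await Task.Delay(25); // stand-in for a real tool call
    return (object?)"sunny";
});
Console.WriteLine($"Tool returned '{result}' in {duration.TotalMilliseconds:F0}ms");
```

With durations surfaced this way, the Partial ² agentic metrics could populate ToolCallRecord.Duration without the deep path.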
"MAF provides the native evaluation primitives and Foundry integration. AgentEval is the advanced evaluation engine that plugs in. Light path for simplicity, deep path for depth. One ecosystem, no fragmentation.
The light path is done. Next: result interop so deep path results convert to AgentEvaluationResults, and then we extend the deep path with ADR-0020 concepts like conversation split strategies and expected tool call assertions."
Objective: Enable agent.EvaluateAsync(queries, AgentEvalEvaluators.Quality(judgeClient)) to work — any AgentEval IMetric usable as an MEAI IEvaluator inside MAF's orchestration layer.

Estimated effort: 3–4 hours for a developer familiar with the codebase.

Prerequisites: Working AgentEval build; Microsoft.Extensions.AI.Evaluation.Quality v10.3.0 already in Directory.Packages.props.
Before writing any code, verify the foundation.
0.1 Confirm the MEAI evaluation packages are resolvable:
cd c:\git\AgentEval
dotnet restore src/AgentEval.MAF/AgentEval.MAF.csproj

0.2 Confirm the MEAI IEvaluator signature you'll implement. It lives in Microsoft.Extensions.AI.Evaluation (a transitive dependency of .Quality). The actual contract (verified during implementation) is:
namespace Microsoft.Extensions.AI.Evaluation;
public interface IEvaluator
{
IReadOnlyCollection<string> EvaluationMetricNames { get; }
ValueTask<EvaluationResult> EvaluateAsync(
IEnumerable<ChatMessage> messages,
ChatResponse response,
ChatConfiguration? chatConfiguration = null,
IEnumerable<EvaluationContext>? additionalContext = null,
CancellationToken cancellationToken = default);
}

Implementation note (discovered during build): The actual MEAI v10.3.0 IEvaluator interface differs from what was initially assumed in this plan in several ways:

- Returns ValueTask<EvaluationResult>, not Task<EvaluationResult>
- Requires an IReadOnlyCollection<string> EvaluationMetricNames property
- Uses IEnumerable<ChatMessage>, not IList<ChatMessage>
- ChatResponse has a .Text property, not .Message.Text
- The EvaluationContext base class requires constructor args (string name, string content) — you cannot use parameterless constructors or required init properties
- NumericMetric uses an Interpretation property with EvaluationMetricInterpretation(rating, failed, reason) — there is no AddDiagnostic method
- The NumericMetric constructor is (string name, double? value, string? reason) (3 args)

All code in Sections 14.1–14.6 below reflects the originally planned design. The actual implemented code in src/AgentEval.MAF/Evaluators/ incorporates these adaptations.
0.3 Confirm the key MEAI result types you'll produce:

- EvaluationResult — has IDictionary<string, EvaluationMetric> Metrics
- NumericMetric — new NumericMetric(string name, double? value, string? reason) → value is on the 1–5 scale
- EvaluationMetricInterpretation — new(EvaluationRating rating, bool failed, string? reason)
- EvaluationRating — Exceptional, Good, Average, Poor, Inconclusive
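To make the rating names concrete, here is a self-contained sketch of mapping a 1–5 metric value onto those rating bands. The threshold values are assumptions for illustration only — MEAI's built-in evaluators define their own interpretation rules:

```csharp
using System;

// Illustrative mapping from a 1-5 NumericMetric value to MEAI's rating
// names. Band thresholds are assumptions, not MEAI's actual rules.
string RateOneToFive(double value) => value switch
{
    >= 4.5 => "Exceptional",
    >= 3.5 => "Good",
    >= 2.5 => "Average",
    >= 1.0 => "Poor",
    _ => "Inconclusive",
};

Console.WriteLine(RateOneToFive(4.0)); // Good
Console.WriteLine(RateOneToFive(1.5)); // Poor
```

The real ResultConverter would build an EvaluationMetricInterpretation from a band like this plus a pass/fail flag and reason string.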
0.4 Verify ScoreNormalizer.ToOneToFive() exists at src/AgentEval.Core/Core/ScoreNormalizer.cs:33:
public static double ToOneToFive(double score)
{
var clamped = Math.Clamp(score, 0.0, 100.0);
return (clamped / 25.0) + 1.0;
}

This is the reverse of FromOneToFive() used by MicrosoftEvaluatorAdapter. You will reuse this.
0.5 Review the mirror adapter at src/AgentEval.Core/Adapters/MicrosoftEvaluatorAdapter.cs — your implementation is the structural reverse. Study lines 59–155 for the pattern of building ChatMessage lists, calling the evaluator, extracting metrics, and converting scores.
File: src/AgentEval.MAF/AgentEval.MAF.csproj
Action: Add the MEAI Evaluation package reference. The .Quality variant is already in Directory.Packages.props so its transitive base (Microsoft.Extensions.AI.Evaluation) is available. Add an explicit reference to guarantee the types resolve:
<ItemGroup>
<ProjectReference Include="../AgentEval.Abstractions/AgentEval.Abstractions.csproj" />
<ProjectReference Include="../AgentEval.Core/AgentEval.Core.csproj" />
</ItemGroup>
<ItemGroup>
<PackageReference Include="Microsoft.Agents.AI" />
<PackageReference Include="Microsoft.Agents.AI.Workflows" />
<!-- NEW: MEAI Evaluation for IEvaluator implementation -->
<PackageReference Include="Microsoft.Extensions.AI.Evaluation.Quality" />
</ItemGroup>Why .Quality and not just .Evaluation: The .Quality package is already version-pinned in Directory.Packages.props at v10.3.0. It transitively includes the base .Evaluation package. Using it directly avoids needing a second <PackageVersion> entry. It also gives us access to NumericMetric, BooleanMetric, etc. which live in the base package.
Verify: dotnet restore src/AgentEval.MAF/AgentEval.MAF.csproj — must succeed.
Action: Create the Evaluators folder inside the MAF project:
src/AgentEval.MAF/
├── MAF/ ← existing (7 files, untouched)
│ ├── MAFAgentAdapter.cs
│ ├── MAFIdentifiableAgentAdapter.cs
│ ├── MAFWorkflowEventBridge.cs
│ ├── MAFWorkflowAdapter.cs
│ ├── MAFGraphExtractor.cs
│ ├── MAFEvaluationHarness.cs
│ └── WorkflowEvaluationHarness.cs
└── Evaluators/ ← NEW (6 files)
├── AgentEvalMetricAdapter.cs
├── AgentEvalEvaluator.cs
├── AgentEvalEvaluators.cs
├── ConversationExtractor.cs
├── ResultConverter.cs
└── AdditionalContextHelper.cs
File: src/AgentEval.MAF/Evaluators/ConversationExtractor.cs
Why first: This has zero internal dependencies — other files depend on it but it depends on nothing in Evaluators/. Build from the leaves inward.
Full implementation:
// SPDX-License-Identifier: MIT
// Copyright (c) 2026 AgentEval Contributors
// Licensed under the MIT License.
using Microsoft.Extensions.AI;
using AgentEval.Models;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Extracts AgentEval-compatible data from MEAI <see cref="ChatMessage"/> conversation types.
/// Used by the light path adapters to convert MAF orchestration data into AgentEval's
/// <see cref="AgentEval.Core.EvaluationContext"/>.
/// </summary>
/// <remarks>
/// <para>
/// The light path receives a conversation transcript (<c>IList<ChatMessage></c> + <c>ChatResponse</c>)
/// from MAF's orchestration layer. This class extracts:
/// </para>
/// <list type="bullet">
/// <item>Input text — last user message (last-turn split, matching ADR-0020 default)</item>
/// <item>Tool usage — <see cref="FunctionCallContent"/>/<see cref="FunctionResultContent"/> pairs</item>
/// </list>
/// <para>
/// <b>Limitation:</b> Tool call timing is NOT available through the light path.
/// <see cref="ToolCallRecord.StartTime"/>, <see cref="ToolCallRecord.EndTime"/>, and
/// <see cref="ToolCallRecord.Duration"/> will be null. For tool timing, use the deep path
/// via <see cref="AgentEval.MAF.MAFEvaluationHarness"/>.
/// </para>
/// </remarks>
public static class ConversationExtractor
{
/// <summary>
/// Extracts the last user message text from a conversation (last-turn split).
/// This matches ADR-0020's default split strategy.
/// </summary>
/// <param name="messages">The conversation messages.</param>
/// <returns>
/// The text of the last <see cref="ChatRole.User"/> message, or the first message's text
/// as fallback, or empty string if no messages exist.
/// </returns>
public static string ExtractLastUserMessage(IList<ChatMessage> messages)
{
if (messages == null || messages.Count == 0)
return "";
// Walk backwards to find the last user message (last-turn split)
for (int i = messages.Count - 1; i >= 0; i--)
{
if (messages[i].Role == ChatRole.User)
return messages[i].Text ?? "";
}
// Fallback: return first message text regardless of role
return messages[0].Text ?? "";
}
/// <summary>
/// Extracts all user messages concatenated (full-conversation split).
/// Used when a metric needs the complete input trajectory.
/// </summary>
/// <param name="messages">The conversation messages.</param>
/// <returns>All user messages joined with newlines.</returns>
public static string ExtractAllUserMessages(IList<ChatMessage> messages)
{
if (messages == null || messages.Count == 0)
return "";
return string.Join("\n", messages
.Where(m => m.Role == ChatRole.User)
.Select(m => m.Text ?? "")
.Where(t => !string.IsNullOrEmpty(t)));
}
/// <summary>
/// Extracts tool usage from conversation messages and response.
/// </summary>
/// <remarks>
/// <para>
/// Pairs <see cref="FunctionCallContent"/> (tool invocation) with
/// <see cref="FunctionResultContent"/> (tool result) by matching on
/// <see cref="FunctionCallContent.CallId"/>.
/// </para>
/// <para>
/// <b>No timing data:</b> The light path receives a static conversation transcript.
/// There are no timestamps on when tools started or finished.
/// <see cref="ToolCallRecord.StartTime"/>, <see cref="ToolCallRecord.EndTime"/>,
/// and <see cref="ToolCallRecord.Duration"/> will all be null.
/// </para>
/// </remarks>
/// <param name="messages">The conversation messages (input side).</param>
/// <param name="response">The chat response (output side).</param>
/// <returns>
/// A <see cref="ToolUsageReport"/> with tool call records, or null if no tools were called.
/// </returns>
public static ToolUsageReport? ExtractToolUsage(
IList<ChatMessage> messages,
ChatResponse response)
{
var report = new ToolUsageReport();
var pendingCalls = new Dictionary<string, ToolCallRecord>();
int order = 0;
// Iterate all messages: input conversation + response messages
IEnumerable<ChatMessage> allMessages = messages;
if (response.Messages != null)
allMessages = allMessages.Concat(response.Messages);
// Also include the primary response message if it has content
if (response.Message != null)
allMessages = allMessages.Append(response.Message);
foreach (var message in allMessages)
{
if (message.Contents == null)
continue;
foreach (var content in message.Contents)
{
if (content is FunctionCallContent call)
{
var callId = call.CallId ?? Guid.NewGuid().ToString("N");
var record = new ToolCallRecord
{
Name = call.Name,
CallId = callId,
Arguments = call.Arguments,
Order = ++order,
// No timing data in light path — these remain null
};
pendingCalls[callId] = record;
}
else if (content is FunctionResultContent result)
{
var callId = result.CallId ?? "";
if (pendingCalls.TryGetValue(callId, out var pending))
{
pending.Result = result.Result;
pending.Exception = result.Exception;
report.AddCall(pending);
pendingCalls.Remove(callId);
}
else
{
// Result without matching call — create a standalone record
report.AddCall(new ToolCallRecord
{
Name = $"unknown_{callId}",
CallId = callId,
Result = result.Result,
Exception = result.Exception,
Order = ++order,
});
}
}
}
}
// Add any calls that never got a result (still pending)
foreach (var pending in pendingCalls.Values.OrderBy(c => c.Order))
report.AddCall(pending);
return report.Count > 0 ? report : null;
}
}

Verify: Ensure it compiles: dotnet build src/AgentEval.MAF/AgentEval.MAF.csproj
File: src/AgentEval.MAF/Evaluators/AdditionalContextHelper.cs
Why second: Zero internal dependencies. AgentEvalMetricAdapter will use this.
Full implementation:
// SPDX-License-Identifier: MIT
// Copyright (c) 2026 AgentEval Contributors
// Licensed under the MIT License.
using MEAIEvaluationContext = Microsoft.Extensions.AI.Evaluation.EvaluationContext;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Extracts typed data from MEAI's <c>additionalContext</c> parameter.
/// </summary>
/// <remarks>
/// <para>
/// MEAI's <see cref="Microsoft.Extensions.AI.Evaluation.IEvaluator.EvaluateAsync"/> accepts
/// an <c>IEnumerable<EvaluationContext>?</c> parameter for passing extra data.
/// AgentEval uses custom subtypes to carry RAG context, ground truth, and expected tool calls.
/// </para>
/// <para>
/// When users call <c>agent.EvaluateAsync()</c> with AgentEval evaluators and need RAG or
/// ground truth evaluation, they pass these custom context objects.
/// </para>
/// </remarks>
public static class AdditionalContextHelper
{
/// <summary>
/// Extracts the RAG retrieval context from additional context, if provided.
/// </summary>
public static string? ExtractRAGContext(IEnumerable<MEAIEvaluationContext>? additionalContext)
{
if (additionalContext == null) return null;
foreach (var ctx in additionalContext)
{
if (ctx is AgentEvalRAGContext ragCtx)
return ragCtx.RetrievedContext;
}
return null;
}
/// <summary>
/// Extracts the ground truth from additional context, if provided.
/// </summary>
public static string? ExtractGroundTruth(IEnumerable<MEAIEvaluationContext>? additionalContext)
{
if (additionalContext == null) return null;
foreach (var ctx in additionalContext)
{
if (ctx is AgentEvalGroundTruthContext gtCtx)
return gtCtx.GroundTruth;
}
return null;
}
/// <summary>
/// Extracts expected tool names from additional context, if provided.
/// </summary>
public static IReadOnlyList<string>? ExtractExpectedTools(IEnumerable<MEAIEvaluationContext>? additionalContext)
{
if (additionalContext == null) return null;
foreach (var ctx in additionalContext)
{
if (ctx is AgentEvalExpectedToolsContext toolsCtx)
return toolsCtx.ExpectedToolNames;
}
return null;
}
}
// ═══════════════════════════════════════════════════════════════════════
// CUSTOM EVALUATION CONTEXT SUBTYPES
// ═══════════════════════════════════════════════════════════════════════
/// <summary>
/// Carries RAG retrieval context through MEAI's <c>additionalContext</c> parameter.
/// </summary>
/// <example>
/// <code>
/// var results = await agent.EvaluateAsync(queries, evaluator,
/// additionalContext: [new AgentEvalRAGContext { RetrievedContext = "..." }]);
/// </code>
/// </example>
public class AgentEvalRAGContext : MEAIEvaluationContext
{
/// <summary>The retrieved context/documents for RAG evaluation.</summary>
public required string RetrievedContext { get; init; }
}
/// <summary>
/// Carries ground truth expected output through MEAI's <c>additionalContext</c> parameter.
/// </summary>
/// <example>
/// <code>
/// var results = await agent.EvaluateAsync(queries, evaluator,
/// additionalContext: [new AgentEvalGroundTruthContext { GroundTruth = "Paris is the capital of France" }]);
/// </code>
/// </example>
public class AgentEvalGroundTruthContext : MEAIEvaluationContext
{
/// <summary>The expected ground truth answer.</summary>
public required string GroundTruth { get; init; }
}
/// <summary>
/// Carries expected tool call names through MEAI's <c>additionalContext</c> parameter.
/// </summary>
public class AgentEvalExpectedToolsContext : MEAIEvaluationContext
{
/// <summary>Names of tools expected to be called.</summary>
public required IReadOnlyList<string> ExpectedToolNames { get; init; }
}

Verify: dotnet build src/AgentEval.MAF/AgentEval.MAF.csproj
File: src/AgentEval.MAF/Evaluators/ResultConverter.cs
Why third: Depends on ScoreNormalizer (in AgentEval.Core, already referenced). No dependency on other Evaluators/ files.
Full implementation:
// SPDX-License-Identifier: MIT
// Copyright (c) 2026 AgentEval Contributors
// Licensed under the MIT License.
using Microsoft.Extensions.AI.Evaluation;
using AgentEval.Core;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Converts between AgentEval <see cref="MetricResult"/> and MEAI <see cref="EvaluationResult"/> types.
/// </summary>
/// <remarks>
/// <para>
/// AgentEval uses a 0–100 score scale. MEAI uses a 1–5 scale for <see cref="NumericMetric"/>.
/// Conversion uses <see cref="ScoreNormalizer.ToOneToFive"/> (existing utility):
/// </para>
/// <list type="table">
/// <item><term>0 → 1.0</term><description>Very Poor</description></item>
/// <item><term>25 → 2.0</term><description>Poor</description></item>
/// <item><term>50 → 3.0</term><description>Satisfactory</description></item>
/// <item><term>75 → 4.0</term><description>Good</description></item>
/// <item><term>100 → 5.0</term><description>Excellent</description></item>
/// </list>
/// <para>
/// The original 0–100 score is preserved in the metric's diagnostic message for full fidelity.
/// </para>
/// </remarks>
public static class ResultConverter
{
/// <summary>
/// Converts a single AgentEval <see cref="MetricResult"/> to an MEAI <see cref="EvaluationResult"/>
/// containing one named metric.
/// </summary>
public static EvaluationResult ToMEAI(MetricResult metricResult)
{
var result = new EvaluationResult();
AddToEvaluationResult(result, metricResult);
return result;
}
/// <summary>
/// Adds an AgentEval <see cref="MetricResult"/> as a named metric inside an existing
/// MEAI <see cref="EvaluationResult"/>. Call this multiple times for composite evaluators.
/// </summary>
/// <param name="result">The MEAI result to add the metric to.</param>
/// <param name="metricResult">The AgentEval metric result to convert.</param>
public static void AddToEvaluationResult(EvaluationResult result, MetricResult metricResult)
{
// Convert 0-100 AgentEval scale → 1-5 MEAI scale
var meaiScore = ScoreNormalizer.ToOneToFive(metricResult.Score);
var numericMetric = new NumericMetric(metricResult.MetricName, meaiScore);
// Build a diagnostic message preserving the original 0-100 score
var interpretation = $"AgentEval score: {metricResult.Score:F0}/100 " +
$"({ScoreNormalizer.Interpret(metricResult.Score)})";
if (!string.IsNullOrEmpty(metricResult.Explanation))
interpretation += $" — {metricResult.Explanation}";
// Add pass/fail status
interpretation += metricResult.Passed ? " [PASSED]" : " [FAILED]";
numericMetric.AddDiagnostic(
metricResult.Passed
? MetricResultDiagnosticSeverity.Informational
: MetricResultDiagnosticSeverity.Warning,
interpretation);
// Add detail diagnostics if available
if (metricResult.Details != null)
{
foreach (var (key, value) in metricResult.Details)
{
numericMetric.AddDiagnostic(
MetricResultDiagnosticSeverity.Informational,
$"{key}: {value}");
}
}
result.Metrics[metricResult.MetricName] = numericMetric;
}
}

Verify: dotnet build src/AgentEval.MAF/AgentEval.MAF.csproj
File: src/AgentEval.MAF/Evaluators/AgentEvalMetricAdapter.cs
Why fourth: This is the core adapter. Depends on ConversationExtractor, AdditionalContextHelper, ResultConverter (all built in steps 3–5).
Full implementation:
// SPDX-License-Identifier: MIT
// Copyright (c) 2026 AgentEval Contributors
// Licensed under the MIT License.
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using AgentEval.Core;
// Alias to avoid conflict with AgentEval.Core.EvaluationContext
using AgentEvalEvaluationContext = AgentEval.Core.EvaluationContext;
using MEAIEvaluationContext = Microsoft.Extensions.AI.Evaluation.EvaluationContext;
using MEAIIEvaluator = Microsoft.Extensions.AI.Evaluation.IEvaluator;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Wraps a single AgentEval <see cref="IMetric"/> as an MEAI <see cref="MEAIIEvaluator"/>,
/// enabling AgentEval metrics to be used inside MAF's <c>agent.EvaluateAsync()</c>
/// orchestration alongside MEAI and Foundry evaluators.
/// </summary>
/// <remarks>
/// <para>
/// This is the "light path" adapter. It performs post-mortem evaluation of a conversation
/// transcript. The adapter:
/// </para>
/// <list type="number">
/// <item>Extracts input/output from <see cref="ChatMessage"/> conversation (last-turn split)</item>
/// <item>Extracts tool usage from <see cref="FunctionCallContent"/>/<see cref="FunctionResultContent"/> pairs</item>
/// <item>Extracts optional RAG context / ground truth from <c>additionalContext</c></item>
/// <item>Builds an AgentEval <see cref="AgentEvalEvaluationContext"/></item>
/// <item>Runs the wrapped <see cref="IMetric"/></item>
/// <item>Converts <see cref="MetricResult"/> → MEAI <see cref="EvaluationResult"/></item>
/// </list>
/// <para>
/// <b>Limitations vs. deep path:</b> No streaming data (TTFT), no tool call timing,
/// no performance metrics (token counts, cost), no workflow graph.
/// For those capabilities, use <see cref="AgentEval.MAF.MAFEvaluationHarness"/>.
/// </para>
/// </remarks>
public class AgentEvalMetricAdapter : MEAIIEvaluator
{
private readonly IMetric _metric;
/// <summary>
/// Creates an adapter wrapping an AgentEval metric as an MEAI evaluator.
/// </summary>
/// <param name="metric">The AgentEval metric to wrap.</param>
public AgentEvalMetricAdapter(IMetric metric)
{
_metric = metric ?? throw new ArgumentNullException(nameof(metric));
}
/// <summary>Gets the name of the wrapped AgentEval metric.</summary>
public string MetricName => _metric.Name;
/// <inheritdoc/>
public async Task<EvaluationResult> EvaluateAsync(
IList<ChatMessage> messages,
ChatResponse response,
ChatConfiguration? chatConfiguration = null,
IEnumerable<MEAIEvaluationContext>? additionalContext = null,
CancellationToken cancellationToken = default)
{
// 1. Extract input/output from conversation (last-turn split — ADR-0020 default)
var input = ConversationExtractor.ExtractLastUserMessage(messages);
var output = response.Message?.Text ?? "";
// 2. Build AgentEval EvaluationContext with all available data
var context = new AgentEvalEvaluationContext
{
Input = input,
Output = output,
Context = AdditionalContextHelper.ExtractRAGContext(additionalContext),
GroundTruth = AdditionalContextHelper.ExtractGroundTruth(additionalContext),
ToolUsage = ConversationExtractor.ExtractToolUsage(messages, response),
ExpectedTools = AdditionalContextHelper.ExtractExpectedTools(additionalContext),
};
// 3. Run the AgentEval metric
var metricResult = await _metric.EvaluateAsync(context, cancellationToken);
// 4. Convert MetricResult → MEAI EvaluationResult
return ResultConverter.ToMEAI(metricResult);
}
}

Verify: dotnet build src/AgentEval.MAF/AgentEval.MAF.csproj
File: src/AgentEval.MAF/Evaluators/AgentEvalEvaluator.cs
Why fifth: Composite — depends on the same helpers as AgentEvalMetricAdapter, bundles N metrics.
Full implementation:
// SPDX-License-Identifier: MIT
// Copyright (c) 2026 AgentEval Contributors
// Licensed under the MIT License.
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using AgentEval.Core;
using AgentEvalEvaluationContext = AgentEval.Core.EvaluationContext;
using MEAIEvaluationContext = Microsoft.Extensions.AI.Evaluation.EvaluationContext;
using MEAIIEvaluator = Microsoft.Extensions.AI.Evaluation.IEvaluator;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Composite evaluator that bundles multiple AgentEval <see cref="IMetric"/> instances
/// into a single MEAI <see cref="MEAIIEvaluator"/>. Each metric produces a named metric
/// in the returned <see cref="EvaluationResult"/>.
/// </summary>
/// <remarks>
/// <para>
/// Use this when you want to run multiple AgentEval metrics in a single evaluator call.
/// Construct via <see cref="AgentEvalEvaluators"/> factory for preset bundles.
/// </para>
/// <para>
/// All metrics share the same <see cref="AgentEvalEvaluationContext"/> built from
/// the conversation transcript. Metrics are run sequentially to avoid contention
/// on the judge LLM (LLM-evaluated metrics make API calls).
/// </para>
/// </remarks>
public class AgentEvalEvaluator : MEAIIEvaluator
{
private readonly IReadOnlyList<IMetric> _metrics;
/// <summary>
/// Creates a composite evaluator from a collection of AgentEval metrics.
/// </summary>
/// <param name="metrics">The metrics to include in this evaluator.</param>
public AgentEvalEvaluator(IEnumerable<IMetric> metrics)
{
_metrics = (metrics ?? throw new ArgumentNullException(nameof(metrics))).ToList();
if (_metrics.Count == 0)
throw new ArgumentException("At least one metric must be provided.", nameof(metrics));
}
/// <summary>Gets the number of metrics in this evaluator.</summary>
public int MetricCount => _metrics.Count;
/// <summary>Gets the names of all included metrics.</summary>
public IEnumerable<string> MetricNames => _metrics.Select(m => m.Name);
/// <inheritdoc/>
public async Task<EvaluationResult> EvaluateAsync(
IList<ChatMessage> messages,
ChatResponse response,
ChatConfiguration? chatConfiguration = null,
IEnumerable<MEAIEvaluationContext>? additionalContext = null,
CancellationToken cancellationToken = default)
{
// 1. Extract input/output once (shared across all metrics)
var input = ConversationExtractor.ExtractLastUserMessage(messages);
var output = response.Message?.Text ?? "";
// 2. Build EvaluationContext once
var context = new AgentEvalEvaluationContext
{
Input = input,
Output = output,
Context = AdditionalContextHelper.ExtractRAGContext(additionalContext),
GroundTruth = AdditionalContextHelper.ExtractGroundTruth(additionalContext),
ToolUsage = ConversationExtractor.ExtractToolUsage(messages, response),
ExpectedTools = AdditionalContextHelper.ExtractExpectedTools(additionalContext),
};
// 3. Run all metrics sequentially, collecting results
var result = new EvaluationResult();
foreach (var metric in _metrics)
{
cancellationToken.ThrowIfCancellationRequested();
try
{
var metricResult = await metric.EvaluateAsync(context, cancellationToken);
ResultConverter.AddToEvaluationResult(result, metricResult);
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
// If a single metric fails, add a failed entry and continue
var failedResult = MetricResult.Fail(
metric.Name,
$"Metric execution failed: {ex.Message}");
ResultConverter.AddToEvaluationResult(result, failedResult);
}
}
return result;
}
}

Verify: dotnet build src/AgentEval.MAF/AgentEval.MAF.csproj
File: src/AgentEval.MAF/Evaluators/AgentEvalEvaluators.cs
Why last: This is the public API surface. Depends on AgentEvalEvaluator, AgentEvalMetricAdapter, and all metric classes.
Full implementation:
// SPDX-License-Identifier: MIT
// Copyright (c) 2026 AgentEval Contributors
// Licensed under the MIT License.
using Microsoft.Extensions.AI;
using AgentEval.Core;
using AgentEval.Metrics.Agentic;
using AgentEval.Metrics.RAG;
using AgentEval.Metrics.Safety;
using AgentEval.Metrics;
using MEAIIEvaluator = Microsoft.Extensions.AI.Evaluation.IEvaluator;
namespace AgentEval.MAF.Evaluators;
/// <summary>
/// Factory for creating AgentEval evaluators that implement MEAI's
/// <see cref="MEAIIEvaluator"/> interface.
/// </summary>
/// <remarks>
/// <para>
/// Use these with MAF's <c>agent.EvaluateAsync()</c> extension methods:
/// </para>
/// <code>
/// var results = await agent.EvaluateAsync(
/// queries: ["What's the weather in Seattle?"],
/// evaluator: AgentEvalEvaluators.Quality(judgeClient));
/// results.AssertAllPassed();
/// </code>
/// <para>
/// Mirrors the discoverability pattern of MAF's <c>FoundryEvals.RELEVANCE</c>.
/// </para>
/// </remarks>
public static class AgentEvalEvaluators
{
// ═══════════════════════════════════════════════════════════════════
// PRESET BUNDLES — return AgentEvalEvaluator (composite)
// ═══════════════════════════════════════════════════════════════════
/// <summary>
/// Quality metrics bundle: faithfulness, relevance, coherence, fluency.
/// All work fully through the light path (text-only evaluation).
/// </summary>
/// <param name="judgeClient">The LLM chat client used as the evaluation judge.</param>
public static AgentEvalEvaluator Quality(IChatClient judgeClient) => new([
new FaithfulnessMetric(judgeClient),
new RelevanceMetric(judgeClient),
new CoherenceMetric(judgeClient),
new FluencyMetric(judgeClient)]);
/// <summary>
/// RAG metrics bundle: faithfulness, relevance, context precision, context recall, answer correctness.
/// Requires <see cref="AgentEvalRAGContext"/> and/or <see cref="AgentEvalGroundTruthContext"/>
/// passed via <c>additionalContext</c> for full effectiveness.
/// </summary>
/// <param name="judgeClient">The LLM chat client used as the evaluation judge.</param>
public static AgentEvalEvaluator RAG(IChatClient judgeClient) => new([
new FaithfulnessMetric(judgeClient),
new RelevanceMetric(judgeClient),
new ContextPrecisionMetric(judgeClient),
new ContextRecallMetric(judgeClient),
new AnswerCorrectnessMetric(judgeClient)]);
/// <summary>
/// Agentic metrics bundle: tool success rate.
/// Tool names and arguments are extracted from <see cref="FunctionCallContent"/>
/// in conversation messages. Tool timing is NOT available (light path limitation).
/// </summary>
public static AgentEvalEvaluator Agentic() => new([
new ToolSuccessMetric()]);
/// <summary>
/// Agentic metrics bundle with expected tools: tool success, tool selection, tool arguments.
/// </summary>
/// <param name="expectedTools">The tool names expected to be called by the agent.</param>
public static AgentEvalEvaluator Agentic(IEnumerable<string> expectedTools) => new([
new ToolSuccessMetric(),
new ToolSelectionMetric(expectedTools),
new ToolArgumentsMetric(expectedTools)]);
/// <summary>
/// Safety metrics bundle: toxicity, bias, misinformation.
/// All work fully through the light path (text-only evaluation).
/// </summary>
/// <param name="judgeClient">The LLM chat client used as the evaluation judge.</param>
public static AgentEvalEvaluator Safety(IChatClient judgeClient) => new([
new ToxicityMetric(judgeClient),
new BiasMetric(judgeClient),
new MisinformationMetric(judgeClient)]);
/// <summary>
/// All available metrics combined (quality + agentic + safety + task completion).
/// The most comprehensive single-call evaluation.
/// </summary>
/// <param name="judgeClient">The LLM chat client used as the evaluation judge.</param>
public static AgentEvalEvaluator Advanced(IChatClient judgeClient) => new([
new FaithfulnessMetric(judgeClient),
new RelevanceMetric(judgeClient),
new CoherenceMetric(judgeClient),
new FluencyMetric(judgeClient),
new GroundednessMetric(judgeClient),
new ToolSuccessMetric(),
new TaskCompletionMetric(judgeClient),
new ToxicityMetric(judgeClient),
new BiasMetric(judgeClient),
new MisinformationMetric(judgeClient)]);
// ═══════════════════════════════════════════════════════════════════
// INDIVIDUAL METRICS — return MEAIIEvaluator (single metric)
// Use these when mixing AgentEval metrics with MEAI/Foundry evaluators.
// ═══════════════════════════════════════════════════════════════════
/// <summary>Faithfulness: is the response grounded in the provided context?</summary>
public static MEAIIEvaluator Faithfulness(IChatClient c) => Adapt(new FaithfulnessMetric(c));
/// <summary>Relevance: does the response address the user's question?</summary>
public static MEAIIEvaluator Relevance(IChatClient c) => Adapt(new RelevanceMetric(c));
/// <summary>Coherence: is the response logically consistent?</summary>
public static MEAIIEvaluator Coherence(IChatClient c) => Adapt(new CoherenceMetric(c));
/// <summary>Fluency: is the response grammatically correct and natural?</summary>
public static MEAIIEvaluator Fluency(IChatClient c) => Adapt(new FluencyMetric(c));
/// <summary>Groundedness: does the response avoid unsubstantiated claims?</summary>
public static MEAIIEvaluator Groundedness(IChatClient c) => Adapt(new GroundednessMetric(c));
/// <summary>Tool success: did all invoked tools execute without errors?</summary>
public static MEAIIEvaluator ToolSuccess() => Adapt(new ToolSuccessMetric());
/// <summary>Task completion: did the agent accomplish the requested task?</summary>
public static MEAIIEvaluator TaskCompletion(IChatClient c) => Adapt(new TaskCompletionMetric(c));
/// <summary>Toxicity: does the response contain harmful content?</summary>
public static MEAIIEvaluator Toxicity(IChatClient c) => Adapt(new ToxicityMetric(c));
/// <summary>Bias: does the response contain biased content?</summary>
public static MEAIIEvaluator Bias(IChatClient c) => Adapt(new BiasMetric(c));
/// <summary>Misinformation: does the response contain false claims?</summary>
public static MEAIIEvaluator Misinformation(IChatClient c) => Adapt(new MisinformationMetric(c));
// ═══════════════════════════════════════════════════════════════════
// CUSTOM COMPOSITION
// ═══════════════════════════════════════════════════════════════════
/// <summary>
/// Creates a custom composite evaluator from specific AgentEval metrics.
/// </summary>
public static AgentEvalEvaluator Custom(params IMetric[] metrics) => new(metrics);
/// <summary>
/// Wraps any AgentEval <see cref="IMetric"/> as an individual MEAI evaluator.
/// </summary>
public static MEAIIEvaluator Adapt(IMetric metric) => new AgentEvalMetricAdapter(metric);
}

Verify: dotnet build src/AgentEval.MAF/AgentEval.MAF.csproj — this is the final build check. All 6 files should compile together.
Directory: tests/AgentEval.Tests/MAF/Evaluators/ (create if needed)
9.1 — ConversationExtractorTests.cs
Test cases:
- ExtractLastUserMessage_WithMultipleMessages_ReturnsLastUserMessage
- ExtractLastUserMessage_WithNoUserMessages_ReturnsFirstMessage
- ExtractLastUserMessage_WithEmptyList_ReturnsEmptyString
- ExtractToolUsage_WithToolCalls_ReturnsPairedRecords
- ExtractToolUsage_WithNoToolCalls_ReturnsNull
- ExtractToolUsage_WithUnmatchedResult_CreatesStandaloneRecord
- ExtractToolUsage_WithUnmatchedCall_IncludesAsPending
- ExtractToolUsage_PreservesCallOrder
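The first case might be sketched as follows (hypothetical; assumes xUnit, matching the message-construction style used in the integration test later in this plan):

```csharp
using Microsoft.Extensions.AI;
using Xunit;
using AgentEval.MAF.Evaluators;

public class ConversationExtractorTests
{
    [Fact]
    public void ExtractLastUserMessage_WithMultipleMessages_ReturnsLastUserMessage()
    {
        var messages = new List<ChatMessage>
        {
            new(ChatRole.User, "First question"),
            new(ChatRole.Assistant, "First answer"),
            new(ChatRole.User, "Follow-up question"),
        };

        // Last-turn split: the most recent user message wins.
        Assert.Equal("Follow-up question",
            ConversationExtractor.ExtractLastUserMessage(messages));
    }
}
```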
9.2 — ResultConverterTests.cs
Test cases:
- ToMEAI_WithPassingResult_ReturnsNumericMetricWithCorrectScore
- ToMEAI_Score0_MapsTo1 (verify ScoreNormalizer.ToOneToFive(0) == 1.0)
- ToMEAI_Score50_MapsTo3 (verify midpoint)
- ToMEAI_Score100_MapsTo5 (verify max)
- ToMEAI_WithExplanation_IncludesDiagnostic
- ToMEAI_WithFailingResult_HasWarningDiagnostic
- AddToEvaluationResult_MultipleCalls_AddsMultipleMetrics
9.3 — AgentEvalMetricAdapterTests.cs
Test cases:
- EvaluateAsync_WithTextOnlyMetric_ExtractsInputOutputCorrectly
- EvaluateAsync_WithToolUsageMetric_ExtractsToolCalls
- EvaluateAsync_WithRAGContext_PassesContextToMetric
- EvaluateAsync_WithGroundTruth_PassesGroundTruthToMetric
- EvaluateAsync_WithNoAdditionalContext_NullContextAndGroundTruth
Use a mock IMetric that records the EvaluationContext it receives, then assert the context fields.
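A minimal recording mock might look like this (a sketch; assumes IMetric exposes Name and EvaluateAsync(EvaluationContext, CancellationToken) as used by the adapter above; the MetricResult.Fail factory is the only one shown in this plan, so swap in the project's real passing constructor if one exists):

```csharp
using AgentEval.Core;

// Hypothetical test double: records the EvaluationContext it receives so
// tests can assert on Input, Output, Context, GroundTruth, ToolUsage, etc.
internal sealed class RecordingMetric : IMetric
{
    public string Name => "recording";

    public EvaluationContext? ReceivedContext { get; private set; }

    public Task<MetricResult> EvaluateAsync(
        EvaluationContext context,
        CancellationToken cancellationToken = default)
    {
        ReceivedContext = context;
        // Pass/fail is irrelevant for these tests; Fail(...) is used only
        // because it is the factory shown earlier in this plan.
        return Task.FromResult(MetricResult.Fail(Name, "recorded by mock"));
    }
}
```

An adapter test can then run EvaluateAsync through the adapter and assert on mock.ReceivedContext?.Input, .Context, .GroundTruth, and so on.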
9.4 — AgentEvalEvaluatorTests.cs
Test cases:
- EvaluateAsync_WithMultipleMetrics_ReturnsAllMetricResults
- EvaluateAsync_WithFailingMetric_ContinuesAndIncludesFailure
- EvaluateAsync_WithCancellation_ThrowsOperationCanceled
- MetricCount_ReturnsCorrectCount
- Constructor_WithEmptyMetrics_ThrowsArgumentException
9.5 — AgentEvalEvaluatorsTests.cs
Test cases:
- Quality_ReturnsEvaluatorWithFourMetrics
- RAG_ReturnsEvaluatorWithFiveMetrics
- Safety_ReturnsEvaluatorWithThreeMetrics
- Advanced_ReturnsEvaluatorWithTenMetrics
- Agentic_WithExpectedTools_ReturnsEvaluatorWithThreeMetrics
- Individual_Faithfulness_ReturnsAdapterWrappingFaithfulnessMetric
- Custom_WithArbitraryMetrics_CreatesEvaluator
- Adapt_WrapsAnyIMetricAsIEvaluator
Verify: dotnet test tests/AgentEval.Tests/ --filter "FullyQualifiedName~MAF.Evaluators"
Create a minimal integration test that verifies the full flow without a live LLM.
File: tests/AgentEval.Tests/MAF/Evaluators/LightPathIntegrationTests.cs
[Fact]
public async Task LightPath_ToolSuccessMetric_ExtractsToolCallsFromConversation()
{
// Arrange — simulate a conversation with tool calls (no LLM needed)
var messages = new List<ChatMessage>
{
new(ChatRole.User, "What's the weather in Seattle?"),
new(ChatRole.Assistant, [new FunctionCallContent("call-1", "get_weather",
new Dictionary<string, object?> { ["city"] = "Seattle" })]),
new(ChatRole.Tool, [new FunctionResultContent("call-1", "get_weather",
result: "72°F and sunny")]),
};
var response = new ChatResponse(new ChatMessage(ChatRole.Assistant,
"The weather in Seattle is 72°F and sunny."));
// Act — run ToolSuccessMetric through the light path
var evaluator = AgentEvalEvaluators.ToolSuccess();
var result = await evaluator.EvaluateAsync(messages, response);
// Assert — metric should have produced a result
Assert.True(result.Metrics.ContainsKey("code_tool_success"));
var metric = result.Metrics["code_tool_success"];
Assert.IsType<NumericMetric>(metric);
var numericMetric = (NumericMetric)metric;
// Score should be 5.0 (100/100 → 5.0 on MEAI scale) — all tools succeeded
Assert.Equal(5.0, numericMetric.Value, precision: 1);
}

Verify: dotnet test tests/AgentEval.Tests/ --filter "LightPathIntegrationTests"
Run through each verification in order:
| # | Check | Command | Expected |
|---|---|---|---|
| 1 | Full solution builds | dotnet build AgentEval.sln | 0 errors |
| 2 | All existing tests pass | dotnet test tests/AgentEval.Tests/ | No regressions |
| 3 | New unit tests pass | dotnet test --filter "MAF.Evaluators" | All green |
| 4 | Integration test passes | dotnet test --filter "LightPathIntegrationTests" | Green |
| 5 | No new warnings | Check build output | Clean or pre-existing only |
| 6 | Code style | Consistent with existing MicrosoftEvaluatorAdapter.cs | Same patterns |
| # | File | Lines (approx) | Depends On |
|---|---|---|---|
| 1 | src/AgentEval.MAF/Evaluators/ConversationExtractor.cs | ~120 | MEAI types, ToolUsageReport, ToolCallRecord |
| 2 | src/AgentEval.MAF/Evaluators/AdditionalContextHelper.cs | ~100 | MEAI EvaluationContext |
| 3 | src/AgentEval.MAF/Evaluators/ResultConverter.cs | ~80 | ScoreNormalizer, MEAI NumericMetric |
| 4 | src/AgentEval.MAF/Evaluators/AgentEvalMetricAdapter.cs | ~90 | Steps 1–3 |
| 5 | src/AgentEval.MAF/Evaluators/AgentEvalEvaluator.cs | ~100 | Steps 1–3 |
| 6 | src/AgentEval.MAF/Evaluators/AgentEvalEvaluators.cs | ~130 | Steps 4–5, all metric classes |
| | Total new production code | ~620 | |
| 7 | AgentEval.MAF.csproj (modified) | +1 line | Package reference |
| 8 | Unit + integration tests | ~300 | All above |
Meeting context: In-person or Teams call with Ben Thompson (MAF team lead for evaluation). Duration: 12 minutes presentation + 8 minutes discussion = 20 minutes total. Goal: Show working alignment, demonstrate value, secure collaboration commitment.
Visual: AgentEval logo + MAF logo side by side. Title: "AgentEval + Agent Framework: Better Together"
Text on slide:
- AgentEval: The .NET Evaluation Toolkit for AI Agents
- Integration with Agent Framework Evaluation (ADR-0020)
- [Your name] — [Date]
No script — just visible while you set up.
Visual: Code snippet centered on slide, large font:
// Before: MEAI evaluators only
var results = await agent.EvaluateAsync(queries,
new RelevanceEvaluator());
// Now: AgentEval metrics plug in directly
var results = await agent.EvaluateAsync(queries,
AgentEvalEvaluators.Quality(judgeClient));
results.AssertAllPassed();

Script (read aloud, conversational):
"Ben, first — thank you for sharing the ADR. I want to start by showing you what we've already built.
After reading your ADR, we implemented what we're calling the 'light path.' AgentEval's metrics now implement MEAI's IEvaluator directly. Here's what it looks like.

[Point to code] Same agent.EvaluateAsync() you designed. Same AssertAllPassed() pattern. But behind that evaluator, there are four AgentEval metrics running — faithfulness, relevance, coherence, fluency — all powered by LLM-as-judge.

We have 19 metrics total across quality, RAG, agentic, safety, and responsible AI. Nine of them work fully through this path out of the box. Another nine work with data enrichment. We also have preset bundles — .Quality(), .RAG(), .Safety(), .Agentic() — that mirror your FoundryEvals.RELEVANCE discoverability pattern.

And it mixes seamlessly:"
Show second snippet (same slide or transition):
var results = await agent.EvaluateAsync(queries,
evaluators: [
new RelevanceEvaluator(), // MEAI native
AgentEvalEvaluators.Safety(judgeClient), // AgentEval bundle
new FoundryEvals(projectClient, "gpt-4o"), // Foundry cloud
]);

"MEAI, AgentEval, and Foundry — all in one call. One ecosystem. This is live, it compiles, it runs."
Visual: Two-column comparison table, highlighted cells:
| Capability | Light Path (IEvaluator) | Deep Path (AgentEval Engine) |
|---|---|---|
| Quality metrics (text) | Full | Full |
| Safety metrics (text) | Full | Full |
| Tool call names/args | Yes | Yes |
| Tool call timing | -- | Per-tool start/end/duration |
| Time-to-first-token | -- | Via streaming |
| Token counts & cost | -- | Actual or estimated |
| Workflow graph | -- | Nodes, edges, traversal |
| Per-executor breakdown | -- | Steps, tools per executor |
| Stochastic (N runs + P90) | -- | Full statistics |
| Model comparison | -- | Rankings + winner |
| Red teaming | -- | 192 probes, OWASP/MITRE |
| Failure root cause | -- | FailureReport + suggestions |
| CI export | Portal | JUnit, SARIF, PDF |
Script:
"Now, here's the honest picture. The light path — plugging into IEvaluator — gives you the conversation transcript after the fact. That's enough for quality and safety metrics. But it's post-mortem evaluation.

[Point to right column] AgentEval also has what we call the 'deep path.' This wraps your MAF agents and workflows, and captures data that the IEvaluator contract physically cannot carry. Time-to-first-token from streaming. Per-tool start and end timestamps with durations. Full workflow graph extraction — we use your Workflow.ReflectEdges() API to build graph snapshots with edge traversal tracking. Per-executor breakdowns with tool calls per agent. Stochastic evaluation with P90 and standard deviation. Model comparison with rankings.

We already have 2,500 lines of deep MAF integration code doing this — including handling your ChatProtocol detection and TurnToken sending, which is non-trivial.

Both paths share the same metrics. The difference is what data the metrics receive."
Visual: Clean architecture diagram:
┌─────────────────────────────────────────────────────┐
│ MAF User Code │
│ │
│ Light: agent.EvaluateAsync(q, AgentEvalEvals...) │
│ Deep: harness.RunEvaluationStreamingAsync(...) │
│ Deep: wfHarness.RunWorkflowTestAsync(...) │
└────────┬──────────────────────────┬──────────────────┘
│ │
MAF Orchestration AgentEval Engine
(runs agent, (streaming, tool
calls IEvaluator) timelines, graphs)
│ │
└──────────┬───────────────┘
│
AgentEval IMetric
(19 metrics, same code)
│
┌──────────┴──────────┐
│ Result Interop │
│ TestResult ↔ MEAI │
│ JUnit/SARIF/PDF │
└─────────────────────┘
Script:
"Here's the architecture. Two paths, one metric layer.
[Point to left] The light path uses your orchestration. MAF runs the agent, calls our evaluator, gets results. Zero new concepts for your users.
[Point to right] The deep path uses our engine. Same metrics, but fed with richer data — streaming chunks, tool timelines, workflow events.
[Point to bottom] Both paths produce interoperable results. Deep path results convert to AgentEvaluationResults so users can call .AssertAllPassed(). Either path feeds into our export pipeline — JUnit XML for CI, SARIF for GitHub security tab, PDF for executive reporting.

No duplication. Your orchestration handles agent invocation. Our engine handles evaluation intelligence. Clean separation."
Visual: Two columns: "What AgentEval Gets" / "What MAF Gets"
| AgentEval Gets | MAF Gets |
|---|---|
| First-class integration with MAF ecosystem | 19 additional metrics beyond MEAI's 6 quality evaluators |
| Visibility to MAF users | LLM-as-judge evaluation (faithfulness, task completion) with no extra work |
| Alignment with Foundry pattern | Agentic metrics (tool selection, tool success, tool arguments) |
| Deep path validated by real MAF workflows | Safety metrics (toxicity, bias, misinformation) not in Foundry |
| | RAG metrics (context precision/recall, answer correctness) |
| | Red teaming — 192 probes, OWASP LLM Top 10 |
| | Memory evaluation — benchmark suite for agents with memory |
| | Stochastic evaluation — statistical analysis your num_repetitions doesn't compute |
| | Model comparison — side-by-side with rankings |
| | CI/CD export — JUnit, SARIF, PDF beyond Foundry portal |
Script:
"I want to be clear about why this is a win for MAF, not just for AgentEval.
Right now, MAF ships with MEAI's quality evaluators — relevance, coherence, fluency, groundedness, equivalence, completeness. That's six metrics. Foundry adds cloud-based evaluators but requires Azure setup.
[Point to right column] With AgentEval plugging in via IEvaluator, your users instantly get 19 more metrics. Faithfulness that catches hallucinations. Tool selection and success metrics for agentic workflows. Toxicity, bias, and misinformation detection. RAG evaluation with context precision and recall.

And these aren't toys. They're battle-tested, with calibrated multi-judge consensus and stochastic evaluation, and the deep path adds workflow-level analysis that no other .NET evaluation tool provides.

Your LocalEvaluator provides lambda-based checks — keyword checks, tool-called checks. AgentEval provides the LLM-as-judge intelligence layer above that.

This is not AgentEval competing with MAF evaluation. It's AgentEval extending what MAF can offer. Your users write agent.EvaluateAsync() — and they get a much richer evaluation story."
Visual: Numbered list with icons
Script:
"Here's what I'd like to collaborate on. Five concrete items.
One — AgentEvaluationResults type availability. We need it to convert deep path results into MAF-compatible format. Is it public in the current RC? If not, when can we expect it?

Two — co-define the additionalContext protocol. MEAI's IEvaluator has an additionalContext parameter. Right now there's no standard schema for passing ground truth, RAG context, or expected tool calls through it. We've defined AgentEvalGroundTruthContext, AgentEvalRAGContext, and AgentEvalExpectedToolsContext. If we align on this, any evaluator — not just AgentEval — can use it. This benefits the whole ecosystem.

Three — IConversationSplitter alignment. Your ADR defines last-turn, full, and per-turn split strategies. We should align on the interface so our implementations are interchangeable.

Four — this one benefits MAF directly: tool call timing in the orchestration layer. Right now, AgentEvaluationExtensions runs the agent and collects the response, but doesn't timestamp individual tool calls. Even a lightweight Stopwatch around each tool invocation would let evaluators — ours and others — get tool durations. This makes agentic evaluation much more useful through the light path. We're happy to contribute this PR.

Five — upstream contributions. Our ConversationExtractor and expected tool call validation logic could benefit MAF's LocalEvaluator and EvalChecks. We'd like to discuss contributing some of this."
Visual: Simple timeline
✅ Done Now Next 2 weeks After
─────────────┬─────────────┬─────────────────┬──────────
Light path │ This │ Result interop │ Upstream
implemented │ meeting │ (TestResult ↔ │ PRs to
│ │ AgentEvalResults)│ MAF
│ │ │
│ │ Conversation │ Deep path
│ │ split alignment │ extensions
│ │ │ with ADR-
│ │ additionalCtx │ 0020
│ │ schema co-def │ concepts
Script:
"To close — here's where we are and where we're going.
The light path is implemented. AgentEval metrics work as MEAI IEvaluators today.

Next two weeks: result interoperability so deep path results convert to AgentEvaluationResults, conversation split alignment, and co-defining the additionalContext schema.

After that: upstream PRs to MAF — tool call timing, conversation extraction utilities — and extending our deep path with ADR-0020 concepts.
I think we're converging on something really strong here. AgentEval as the advanced evaluation engine behind MAF's native evaluation primitives. One ecosystem for .NET agent evaluation."
Prepared responses for likely questions:
Q: "Why not just contribute your metrics directly to MEAI?"
"Two reasons. First, many of our metrics are LLM-as-judge — they require an IChatClient for the judge model. MEAI's built-in evaluators are self-contained. Second, our value isn't just individual metrics — it's the evaluation engine: stochastic evaluation, calibrated multi-judge, model comparison, red teaming, memory benchmarks. Those don't fit into a single IEvaluator implementation. The light path gives you our metrics. The deep path gives you our engine."
Q: "Is there overlap with Foundry Evals?"
"Complementary, not overlapping. Foundry gives you cloud-scale evaluation with dashboard views — great for production monitoring. AgentEval gives you developer-time evaluation — fast local feedback, CI integration, detailed debugging. Your LocalEvaluator pattern is exactly right for this. We extend it with LLM-as-judge intelligence."
Q: "What about the IAgentEvaluator interface mentioned in the ADR for .NET?"
"The ADR mentions IAgentEvaluator in the Python ↔ .NET mapping table as the .NET equivalent of Python's Evaluator protocol — for batch-level evaluation across items. If you ship this interface, we'd implement it too. Our AgentEvalEvaluator already bundles metrics — adding batch semantics is straightforward."
Q: "Can you show the deep path working with our workflows?"
"Yes. We have MAFWorkflowAdapter.FromMAFWorkflow(), which takes a built Workflow, streams events via InProcessExecution.RunStreamingAsync(), and captures per-executor steps, edge traversals, tool calls, and graph structure. We handle the ChatProtocol detection and TurnToken sending. Happy to do a deeper demo."
Q: "What about performance overhead of running 19 metrics?"
"The code-based metrics (tool selection, tool success, tool arguments, tool efficiency) are essentially free — pure in-memory computation. The LLM-as-judge metrics make API calls, so the cost is proportional to the number of LLM-evaluated metrics you select. That's why we have preset bundles — .Quality() is 4 metrics, .Safety() is 3. Users pick what they need. We also have CalibratedEvaluator, which optimizes multi-judge costs."
Use these prompts with an image generation tool (DALL-E, Midjourney, or a design tool like Figma/Canva). Each prompt is designed for a clean, professional, tech-presentation style.
Image prompt:
A clean, modern tech presentation title slide. Two abstract geometric logos sit side by side in the center — one representing "Agent Framework" (a stylized interconnected node graph in Microsoft blue #0078D4) and one representing "AgentEval" (a stylized checkmark inside a hexagonal gauge in teal #00B294). Between them, a glowing bridge or connection line links the two logos, symbolizing integration. Background is dark navy (#1B1B2F) with subtle circuit-board-like traces fading into the edges. No text in the image — text will be overlaid.
Text to overlay:
AgentEval + Agent Framework
Better Together
Advanced Evaluation for the .NET Agent Ecosystem
[Your name] — 2026-03-26
Design notes: Minimal, high-contrast, no clutter. The bridge visual between logos is the anchor — it communicates "integration" instantly.
No image prompt needed — this is a code-on-dark-background slide. Use your IDE or a syntax-highlighted code block on dark theme. The code IS the visual.
Image prompt:
A professional infographic-style comparison visual for a tech presentation. Split vertically into two columns. LEFT column labeled "Light Path" shows a simple, clean pipeline: a document icon flowing through a single magnifying glass into a checkmark — representing post-mortem transcript evaluation. It should feel light, simple, fast. RIGHT column labeled "Deep Path" shows a rich, layered pipeline: a streaming data flow with multiple parallel tracks — a stopwatch (TTFT), a wrench with clock overlay (tool timing), a network graph (workflow), bar charts (statistics), and a shield (red teaming). The right side should feel deep, comprehensive, powerful. Use a consistent color scheme: left in light blue/silver tones, right in deep teal/gold tones. Dark background (#1B1B2F). No text labels in the image — text will be overlaid as a table.
Table content to overlay on or below the image:
| Capability | Light Path | Deep Path |
|---|---|---|
| Quality metrics (text) | Full | Full |
| Safety metrics (text) | Full | Full |
| Tool call names & args | Yes | Yes |
| Tool call timing | -- | Per-tool start/end/duration |
| Time-to-first-token | -- | Via streaming |
| Token counts & cost | -- | Actual or estimated |
| Workflow graph | -- | Nodes, edges, traversal |
| Per-executor breakdown | -- | Steps, tools per executor |
| Stochastic (N runs + P90) | -- | Full statistics |
| Model comparison | -- | Rankings + winner |
| Red teaming | -- | 192 probes, OWASP/MITRE |
| Failure root cause | -- | FailureReport + suggestions |
| CI export | Portal only | JUnit, SARIF, PDF |
Design notes: The table should use green checkmarks for "Full"/"Yes", gray dashes for "--", and teal text for the deep path descriptions. Row striping for readability.
Image prompt:
A clean, modern software architecture diagram for a tech presentation on dark background (#1B1B2F). At the top, a wide rounded rectangle labeled "Developer Code" in white. Two paths descend from it: LEFT path goes through a box labeled "MAF Orchestration" (in Microsoft blue #0078D4) — this box is thin and simple, representing lightweight orchestration. RIGHT path goes through a box labeled "AgentEval Engine" (in teal #00B294) — this box is thicker and richer, with subtle icons inside representing streaming, graphs, and statistics. Both paths converge into a central diamond shape labeled "AgentEval IMetric — 19 metrics" in gold (#FFB900). Below the diamond, a final box labeled "Result Interop" shows two output arrows: one going left to "AgentEvaluationResults" and one going right to "JUnit / SARIF / PDF". Use clean connecting lines with arrowheads. Modern flat design, no 3D effects. Subtle glow on the convergence point to emphasize the shared metric layer.
Text labels for the diagram (overlay or embedded):
Developer Code
/ \
Light Path Deep Path
| |
MAF Orchestration AgentEval Engine
(runs agent, (streaming, tool
calls IEvaluator) timelines, graphs)
\ /
\ /
AgentEval IMetric Layer
(19 metrics, shared)
|
Result Interop
/ \
AgentEvaluationResults JUnit/SARIF/PDF
.AssertAllPassed() CI/CD Export
Design notes: The key visual moment is the convergence — two paths, one metric layer. The left path should look simple/thin, the right path should look rich/layered. Both feed into the same core.
Image prompt:
A professional presentation slide showing a win-win value exchange between two entities. On the LEFT side, a teal hexagonal badge labeled "AgentEval" with 3–4 small benefit icons flowing INTO it (integration, visibility, alignment, validation). On the RIGHT side, a blue hexagonal badge labeled "MAF" with 8–10 benefit icons flowing INTO it (metrics, safety, RAG, red team, memory, stochastic, comparison, export). The RIGHT side intentionally has MORE items flowing in — the visual weight should clearly show that MAF gets significant value. A golden handshake or bridge icon connects the two badges in the center. Dark background (#1B1B2F). Clean, infographic style. No text inside the image — text will be overlaid.
Content to overlay (two-column layout):
AgentEval Gets:
- First-class integration with .NET agent ecosystem
- Visibility to MAF's user base
- Alignment with Foundry evaluation pattern
- Deep path validated against real MAF workflows
MAF Gets:
- +19 metrics beyond MEAI's 6 (faithfulness, relevance, coherence, fluency, groundedness, task completion, tool selection, tool success, tool arguments, toxicity, bias, misinformation, context precision, context recall, answer correctness, and more)
- LLM-as-judge evaluation — no extra infrastructure
- Agentic metrics — tool selection, success, arguments
- Safety metrics — toxicity, bias, misinformation (not in Foundry)
- RAG evaluation — context precision/recall, answer correctness
- Red teaming — 192 probes, OWASP LLM Top 10, MITRE ATLAS
- Memory evaluation — benchmark suite for agents with memory
- Stochastic analysis — P90, std dev, success rate beyond num_repetitions
- Model comparison — side-by-side rankings with cost/latency
- CI/CD export — JUnit XML, SARIF (GitHub), PDF reports
Design notes: The asymmetry is intentional and powerful — AgentEval gets 4 things, MAF gets 10+. This visually communicates "this is not a balanced trade — MAF gets the bigger win." That's the right message.
Image prompt:
A professional presentation slide showing 5 numbered collaboration items as a visual roadmap. Style: clean cards or tiles arranged in a slight arc or staircase pattern, numbered 1–5. Each card has a small icon and a one-line label. Card 1: a box/package icon — "AgentEvaluationResults availability". Card 2: a handshake icon — "additionalContext protocol". Card 3: a scissors/split icon — "IConversationSplitter alignment". Card 4: a stopwatch icon — "Tool call timing in orchestration". Card 5: a git-merge/PR icon — "Upstream contributions". Each card has two small badges below it: one teal badge labeled "AgentEval" and one blue badge labeled "MAF", indicating who benefits. Cards 2, 3, 4, and 5 should show BOTH badges (mutual benefit). Card 1 shows only the teal badge (primarily AgentEval). Dark background (#1B1B2F). Modern flat design.
Content for each card (overlay):
| # | Ask | Benefits AgentEval | Benefits MAF |
|---|---|---|---|
| 1 | AgentEvaluationResults type — is it public in RC? | Yes — needed for result interop | -- |
| 2 | Co-define additionalContext schema (ground truth, RAG context, expected tools) | Yes — enables RAG/agentic metrics through light path | Yes — any evaluator can use it, not just AgentEval |
| 3 | IConversationSplitter alignment (last-turn, full, per-turn) | Yes — interchangeable implementations | Yes — consistent split behavior across all evaluators |
| 4 | Tool call timing in MAF orchestration layer (Stopwatch around tool invocations) | Yes — makes agentic metrics richer through light path | Yes — ALL evaluators get tool duration data, not just AgentEval |
| 5 | Upstream PRs: ConversationExtractor, expected tool validation → MAF LocalEvaluator | Yes — code contribution visibility | Yes — MAF's LocalEvaluator gets richer built-in utilities |
Key message: 4 out of 5 asks benefit MAF directly. This is not a one-sided request.
Image prompt:
A clean horizontal timeline for a tech presentation. Three phases, left to right. PHASE 1 (leftmost, marked with a green checkmark): "Done" — a completed milestone marker. PHASE 2 (center, marked with a blue "now" dot): "Next 2 Weeks" — an active work zone with 3 small task cards below it. PHASE 3 (rightmost, marked with a teal future icon): "After" — a forward-looking zone with 2 task cards below it. The timeline line is a gradient from green (done) through blue (now) to teal (future). Each phase card has a subtle glow matching its phase color. Dark background (#1B1B2F). Modern, minimal, no clutter.
Content to overlay:
Phase 1: DONE ✅ Phase 2: NEXT 2 WEEKS Phase 3: AFTER
─────────────────────────────────────────────────────────────────────────────────
Light path implemented Result interop Upstream PRs to MAF
AgentEval metrics as TestResult ↔ Tool call timing
MEAI IEvaluator AgentEvaluationResults ConversationExtractor
EvalChecks utilities
Conversation split
alignment with Deep path extensions
ADR-0020 strategies with ADR-0020 concepts
additionalContext
schema co-definition
(with MAF team)
Design notes: The green/done phase should feel solid and stable. The blue/now phase should feel active and in-motion. The teal/future phase should feel open and aspirational. The visual tells the story: "we've already started, we have a plan, and the future is collaborative."
- Background: Dark navy #1B1B2F across all slides for consistency
- Primary text: White #FFFFFF
- AgentEval accent: Teal #00B294
- MAF/Microsoft accent: Blue #0078D4
- Highlight/emphasis: Gold #FFB900
- Font: Segoe UI or Inter — clean, modern, Microsoft-ecosystem-native
- Code blocks: Use dark theme syntax highlighting (VS Code Dark+ or similar)
- No decorative elements — every visual element should communicate information
- Slide count: 7 slides for 12 minutes = ~1.7 min per slide average. Comfortable pace.