Coming from another evaluation framework? This guide helps you transition to AgentEval with code examples and pattern translations.
If you're a .NET team evaluating AI agents, you may have started with a Python or Node.js framework. AgentEval brings that functionality to your native stack with some unique capabilities:
- Native .NET - No Python interop, no Node.js subprocess
- Fluent assertions - Express complex agent behavior evaluations naturally
- Tool-aware evaluation - First-class support for agentic tool calls
- Stochastic evaluation - Built-in statistics for handling LLM non-determinism
- Trace record/replay - Deterministic CI without API costs
If you've been using Python frameworks for LLM/agent evaluation, here's how AgentEval maps to familiar concepts.
| Python Concept | AgentEval Equivalent |
|---|---|
| Faithfulness metric | FaithfulnessMetric |
| Relevance/Answer relevance | RelevanceMetric |
| Context precision | ContextPrecisionMetric |
| Context recall | ContextRecallMetric |
| Answer correctness | AnswerCorrectnessMetric |
| Custom LLM judge | Implement IMetric or IRAGMetric |
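The guide doesn't show the `IMetric` signature itself, so the shapes below are assumptions sketched for illustration; check AgentEval's actual `IMetric` and `EvaluationContext` definitions before copying. The judge here is a trivial deterministic keyword-overlap scorer rather than an LLM call, but the plumbing is the same:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical type shapes — not AgentEval's real definitions.
var metric = new KeywordOverlapMetric();
var result = await metric.EvaluateAsync(
    new EvaluationContext(
        Input: "What is the return policy?",
        Output: "Returns are accepted for 30 days with receipt.",
        GroundTruth: "30 days"));
Console.WriteLine($"{result.Name}: {result.Score}"); // KeywordOverlap: 100

record EvaluationContext(string Input, string Output, string? GroundTruth);
record MetricResult(string Name, double Score, string? Reason);

interface IMetric
{
    Task<MetricResult> EvaluateAsync(EvaluationContext context);
}

// A deterministic judge: same plumbing an LLM judge would use,
// but scored by keyword overlap instead of a model call.
sealed class KeywordOverlapMetric : IMetric
{
    public Task<MetricResult> EvaluateAsync(EvaluationContext context)
    {
        var expected = (context.GroundTruth ?? "")
            .Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var hits = expected.Count(w =>
            context.Output.Contains(w, StringComparison.OrdinalIgnoreCase));
        var score = expected.Length == 0 ? 0 : 100.0 * hits / expected.Length;
        return Task.FromResult(new MetricResult(
            "KeywordOverlap", score, $"{hits}/{expected.Length} keywords matched"));
    }
}
```

A real LLM judge would replace the overlap computation with a model call and parse the score from the response; the 0-100 scale matches the scores asserted elsewhere in this guide.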
Python pattern:

```python
# Python RAG evaluation (generic pattern)
test_case = TestCase(
    input="What is the return policy?",
    actual_output=agent_response,
    retrieval_context=["Policy doc 1", "Policy doc 2"],
    expected_output="30 days with receipt"
)
faithfulness_score = faithfulness_metric.evaluate(test_case)
relevance_score = relevance_metric.evaluate(test_case)
```

AgentEval equivalent:
```csharp
var context = new EvaluationContext
{
    Input = "What is the return policy?",
    Output = agentResponse,
    Context = new[] { "Policy doc 1", "Policy doc 2" },
    GroundTruth = "30 days with receipt"
};

var faithfulness = await new FaithfulnessMetric(evaluator).EvaluateAsync(context);
var relevance = await new RelevanceMetric(evaluator).EvaluateAsync(context);

// AgentEval adds: fluent assertions on the results
faithfulness.Score.Should().BeGreaterThan(80);
```

Beyond RAG metrics, AgentEval provides capabilities not commonly found in Python frameworks:
```csharp
// Tool usage assertions - verify the agent called the right tools
result.ToolUsage!.Should()
    .HaveCalledTool("SearchDatabase")
    .BeforeTool("FormatResponse")
    .WithArgument("query", "return policy");

// Performance SLAs - latency, cost, tokens
result.Performance!.Should()
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(5))
    .HaveEstimatedCostUnder(0.10m);

// Stochastic evaluation - statistical confidence
var stochasticResult = await stochasticRunner.RunStochasticTestAsync(
    agent, testCase, new StochasticOptions(Runs: 10, SuccessRateThreshold: 0.8));
```

If you've been using YAML/CLI-based evaluation tools, here's how to translate your workflows.
YAML-based config (generic pattern):

```yaml
prompts:
  - "Book a flight to {{destination}}"
providers:
  - model: gpt-4o
    config:
      temperature: 0.7
tests:
  - vars:
      destination: Paris
    assert:
      - type: contains
        value: "booking confirmed"
      - type: cost
        threshold: 0.05
```

AgentEval equivalent:
```csharp
var testCase = new TestCase
{
    Name = "Book flight to Paris",
    Input = "Book a flight to Paris",
    ExpectedContains = new[] { "booking confirmed" }
};

var result = await harness.RunEvaluationAsync(agent, testCase);

// Assertions
result.ActualOutput.Should().Contain("booking confirmed");
result.Performance!.Should().HaveEstimatedCostUnder(0.05m);
```

AgentEval supports the same dataset formats you're used to:
```csharp
// Load from YAML
var yamlLoader = DatasetLoaderFactory.CreateFromExtension(".yaml");
var yamlCases = await yamlLoader.LoadAsync("tests.yaml");

// Load from JSONL
var jsonlLoader = DatasetLoaderFactory.CreateFromExtension(".jsonl");
var jsonlCases = await jsonlLoader.LoadAsync("tests.jsonl");

// Load from CSV — field aliases are recognized automatically
// (e.g., "question" maps to Input, "answer" maps to ExpectedOutput)
var csvLoader = DatasetLoaderFactory.CreateFromExtension(".csv");
var testCases = await csvLoader.LoadAsync("tests.csv");

// Run an evaluation for each test case
foreach (var tc in testCases)
{
    var testCase = tc.ToTestCase(); // Convert DatasetTestCase → TestCase
    var result = await harness.RunEvaluationAsync(agent, testCase);
    // Collect results...
}
```

If you're already using the Microsoft evaluation library, AgentEval builds on the same abstractions with additional capabilities.
Both libraries use Microsoft.Extensions.AI abstractions:

```csharp
// The same IChatClient works in both
IChatClient chatClient = new AzureOpenAIChatClient(endpoint, credential, deployment);

// MS.Extensions.AI.Eval
var evaluator = new ChatClientEvaluator(chatClient);
var score = await evaluator.EvaluateAsync(response, criteria);

// AgentEval - builds on this with agent-specific features
var harness = new MAFEvaluationHarness();
var result = await harness.RunEvaluationAsync(agent, testCase);
```

| Capability | MS.Extensions.AI.Eval | AgentEval |
|---|---|---|
| Basic evaluation | Yes | Yes |
| Tool call tracking | No | Full timeline |
| Tool ordering assertions | No | Yes |
| Stochastic evaluation | Manual | Built-in |
| Model comparison | Manual | With recommendations |
| Trace record/replay | No | Yes |
| Behavioral policies | No | NeverCallTool, etc. |
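The behavioral-policy API itself isn't shown in this guide. As an illustration of the concept only (all types below are stand-ins, not AgentEval's), a `NeverCallTool`-style policy reduces to a predicate over the recorded tool-call timeline:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var timeline = new List<ToolCall>
{
    new("SearchDatabase", DateTimeOffset.UtcNow),
    new("FormatResponse", DateTimeOffset.UtcNow),
};

Console.WriteLine(NeverCalledTool(timeline, "DeleteAccount"));  // True
Console.WriteLine(NeverCalledTool(timeline, "SearchDatabase")); // False

// A NeverCallTool-style policy is just a check against the timeline.
static bool NeverCalledTool(IEnumerable<ToolCall> timeline, string forbidden) =>
    !timeline.Any(c => c.Name.Equals(forbidden, StringComparison.OrdinalIgnoreCase));

// Stand-in for whatever tool-call record AgentEval captures.
record ToolCall(string Name, DateTimeOffset At);
```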
```csharp
// If you have this with MS.Extensions.AI.Eval:
var score = await evaluator.EvaluateAsync(response, "Is this helpful?");

// You can add AgentEval for agent-specific testing:
var result = await harness.RunEvaluationAsync(agent, testCase);

// Get RAG scores (like MS.Extensions.AI.Eval)
var faithfulness = await new FaithfulnessMetric(evaluator).EvaluateAsync(context);

// Plus agent-specific assertions
result.ToolUsage!.Should()
    .HaveCalledTool("SearchProducts")
    .HaveNoErrors();
```

| Concept | Python Frameworks | AgentEval |
|---|---|---|
| Test case | TestCase, LLMTestCase | TestCase |
| Evaluation context | Dict or object | EvaluationContext |
| Metric result | Float score | MetricResult (score + metadata) |
| Test result | Dict | TestResult |
| Assertions | assert statements | Fluent .Should() |
| Type | Python | AgentEval |
|---|---|---|
| RAG metrics | Built-in | IRAGMetric implementations |
| Agent/tool metrics | Often missing | IAgenticMetric implementations |
| Embedding-based | Varies | EmbeddingSimilarityMetric |
| Custom | Inherit base class | Implement IMetric |
| Pattern | CLI Tools | AgentEval |
|---|---|---|
| Single test | CLI command | harness.RunEvaluationAsync() |
| Batch testing | YAML dataset | DatasetLoaderFactory + loop |
| Parallel | Varies | Parallel.ForEachAsync() |
| CI/CD output | Various formats | JUnit XML, Markdown, JSON |
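The Parallel.ForEachAsync pattern from the table can be sketched as follows. `RunEvaluationAsync` here is a local stand-in for `harness.RunEvaluationAsync(agent, tc)`, since a real run needs a live agent; bounding the degree of parallelism keeps you under model-provider rate limits:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

var testCases = new List<TestCase>
{
    new("A", "question 1"), new("B", "question 2"), new("C", "question 3")
};
var results = new ConcurrentBag<TestResult>();

// Run test cases concurrently, at most 4 at a time.
await Parallel.ForEachAsync(
    testCases,
    new ParallelOptions { MaxDegreeOfParallelism = 4 },
    async (tc, ct) => results.Add(await RunEvaluationAsync(tc)));

Console.WriteLine($"{results.Count} results"); // prints "3 results"

// Stand-in for the harness call — replace with harness.RunEvaluationAsync(agent, tc).
static async Task<TestResult> RunEvaluationAsync(TestCase tc)
{
    await Task.Delay(10); // simulate an agent round-trip
    return new TestResult(tc.Name, Passed: true);
}

// Hypothetical minimal shapes for illustration.
record TestCase(string Name, string Input);
record TestResult(string Name, bool Passed);
```

`ConcurrentBag` avoids races on the shared results collection; the order of completion is not deterministic, so collect first and sort or aggregate afterwards.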
Install the package:

```shell
dotnet add package AgentEval --prerelease
```

```csharp
using AgentEval.MAF;
using AgentEval.Models;

// Create a test case (same structure as other frameworks)
var testCase = new TestCase
{
    Name = "Customer Support Query",
    Input = "How do I return a product?",
    GroundTruth = "30-day return policy with receipt"
};

// Run the test
var harness = new MAFEvaluationHarness();
var result = await harness.RunEvaluationAsync(agent, testCase);

// Assert (familiar patterns, more expressive)
Assert.True(result.Passed);
result.ActualOutput.Should().Contain("30 day");
```

Now leverage what AgentEval does uniquely well:
```csharp
// Tool verification
result.ToolUsage!.Should()
    .HaveCalledTool("LookupPolicy")
    .HaveNoErrors();

// Performance SLAs
result.Performance!.Should()
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(3))
    .HaveEstimatedCostUnder(0.05m);

// Stochastic evaluation - run multiple times, analyze the statistics
var stochasticResult = await stochasticRunner.RunStochasticTestAsync(
    agent, testCase,
    new StochasticOptions
    {
        Runs = 10,
        SuccessRateThreshold = 0.8
    });

// Assert on statistical properties
Assert.True(stochasticResult.PassedThreshold);
stochasticResult.Statistics.Mean.Should().BeGreaterThan(75);
```

- Getting Started Guide - Full tutorial
- Code Gallery - Real examples
- Samples - 26 runnable samples
- Issues - Report problems or ask questions
Ready to get started?