
evals: add negative case and strengthen state verification#123

Open
zzstoatzz wants to merge 3 commits into main from eval-improvements

Conversation


@zzstoatzz zzstoatzz commented Jan 9, 2026

Summary

Applies two concrete recommendations from Anthropic's "Demystifying Evals for AI Agents" blog post:

1. Balanced problem sets (negative case)

"Test both the cases where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization."

Added test_no_late_runs.py, a negative case where everything is healthy and the agent should correctly report that there are no issues. This helps verify we aren't doing anything that encourages hallucination.
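A minimal, self-contained sketch of the negative-case idea, assuming a naive string check; `claims_late_runs` and the sample answers are illustrative stand-ins, not the repo's actual eval harness or grading logic:

```python
def claims_late_runs(response: str, deployment: str) -> bool:
    """Naive check: does the response claim this deployment has late runs?"""
    text = response.lower()
    return (
        deployment.lower() in text
        and "late" in text
        and "no late" not in text
    )

# A healthy deployment: the agent should report no issues.
healthy_answer = "Deployment 'etl-nightly' looks healthy: no late runs found."
assert not claims_late_runs(healthy_answer, "etl-nightly")

# A hallucinated problem is exactly what the negative case should catch.
bad_answer = "Deployment 'etl-nightly' has 3 late runs that need attention."
assert claims_late_runs(bad_answer, "etl-nightly")
```

The point of the negative case is the first assertion: without it, an eval suite only rewards finding problems, never correctly finding none.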

2. Outcome state verification

"The outcome is the final state in the environment at the end of the trial."

Enhanced test_work_pool_concurrency with:

  • Direct assertion that response contains the actual work pool name (not just LLM evaluation)
  • Tool call verification that agent inspected work pools
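A minimal sketch of the tool-call verification pattern, assuming a spy object that records tool names as the agent runs; this `ToolCallSpy` is an illustrative stand-in, not the fixture actually used in the repo:

```python
class ToolCallSpy:
    """Records tool names as they are invoked, for post-hoc assertions."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def record(self, tool_name: str) -> None:
        self.calls.append(tool_name)

    def assert_tool_was_called(self, tool_name: str) -> None:
        assert tool_name in self.calls, (
            f"Expected agent to call {tool_name!r}, but saw: {self.calls}"
        )

# In a real eval the harness would call record() for each tool invocation.
spy = ToolCallSpy()
spy.record("get_work_pools")
spy.assert_tool_was_called("get_work_pools")
```

Checking tool calls alongside the final answer verifies the agent actually inspected the environment rather than guessing a plausible-sounding response.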

Other recommendations from the blog post (not in this PR)

For future consideration:

| Recommendation | Notes |
| --- | --- |
| Multiple trials / pass@k | Run evals N times to measure consistency. Would need infrastructure changes. |
| Capability vs. regression markers | Categorize evals: regression evals must stay at 100%, capability evals may fail sometimes. |
| Partial credit scoring | Grade multiple dimensions instead of binary pass/fail. |
| Transcript metrics | Track turns, tool calls, and tokens per eval. |
| Grader calibration | Periodic human review to verify LLM judge accuracy. |
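For the "multiple trials / pass@k" row, a sketch of the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is total trials and c is observed passes; the numbers below are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled trials passes,
    given c passes observed across n total trials."""
    if n - c < k:
        # Fewer failures than samples: at least one pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 trials, 7 passes: pass@1 is just the plain pass rate.
assert abs(pass_at_k(10, 7, 1) - 0.7) < 1e-9
```

This is the estimator popularized for code-generation benchmarks; it avoids the bias of naively sampling k trials and averaging.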

Test plan

  • CI passes on new test_no_late_runs eval
  • CI passes on enhanced test_work_pool_concurrency eval
  • Verify no regressions on other late_runs evals

🤖 Generated with Claude Code

Based on Anthropic's "Demystifying Evals for AI Agents" blog post:

1. Add negative case eval (test_no_late_runs.py)
   - Tests that agent correctly identifies when there are NO late runs
   - Prevents hallucinating problems in healthy scenarios
   - Blog: "Test both cases where behavior should occur and shouldn't"

2. Strengthen test_work_pool_concurrency with state verification
   - Add direct assertion that response mentions actual work pool name
   - Add tool call verification (must call get_work_pools)
   - Blog: "The outcome is the final state in the environment"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions bot commented Jan 9, 2026

📊 Observability

View eval run traces in Logfire: prefect-mcp-server-evals @ e2fad8e


github-actions bot commented Jan 9, 2026

Evaluation Results

17 tests in 1 suite (1 file): 15 ✅, 2 ❌, 0 💤 — took 2m 18s ⏱️

For more details on these failures, see this check.

Results for commit e2fad8e.

♻️ This comment has been updated with latest results.

zzstoatzz and others added 2 commits January 9, 2026 14:18
The test harness is session-scoped, so other tests create late runs that
persist. Rather than fighting the infrastructure, scope the question to
a specific deployment - which is also more realistic.

The agent can mention late runs from OTHER deployments (they're real).
The eval only fails if it claims THIS healthy deployment has late runs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The agent reasonably flagged a Scheduled run as "potentially stuck"
even though it wasn't in Prefect's "Late" state. That's not wrong -
it's cautious.

Fix: only use Running and Completed states which are unambiguously
healthy. No reasonable agent would flag these as problematic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@zzstoatzz zzstoatzz marked this pull request as ready for review January 9, 2026 20:40
Comment on lines +117 to +127
# State verification: agent response must mention the actual work pool name
# This catches cases where the LLM evaluation might pass on vague responses
assert work_pool_name in result.output, (
    f"Response must mention the specific work pool '{work_pool_name}' "
    f"but got: {result.output[:200]}..."
)

# Tool verification: agent should have inspected work pools
tool_call_spy.assert_tool_was_called("get_work_pools")

# LLM evaluation for response quality
Collaborator Author


Is this prescription actually necessary?

Contributor


The eval below already validates that the specific work pool is mentioned 🤷

@zzstoatzz zzstoatzz requested a review from chrisguidry January 9, 2026 20:41

