evals: add negative case and strengthen state verification #123
Open
Conversation
Based on Anthropic's "Demystifying Evals for AI Agents" blog post:

1. Add negative case eval (`test_no_late_runs.py`)
   - Tests that the agent correctly identifies when there are NO late runs
   - Prevents hallucinating problems in healthy scenarios
   - Blog: "Test both cases where behavior should occur and shouldn't"
2. Strengthen `test_work_pool_concurrency` with state verification
   - Add direct assertion that the response mentions the actual work pool name
   - Add tool call verification (must call `get_work_pools`)
   - Blog: "The outcome is the final state in the environment"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
📊 Observability: View eval run traces in Logfire (prefect-mcp-server-evals @ e2fad8e)
Evaluation Results: 17 tests, 15 ✅, 2m 18s ⏱️. For more details on these failures, see this check. Results for commit e2fad8e. ♻️ This comment has been updated with latest results.
The test harness is session-scoped, so other tests create late runs that persist. Rather than fighting the infrastructure, scope the question to a specific deployment - which is also more realistic. The agent can mention late runs from OTHER deployments (they're real). The eval only fails if it claims THIS healthy deployment has late runs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The agent reasonably flagged a Scheduled run as "potentially stuck" even though it wasn't in Prefect's "Late" state. That's not wrong - it's cautious. Fix: only use Running and Completed states which are unambiguously healthy. No reasonable agent would flag these as problematic. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
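The fix above amounts to seeding the fixture only with states no reasonable agent would flag. A minimal sketch, using plain strings rather than the real Prefect state objects:

```python
# Sketch: restrict seeded flow runs to unambiguously healthy states.
# State names mirror Prefect's, but this is plain data, not the API.
HEALTHY_STATES = {"Running", "Completed"}


def seed_states(candidates: list[str]) -> list[str]:
    """Drop states (e.g. Scheduled) a cautious agent might call 'stuck'."""
    return [s for s in candidates if s in HEALTHY_STATES]


assert seed_states(["Running", "Scheduled", "Completed"]) == ["Running", "Completed"]
```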
zzstoatzz
commented
Jan 9, 2026
Comment on lines +117 to +127
```python
# State verification: agent response must mention the actual work pool name
# This catches cases where the LLM evaluation might pass on vague responses
assert work_pool_name in result.output, (
    f"Response must mention the specific work pool '{work_pool_name}' "
    f"but got: {result.output[:200]}..."
)

# Tool verification: agent should have inspected work pools
tool_call_spy.assert_tool_was_called("get_work_pools")

# LLM evaluation for response quality
```
Collaborator
Author
Is this prescription actually necessary?
Contributor
The eval below already validates that the specific work pool is mentioned 🤷
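For context, a minimal sketch of what a tool-call spy like the `tool_call_spy` fixture referenced above could look like; the project's real fixture may be implemented differently.

```python
# Hypothetical spy: the harness records each tool invocation so tests
# can assert on agent behavior, not just on the final text output.
class ToolCallSpy:
    def __init__(self) -> None:
        self.calls: list[str] = []

    def record(self, tool_name: str) -> None:
        """Invoked by the harness whenever the agent calls a tool."""
        self.calls.append(tool_name)

    def assert_tool_was_called(self, tool_name: str) -> None:
        assert tool_name in self.calls, (
            f"expected a call to {tool_name!r}, saw: {self.calls}"
        )


spy = ToolCallSpy()
spy.record("get_work_pools")
spy.assert_tool_was_called("get_work_pools")
```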
chrisguidry
approved these changes
Jan 12, 2026
Summary
Applies two concrete recommendations from Anthropic's "Demystifying Evals for AI Agents" blog post:
1. Balanced problem sets (negative case)
Added `test_no_late_runs.py`, a negative case where everything is healthy and the agent should correctly report no issues. This helps verify we're not doing anything to encourage hallucination.
2. Outcome state verification
Enhanced `test_work_pool_concurrency` with a direct assertion that the response mentions the actual work pool name, plus a tool call check for `get_work_pools`.

Other recommendations from the blog post (not in this PR)
For future consideration:

Test plan
- `test_no_late_runs` eval
- `test_work_pool_concurrency` eval

🤖 Generated with Claude Code