
evals: add negative case and strengthen state verification#123

Open
zzstoatzz wants to merge 3 commits into main from eval-improvements

Conversation


@zzstoatzz zzstoatzz commented Jan 9, 2026

Summary

Applies two concrete recommendations from Anthropic's "Demystifying Evals for AI Agents" blog post:

1. Balanced problem sets (negative case)

"Test both the cases where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization."

Added test_no_late_runs.py, a negative case where everything is healthy and the agent should correctly report that there are no issues. This helps verify we aren't doing anything that encourages hallucination.
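A minimal, self-contained sketch of the negative-case idea, assuming a naive string check; `claims_late_runs` and the sample answers are illustrative stand-ins, not the repo's actual eval harness or grading logic:

```python
def claims_late_runs(response: str, deployment: str) -> bool:
    """Naive check: does the response claim this deployment has late runs?"""
    text = response.lower()
    return (
        deployment.lower() in text
        and "late" in text
        and "no late" not in text
    )

# A healthy deployment: the agent should report no issues.
healthy_answer = "Deployment 'etl-nightly' looks healthy: no late runs found."
assert not claims_late_runs(healthy_answer, "etl-nightly")

# A hallucinated problem is exactly what the negative case should catch.
bad_answer = "Deployment 'etl-nightly' has 3 late runs that need attention."
assert claims_late_runs(bad_answer, "etl-nightly")
```

The point of the negative case is the first assertion: without it, an eval suite only rewards finding problems, never correctly finding none.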

2. Outcome state verification

"The outcome is the final state in the environment at the end of the trial."

Enhanced test_work_pool_concurrency with:

  • Direct assertion that response contains the actual work pool name (not just LLM evaluation)
  • Tool call verification that agent inspected work pools
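A minimal sketch of the tool-call verification pattern, assuming a spy object that records tool names as the agent runs; this `ToolCallSpy` is an illustrative stand-in, not the fixture actually used in the repo:

```python
class ToolCallSpy:
    """Records tool names as they are invoked, for post-hoc assertions."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def record(self, tool_name: str) -> None:
        self.calls.append(tool_name)

    def assert_tool_was_called(self, tool_name: str) -> None:
        assert tool_name in self.calls, (
            f"Expected agent to call {tool_name!r}, but saw: {self.calls}"
        )

# In a real eval the harness would call record() for each tool invocation.
spy = ToolCallSpy()
spy.record("get_work_pools")
spy.assert_tool_was_called("get_work_pools")
```

Checking tool calls alongside the final answer verifies the agent actually inspected the environment rather than guessing a plausible-sounding response.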

Other recommendations from the blog post (not in this PR)

For future consideration:

| Recommendation | Notes |
| --- | --- |
| Multiple trials / pass@k | Run evals N times to measure consistency. Would need infrastructure changes. |
| Capability vs. regression markers | Categorize evals: regression evals must stay at 100%, capability evals may fail sometimes. |
| Partial credit scoring | Grade multiple dimensions instead of binary pass/fail. |
| Transcript metrics | Track turns, tool calls, and tokens per eval. |
| Grader calibration | Periodic human review to verify LLM judge accuracy. |
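For the "multiple trials / pass@k" row, a sketch of the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is total trials and c is observed passes; the numbers below are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled trials passes,
    given c passes observed across n total trials."""
    if n - c < k:
        # Fewer failures than samples: at least one pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 trials, 7 passes: pass@1 is just the plain pass rate.
assert abs(pass_at_k(10, 7, 1) - 0.7) < 1e-9
```

This is the estimator popularized for code-generation benchmarks; it avoids the bias of naively sampling k trials and averaging.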

Test plan

  • CI passes on new test_no_late_runs eval
  • CI passes on enhanced test_work_pool_concurrency eval
  • Verify no regressions on other late_runs evals

🤖 Generated with Claude Code

Based on Anthropic's "Demystifying Evals for AI Agents" blog post:

1. Add negative case eval (test_no_late_runs.py)
   - Tests that agent correctly identifies when there are NO late runs
   - Prevents hallucinating problems in healthy scenarios
   - Blog: "Test both cases where behavior should occur and shouldn't"

2. Strengthen test_work_pool_concurrency with state verification
   - Add direct assertion that response mentions actual work pool name
   - Add tool call verification (must call get_work_pools)
   - Blog: "The outcome is the final state in the environment"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions bot commented Jan 9, 2026

📊 Observability

View eval run traces in Logfire: prefect-mcp-server-evals @ e2fad8e


github-actions bot commented Jan 9, 2026

Evaluation Results

17 tests in 1 suite (1 file): 15 ✅, 2 ❌, 0 💤 — took 2m 18s ⏱️

For more details on these failures, see this check.

Results for commit e2fad8e.

♻️ This comment has been updated with latest results.

zzstoatzz and others added 2 commits January 9, 2026 14:18
The test harness is session-scoped, so other tests create late runs that
persist. Rather than fighting the infrastructure, scope the question to
a specific deployment - which is also more realistic.

The agent can mention late runs from OTHER deployments (they're real).
The eval only fails if it claims THIS healthy deployment has late runs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The agent reasonably flagged a Scheduled run as "potentially stuck"
even though it wasn't in Prefect's "Late" state. That's not wrong -
it's cautious.

Fix: only use Running and Completed states which are unambiguously
healthy. No reasonable agent would flag these as problematic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@zzstoatzz zzstoatzz marked this pull request as ready for review January 9, 2026 20:40
Comment on lines +117 to +127
# State verification: agent response must mention the actual work pool name
# This catches cases where the LLM evaluation might pass on vague responses
assert work_pool_name in result.output, (
    f"Response must mention the specific work pool '{work_pool_name}' "
    f"but got: {result.output[:200]}..."
)

# Tool verification: agent should have inspected work pools
tool_call_spy.assert_tool_was_called("get_work_pools")

# LLM evaluation for response quality
Collaborator Author


Is this prescription actually necessary?

Contributor


The eval below already validates that the specific work pool is mentioned 🤷

@zzstoatzz zzstoatzz requested a review from chrisguidry January 9, 2026 20:41

