Text-only testing SDK for Vision-Agents agents. Includes TestSession, fluent assertion API (RunResult/RunAssert), LLM-based judge, mock_tools, unit tests, integration test examples for 00_example and 01_simple_agent_example.
Covers all public classes, methods, event types, recommended patterns, and architecture rationale for testing at the LLM level.
Replace cursor-based assertion classes (RunAssert, EventAssert, etc.) with scenario-style methods on TestEval: user_says, agent_calls, agent_responds, no_more_events. ~500 lines removed.
Align with core LLM method naming and clarify intent.
TestEval now only handles lifecycle and LLM communication. TestResponse holds data (output, events, function_calls, duration_ms) and assertion methods (agent_calls, judge, no_more_events).
Reflect current API: TestResponse with function_called/function_output/judge, simple_response returns TestResponse, assertions on response not session.
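The cursor-based design described above can be sketched in miniature. This is an illustrative toy, not the package's actual code — the real TestResponse also carries output, duration_ms, and an async judge() — but it shows how assertion methods consume events by advancing a cursor:

```python
from dataclasses import dataclass, field


@dataclass
class FunctionCallEvent:
    name: str
    arguments: dict


@dataclass
class ChatMessageEvent:
    role: str
    content: str


@dataclass
class MiniResponse:
    """Toy version of TestResponse: captured events plus a cursor
    that assertion methods advance as they consume events."""

    events: list = field(default_factory=list)
    _cursor: int = 0

    def function_called(self, name: str) -> FunctionCallEvent:
        # Skip non-matching events; skipped events cannot be revisited.
        while self._cursor < len(self.events):
            event = self.events[self._cursor]
            self._cursor += 1
            if isinstance(event, FunctionCallEvent) and event.name == name:
                return event
        raise AssertionError(f"no FunctionCallEvent named {name!r} remaining")

    def no_more_events(self) -> None:
        if self._cursor < len(self.events):
            remaining = len(self.events) - self._cursor
            raise AssertionError(f"{remaining} unconsumed events")


resp = MiniResponse(events=[
    ChatMessageEvent("assistant", "checking"),
    FunctionCallEvent("get_weather", {"location": "Boulder"}),
])
call = resp.function_called("get_weather")  # chat event before it is skipped
print(call.arguments["location"])
resp.no_more_events()
```

Assertions live on the response object, matching the "assertions on response not session" split noted above.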
Add table of contents, align tables, consolidate architecture section, add second quick start example with tool calls.
- Remove unused TestSession alias from __init__.py
- Remove duplicate docstring, unused logging/os imports from _session.py
- Deduplicate _evals_verbose (import from _run_result instead of redefining)
- Fix class docstring/__test__ ordering in TestEval and TestResponse
- Add -> None return type to _on_tool_start and _on_tool_end
- Make _advance_to_type generic via TypeVar, remove redundant assert isinstance()
- Remove unreachable RuntimeError, annotate _raise_with_debug_info as NoReturn
- Remove dead else branch in _format_events
- Replace except Exception with specific exceptions in _judge.py
- Remove from __future__ import annotations, use quoted forward refs
- Move evaluate_intent import to module level in _run_result.py
Drop VISION_AGENTS_EVALS_VERBOSE env var, associated print() calls, and _evals_verbose flag — premature for v1, avoids documenting and potentially deprecating a public env variable. Remove from __future__ import annotations from _events.py and _mock_tools.py (no forward references, Python 3.10+).
Replace two identical branches with a _VERDICTS mapping lookup, removing duplicated string slicing logic.
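The mapping-lookup refactor might look like the following sketch. The _VERDICTS name comes from the commit message; the exact reply format ("PASS: reason" / "FAIL: reason") and the helper name parse_verdict are assumptions for illustration:

```python
# Maps the leading verdict token to the boolean result; replaces two
# near-identical if/elif branches that each repeated the string slicing.
_VERDICTS: dict[str, bool] = {"PASS": True, "FAIL": False}


def parse_verdict(reply: str) -> tuple[bool, str]:
    """Return (passed, reason) from a judge reply like 'FAIL: too vague'."""
    text = reply.strip()
    for token, passed in _VERDICTS.items():
        if text.upper().startswith(token):
            # Shared slicing logic: drop the token and any ':' separator.
            reason = text[len(token):].lstrip(" :")
            return passed, reason
    raise ValueError(f"judge reply has no PASS/FAIL verdict: {reply!r}")


print(parse_verdict("PASS: greeting matches intent"))
```

Adding a verdict now means adding one mapping entry instead of another branch.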
Extract _truncate, _format_event, _format_events as static methods on TestResponse — they only serve _raise_with_debug_info. Move magic numbers into documented module-level constants.
Docs live in a separate repo, so the in-package README is redundant. Move essential usage examples and key exports summary into the module docstring where help() and IDE tooltips can surface them.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds a new vision_agents.testing package providing event dataclasses, a TestSession async test harness, TestResponse assertion helpers, an LLM-based intent judge, a mock_tools context manager, updated example wiring, and unit/integration tests for the testing primitives.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant TestSession as TestSession
    participant LLM as LLM
    participant EventMgr as EventManager
    participant Conv as InMemoryConversation
    participant Judge as JudgeLLM
    User->>TestSession: __aenter__/start()
    TestSession->>LLM: apply test instructions
    TestSession->>EventMgr: subscribe ToolStart/ToolEnd
    TestSession->>Conv: attach conversation
    User->>TestSession: simple_response("text")
    TestSession->>LLM: send user message
    LLM->>EventMgr: emit ToolStart / ToolEnd
    EventMgr->>TestSession: _on_tool_start / _on_tool_end
    TestSession->>TestSession: record FunctionCallEvent & FunctionCallOutputEvent
    TestSession->>Conv: append assistant message
    TestSession-->>User: return TestResponse (events, output, duration, judge)
    User->>TestResponse: .function_called / .function_output / .judge
    alt .judge uses JudgeLLM
        TestResponse->>Judge: evaluate_intent(message, intent)
        Judge-->>TestResponse: PASS/FAIL verdict
    end
    User->>TestSession: __aexit__/close()
    TestSession->>EventMgr: unsubscribe events
    TestSession->>LLM: restore original instructions
```
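The start/close bracketing in the diagram — apply test instructions, run, restore on exit — follows a save/restore pattern that can be sketched with a stand-in class (FakeLLM and override_instructions are illustrative names, not the package's API):

```python
from contextlib import contextmanager


class FakeLLM:
    """Stand-in exposing only what the pattern needs."""

    def __init__(self) -> None:
        self._instructions = "production instructions"

    def set_instructions(self, text: str) -> None:
        self._instructions = text


@contextmanager
def override_instructions(llm: FakeLLM, test_instructions: str):
    # Save before mutating so the finally block can always restore,
    # even if the body raises.
    original = llm._instructions
    llm.set_instructions(test_instructions)
    try:
        yield llm
    finally:
        llm.set_instructions(original)


llm = FakeLLM()
with override_instructions(llm, "You are under test."):
    print(llm._instructions)  # You are under test.
print(llm._instructions)  # production instructions
```

The same shape covers both the session lifecycle and the judge's temporary system prompt.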
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Switch test imports from internal modules (_events, _run_result, _mock_tools) to the public vision_agents.testing API. Replace direct _advance_to_type() call with function_output() in test_explicit_output_check.
Actionable comments posted: 2
🧹 Nitpick comments (14)
examples/00_example/agent.py (1)

**1-4: Add module-level logger**

The file is missing a `logging` import and `logger = logging.getLogger(__name__)`, which is required for all Python modules per the coding guidelines.

♻️ Proposed addition

```diff
+import logging
+
 from dotenv import load_dotenv
 from vision_agents.core import Agent, AgentLauncher, User, Runner
 from vision_agents.plugins import getstream, gemini
+
+logger = logging.getLogger(__name__)
```

As per coding guidelines: "Use module-level `logger = logging.getLogger(__name__)`."

examples/01_simple_agent_example/simple_agent_example.py (1)
**2-2: Replace deprecated `Dict` with built-in `dict`**

`Dict` from `typing` is deprecated since Python 3.9. The guideline requires modern generic syntax.

♻️ Proposed fix

```diff
-from typing import Any, Dict
+from typing import Any

-async def get_weather(location: str) -> Dict[str, Any]:
+async def get_weather(location: str) -> dict[str, Any]:
```

As per coding guidelines: "Use modern syntax: ... `dict[str, T]` generics".

Also applies to: 38-38

tests/test_testing/test_eval.py (1)
**168-172: Prefer the public API over direct `_cursor` mutation**

`response._cursor = 1` reaches into the private state of `TestResponse`. Use the public assertion API to advance the cursor instead, which also makes the test's intent clearer.

♻️ Proposed refactor

```diff
-    def test_pass_at_end(self):
+    async def test_pass_at_end(self):
         response = _make_response(_simple_events())
-        response._cursor = 1
+        await response.judge()  # consumes the only event, advances cursor past it
         response.no_more_events()
```

agents-core/vision_agents/testing/_run_result.py (3)
**220-232: `_format_event` for `ChatMessageEvent` truncates without `_truncate()`, losing the `...` suffix.**

Line 223 uses `event.content[:_PREVIEW_MAX_LEN]` directly, while `_format_event` for `FunctionCallOutputEvent` (line 230) delegates to `_truncate()`, which appends `"..."` when the text exceeds the limit. Using `_truncate` consistently avoids confusing debug output where a long chat message is silently clipped without any visual indicator.

♻️ Suggested fix

```diff
         if isinstance(event, ChatMessageEvent):
-            preview = event.content[:_PREVIEW_MAX_LEN].replace("\n", "\\n")
+            preview = TestResponse._truncate(
+                event.content.replace("\n", "\\n")
+            )
             return f"ChatMessageEvent(role='{event.role}', content='{preview}')"
```

**195-206: `_advance_to_type` silently skips non-matching events — document this behavior.**

The while loop on line 198 advances past any event that doesn't match `expected_type`. If a user calls `function_called()` but a `ChatMessageEvent` precedes the `FunctionCallEvent`, it is silently consumed. This is consistent with the cursor-based design, but a brief note in the class docstring (or the method docstring) would help users understand they cannot "go back" to skipped events.

**24-40: `_judge_llm` typed as `Any` — consider `LLM | None` for type safety.**

The field is always either an `LLM` instance or `None`. Typing it as `Any` suppresses type-checker feedback on misuse. If avoiding a circular import is the concern, a `TYPE_CHECKING`-guarded import would let you annotate precisely.

As per coding guidelines, "Use type annotations everywhere. Use modern syntax: `X | Y` unions, `dict[str, T]` generics".

agents-core/vision_agents/testing/_session.py (3)
**1-15: Imports that serve only type annotations could be guarded under `TYPE_CHECKING`.**

`EventManager` (line 5) and `RunEvent` (line 13) are used solely for type hints. Per coding guidelines, use the `TYPE_CHECKING` guard for imports only needed by type annotations. This would require adding `from __future__ import annotations` to defer evaluation.

As per coding guidelines, "Use `TYPE_CHECKING` guard for imports only needed by type annotations".

**125-126: Hardcoded 5-second event wait timeout.**

`await self._event_manager.wait(timeout=5.0)` uses a fixed timeout that may be too tight for slow LLM providers or unnecessarily long for fast mocks. Consider making it configurable via `__init__` or a `simple_response` parameter with a sensible default.

**76-85: `close()` does not reset `_conversation` or `_event_manager`, preventing clean restart.**

After `close()`, `_started` is `False`, so `start()` can run again. But `start()` only runs when `not self._started`, and it unconditionally reassigns `_conversation` and `_event_manager`. So re-entry works — no bug here. However, holding references to stale objects after close may keep resources alive longer than necessary.

♻️ Optional cleanup

```diff
     async def close(self) -> None:
         """Clean up resources."""
         if not self._started:
             return
         if self._event_manager is not None:
             self._event_manager.unsubscribe(self._on_tool_start)
             self._event_manager.unsubscribe(self._on_tool_end)
+        self._event_manager = None
+        self._conversation = None
+        self._captured_events.clear()
         self._started = False
```

examples/00_example/test_agent.py (4)
**30-44: Repetitive LLM/judge instantiation across tests — consider a fixture.**

Every test creates `gemini.LLM(MODEL)` for the agent and again for the judge, identically. A function-scoped fixture would DRY this up:

♻️ Sketch

```python
@pytest.fixture
def llm():
    return gemini.LLM(MODEL)


@pytest.fixture
def judge_llm():
    return gemini.LLM(MODEL)
```

Also applies to: 47-63, 66-82, 85-105, 108-139, 142-160

**129-131: The generator-throw lambda is fragile and hard to read.**

`lambda location: (_ for _ in ()).throw(RuntimeError(...))` is a clever trick but obscure. A plain `async def` is clearer and also matches the async signature of the original tool, avoiding any sync/async mismatch:

♻️ Suggested alternative

```diff
-    with mock_tools(
-        llm,
-        {
-            "get_weather": lambda location: (_ for _ in ()).throw(
-                RuntimeError("Service unavailable")
-            )
-        },
-    ):
+    async def _failing_weather(location: str) -> dict[str, str]:
+        raise RuntimeError("Service unavailable")
+
+    with mock_tools(llm, {"get_weather": _failing_weather}):
```

**25-27: Consider using a pytest fixture or `skipUnless` for the API key check.**

Every test calls `_skip_if_no_key()` manually. A session-scoped fixture or `pytest.mark.skipif` at module level would reduce repetition and ensure the skip can't be accidentally omitted in a new test.

♻️ Suggested alternative

```diff
+@pytest.fixture(autouse=True)
+def _require_google_api_key():
+    if not os.getenv("GOOGLE_API_KEY"):
+        pytest.skip("GOOGLE_API_KEY not set")
+
-def _skip_if_no_key():
-    if not os.getenv("GOOGLE_API_KEY"):
-        pytest.skip("GOOGLE_API_KEY not set")
```

Then remove the `_skip_if_no_key()` calls from each test body.

**93-94: Use parameterized generic `dict` in return type annotations.**

The coding guidelines require modern type syntax with full generics. `dict` should be `dict[str, Any]` (or more specific) for the return types.

As per coding guidelines, "Use modern syntax: `X | Y` unions, `dict[str, T]` generics, full `Callable` signatures".

Also applies to: 118-119

agents-core/vision_agents/testing/_judge.py (1)
**59-60: Add public getter for `instructions` to the `LLM` class to avoid accessing private `_instructions` directly.**

Line 59 reads `llm._instructions` (a private attribute) and stores it to restore later. The `LLM` class provides `set_instructions()` as a public method but has no getter — requiring this code to reach into internal state. Add a read-only property `instructions` to the `LLM` base class:

```python
@property
def instructions(self) -> str:
    """Get the current instructions."""
    return self._instructions
```

Then replace line 59 with `original_instructions = llm.instructions`. This maintains the same semantics while respecting encapsulation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agents-core/vision_agents/testing/_mock_tools.py`:
- Around line 44-57: The loop that validates and swaps functions mutates
registry._functions before entering the try/finally, so a KeyError halfway
leaves some tools swapped; change mock_tools to first validate all tool names
against registry._functions (using func_def = registry._functions.get(tool_name)
and raising KeyError if any missing) and only after successful validation
populate originals and perform the swaps (assign func_def.function = mock_fn)
inside the try block so the finally restoration loop (iterating originals and
resetting func_def.function = original_fn) always runs; ensure you still use the
same identifiers (registry._functions, originals, func_def.function, mocks) so
existing logic and the restoration in the finally block remain unchanged.
In `@tests/test_testing/test_mock_tools.py`:
- Around line 9-13: The test stub _FakeLLM subclasses LLM but doesn't implement
all abstract methods and simple_response lacks type annotations; update it so it
either (A) becomes a plain helper function (e.g., fake_simple_response(text: str
= "", **kwargs) -> LLMResponseEvent) used by tests instead of subclassing LLM,
or (B) fully implements all abstract LLM methods and add proper type annotations
to simple_response as async def simple_response(self, text: str = "", **kwargs)
-> LLMResponseEvent so the class can be instantiated; reference _FakeLLM, LLM,
simple_response, and LLMResponseEvent when making the change.
---
Nitpick comments:
In `@agents-core/vision_agents/testing/_judge.py`:
- Around line 59-60: Add a read-only public property instructions to the LLM
class that returns the internal _instructions (i.e., implement property
instructions -> return self._instructions) and then update the caller in
_judge.py to use llm.instructions instead of accessing llm._instructions
directly; keep existing set_instructions(...) behavior intact so callers like
set_instructions(_JUDGE_SYSTEM_PROMPT) still work and original_instructions is
obtained via the new instructions property.
In `@agents-core/vision_agents/testing/_run_result.py`:
- Around line 220-232: The ChatMessageEvent branch in TestResponse._format_event
currently slices event.content directly (event.content[:_PREVIEW_MAX_LEN]) which
silently truncates without the "..." suffix; change it to call
TestResponse._truncate(event.content) (and still replace "\n" with "\\n" on the
truncated result) so that long chat messages get the same "..." indicator as
FunctionCallOutputEvent; update the ChatMessageEvent handling in _format_event
to use TestResponse._truncate and preserve event.role and newline escaping.
- Around line 195-206: The _advance_to_type method silently consumes
non-matching events while advancing the internal cursor, which can cause earlier
events (e.g., a ChatMessageEvent) to be skipped before a later
FunctionCallEvent; update the documentation to make this behavior explicit by
adding a sentence to the class docstring or the _advance_to_type docstring that
states the method advances the cursor forward, skips any events that don't match
expected_type, and that skipped events cannot be revisited (no backtracking),
and reference the method name _advance_to_type and the cursor semantics
(self._cursor) so callers know to check event order before calling helpers like
function_called().
- Around line 24-40: The _judge_llm field is currently typed as Any—change it to
a precise LLM | None annotation to restore type-safety: add a
TYPE_CHECKING-guarded import (from typing import TYPE_CHECKING; if
TYPE_CHECKING: from <appropriate_module> import LLM) to avoid circular imports,
then update the TestResponse field declaration from "_judge_llm: Any =
field(default=None, repr=False)" to "_judge_llm: LLM | None =
field(default=None, repr=False)" (keeping default and repr settings intact) so
type checkers see the correct type while runtime imports remain safe.
In `@agents-core/vision_agents/testing/_session.py`:
- Around line 1-15: Imports used only for type annotations (EventManager and
RunEvent) should be guarded by TYPE_CHECKING and defer evaluation: add "from
__future__ import annotations" at the top, import "from typing import
TYPE_CHECKING", then move the EventManager and RunEvent imports into an "if
TYPE_CHECKING:" block; update any references to those symbols (EventManager,
RunEvent) so they remain as forward-referenced types. Ensure runtime behavior is
unchanged and only annotation-only imports are moved.
- Around line 125-126: The hardcoded 5.0s in await
self._event_manager.wait(timeout=5.0) should be made configurable: add an event
wait timeout parameter (e.g., event_wait_timeout: float = 5.0) to the class
__init__ (or to the simple_response entry point if more appropriate), store it
as self._event_wait_timeout, and replace the literal 5.0 in the
_event_manager.wait call with self._event_wait_timeout; update any
instantiations/tests that rely on the previous behavior to pass a shorter
timeout for mocks or leave the default for real providers.
- Around line 76-85: The close() method unsubscribes handlers and flips _started
but leaves references to _conversation and _event_manager, which can keep
resources alive; update close() to also set self._conversation = None and
self._event_manager = None (after unsubscribing _on_tool_start/_on_tool_end) so
the session releases stale objects and can fully clean up before a restart;
retain existing unsubscribe logic for _on_tool_start and _on_tool_end and only
nullify after those calls.
In `@examples/00_example/agent.py`:
- Around line 1-4: Add a module-level logger by importing the logging module and
creating logger = logging.getLogger(__name__) at the top of the module (near the
other imports) so this file (which defines/uses Agent, AgentLauncher, User,
Runner and imports getstream/gemini) follows the project coding guidelines for
module-level logging.
In `@examples/00_example/test_agent.py`:
- Around line 30-44: Multiple tests repeatedly instantiate gemini.LLM(MODEL);
introduce pytest fixtures to DRY this up by creating a function-scoped fixture
(e.g., llm) that returns gemini.LLM(MODEL) and another fixture (e.g., judge_llm)
for the judge LLM, then update tests like test_greeting to accept llm and
judge_llm as parameters and pass those into TestEval instead of constructing
gemini.LLM(MODEL) inline; ensure fixtures are imported/defined in the test
module so all tests that use TestEval (and other tests noted) reuse the
fixtures.
- Around line 129-131: Replace the obscure generator-throw lambda used for
"get_weather" with an async function that simply raises RuntimeError to match
the original tool's async signature; locate the "get_weather" entry in the test
agent setup (the lambda: (_ for _ in ()).throw(RuntimeError("Service
unavailable"))) and convert it to an async def get_weather(...) that raises
RuntimeError("Service unavailable") so the stub is readable and has the correct
async behavior.
- Around line 25-27: Replace the manual _skip_if_no_key() calls with a
centralized pytest skip so tests can't forget to check the env var: either add a
module-level pytestmark = pytest.mark.skipif(not os.getenv("GOOGLE_API_KEY"),
reason="GOOGLE_API_KEY not set") or create a session-scoped fixture (e.g.,
require_google_api_key) that checks os.getenv("GOOGLE_API_KEY") and calls
pytest.skip(...) if missing, then remove all calls to _skip_if_no_key() from
individual test functions; keep references to the existing helper name
_skip_if_no_key in case you want to deprecate/redirect it to the fixture for
backwards compatibility.
- Around line 93-94: The return type for get_weather uses an unparameterized
dict; update its annotation to a parameterized generic such as dict[str, Any]
(or a more specific mapping like dict[str, int | str]) and import Any from
typing; also find any other functions in this file that currently return bare
dict (e.g., the later async function returning a dict) and update their return
annotations similarly to use dict[str, Any] or a more specific type to follow
the modern typing guidelines.
In `@examples/01_simple_agent_example/simple_agent_example.py`:
- Line 2: Replace the deprecated typing.Dict usage with the built-in generic
dict: update the import line to remove Dict (keep Any if used) and change type
annotations like Dict[...] to dict[...] (e.g., in simple_agent_example.py
replace any Dict[str, Any] or Dict[...] at the import and at the usage on line
38 with dict[str, Any] or the appropriate dict[...] form); ensure all
occurrences of Dict are removed from imports and replaced in annotations while
preserving Any and other types.
In `@tests/test_testing/test_eval.py`:
- Around line 168-172: Replace the direct private-state mutation
response._cursor = 1 with the public API that advances or asserts consumption on
TestResponse; instead of setting _cursor directly in test_pass_at_end, call the
appropriate public method on response (for example response.consume_event() or
response.assert_events_consumed(1) / response.advance(n) depending on the
available API) so the test uses _make_response/_simple_events and
response.no_more_events() without touching private attributes.
_FakeLLM is a fake (working substitute with simplified logic), not a mock (call recorder with verification). The "never mock" guideline refers to unittest.mock / mock.patch, not test fakes. No change needed there, but the missing type annotations were a valid gap.
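The fake/mock distinction can be made concrete. This sketch is illustrative — the FakeLLM and simple_response names mirror the discussion above but are not the repository's classes:

```python
import asyncio
from unittest.mock import Mock


class FakeLLM:
    """A fake: a working substitute with real but simplified behavior."""

    async def simple_response(self, text: str = "") -> str:
        # Behaves like the real method, just trivially: echo a canned reply.
        return f"echo: {text}"


# A mock, by contrast, records calls so a test can verify interactions.
mock_llm = Mock()
mock_llm.simple_response("hi")
mock_llm.simple_response.assert_called_once_with("hi")

print(asyncio.run(FakeLLM().simple_response("hi")))  # echo: hi
```

A "never mock" guideline targets the second style (patching and call verification), not working fakes like the first.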
When mocks contained a valid tool followed by an unregistered one, the valid tool was already replaced before the KeyError was raised. Since the error happened before the try block, finally never ran and the LLM was left in a permanently mutated state. Fix: validate all tool names before swapping any implementations.
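The validate-then-swap fix can be sketched with a plain dict standing in for the real function registry (the registry shape and mock_tools signature here are simplified for illustration):

```python
from contextlib import contextmanager
from typing import Any, Callable


@contextmanager
def mock_tools(registry: dict[str, Callable[..., Any]],
               mocks: dict[str, Callable[..., Any]]):
    # Phase 1: validate every name before touching the registry, so a bad
    # name cannot leave it half-swapped.
    missing = [name for name in mocks if name not in registry]
    if missing:
        raise KeyError(f"unregistered tools: {missing}")
    originals = {name: registry[name] for name in mocks}
    try:
        # Phase 2: swap only after validation succeeded.
        registry.update(mocks)
        yield
    finally:
        registry.update(originals)


registry = {"get_weather": lambda loc: "sunny"}
try:
    with mock_tools(registry, {"get_weather": lambda loc: "rain",
                               "nope": lambda: None}):
        pass
except KeyError:
    pass
print(registry["get_weather"]("x"))  # sunny — untouched after failed validation
```

Because validation raises before any mutation, the finally-based restore never has partial state to repair.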
🧹 Nitpick comments (1)
agents-core/vision_agents/testing/_mock_tools.py (1)
47-55: Consider moving the swap loop inside thetryblock for belt-and-suspenders safety.If an unforeseen error were raised mid-swap (line 52), the
finallyblock on line 56 would never execute, leaving already-swapped tools unreachable. Moving the swap loop insidetrymeans any partially-populatedoriginalsdict would still be cleaned up byfinally. The practical risk is near-zero, but the fix is trivial and makes the guarantee airtight.♻️ Proposed restructure
```diff
     originals: dict[str, Callable[..., Any]] = {}
-    for tool_name, mock_fn in mocks.items():
-        func_def = registry._functions[tool_name]
-        originals[tool_name] = func_def.function
-        func_def.function = mock_fn
     try:
+        for tool_name, mock_fn in mocks.items():
+            func_def = registry._functions[tool_name]
+            originals[tool_name] = func_def.function
+            func_def.function = mock_fn
         yield
     finally:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_mock_tools.py` around lines 47 - 55, The swap loop that replaces registry._functions[tool_name].function with mock_fns should be moved inside the try block so that the finally cleanup always runs even if an exception occurs mid-swap; specifically, in the context manager where originals: dict[str, Callable[..., Any]] = {} and you iterate over mocks.items(), perform the lookup of func_def, store originals[tool_name] = func_def.function and assign func_def.function = mock_fn inside the try before yielding, and keep the existing finally block to restore originals from originals[tool_name] — this ensures partially-swapped entries are tracked and will be restored.
🧹 Nitpick comments (2)
agents-core/vision_agents/testing/_mock_tools.py (2)
`49-56`: **Move the swap loop inside the `try` block for complete restoration coverage.**

The swap loop runs before `try`, so an unexpected exception during `func_def.function = mock_fn` (e.g., a read-only property on an unusual `func_def` type) would leave already-swapped tools unrestored. In practice, attribute assignment on a mutable object is infallible, but moving the loop inside the `try` block eliminates the residual theoretical gap at no cost.

♻️ Proposed refactor
```diff
     originals: dict[str, Callable[..., Any]] = {}
+    try:
         for tool_name, mock_fn in mocks.items():
             func_def = registry._functions[tool_name]
             originals[tool_name] = func_def.function
             func_def.function = mock_fn
-    try:
         yield
     finally:
         for tool_name, original_fn in originals.items():
             registry._functions[tool_name].function = original_fn
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_mock_tools.py` around lines 49 - 56, Move the swap loop that iterates over mocks.items() into the try block so that originals are recorded and replacements applied inside the try scope (using registry._functions, func_def.function, and originals dict) and the finally block can reliably restore originals even if an exception occurs during assignment; specifically, perform the for tool_name, mock_fn in mocks.items(): lookup func_def = registry._functions[tool_name], set originals[tool_name] = func_def.function, then set func_def.function = mock_fn all inside the try, leaving the finally to iterate originals and restore func_def.function.
`41-52`: **Use `registry.get_function(name)` instead of directly accessing `registry._functions`.**

Lines 44, 50, and 58 reach into the private `_functions` dict. The `FunctionRegistry` class provides a public `get_function(name)` method that should be used instead. While `.function` is a public mutable attribute on `FunctionDefinition` and `_ExplicitSchemaFunction`, the indirect private dict access creates coupling: if `FunctionRegistry` ever restructures its internal storage, this breaks silently.

For a more robust design, consider adding public methods like `swap_implementation(name, fn)` and `restore_implementation(name, fn)` to `FunctionRegistry` to decouple mocking logic from internal implementation details.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_mock_tools.py` around lines 41 - 52, The code currently reaches into registry._functions to look up and replace FunctionDefinition implementations; replace those direct accesses with the public API by calling registry.get_function(tool_name) to retrieve the FunctionDefinition (or _ExplicitSchemaFunction) and check for None, then read/assign the .function attribute as before; update the lookup in the validation loop and the swap loop (where originals: dict[...] is populated and func_def.function = mock_fn occurs). Optionally, if FunctionRegistry supports or you add methods like swap_implementation(name, fn) and restore_implementation(name, fn), prefer calling those instead of mutating .function directly to decouple mocking logic from internal storage.
TestEval → TestSession: the class manages a session, not an evaluation. Fix _FakeLLM.simple_response to match the base LLM signature. Add "never use from __future__ import annotations" rule to CLAUDE.md.
Actionable comments posted: 2
🧹 Nitpick comments (2)
agents-core/vision_agents/testing/_session.py (1)
`50-55`: **`__aenter__`/`__aexit__` belong after private helpers per the project's method-order guideline.**

The guideline specifies: `__init__` → public lifecycle → properties → public feature methods → private helpers → dunder methods. Currently `__aenter__`/`__aexit__` are placed between `__init__` and `start`/`close`, which inverts that order.

♻️ Proposed reordering
```diff
     def __init__(self, ...) -> None: ...

-    async def __aenter__(self) -> "TestSession":
-        await self.start()
-        return self
-
-    async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
-        await self.close()
-
     async def start(self) -> None: ...
     async def close(self) -> None: ...

     @property
     def llm(self) -> LLM: ...

     async def simple_response(self, text: str) -> TestResponse: ...

     async def _on_tool_start(self, event: ToolStartEvent) -> None: ...
     async def _on_tool_end(self, event: ToolEndEvent) -> None: ...

+    async def __aenter__(self) -> "TestSession":
+        await self.start()
+        return self
+
+    async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
+        await self.close()
```

As per coding guidelines: "Order class methods as: `__init__`, public lifecycle methods, properties, public feature methods, private helpers, dunder methods."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_session.py` around lines 50 - 55, Move the async dunder context managers __aenter__ and __aexit__ to after the class's private helper methods so they follow the project's method-order guideline; locate the TestSession class methods __aenter__ and __aexit__ and cut them from their current position (now before start and close) and paste them below the private helpers section (after any methods like _<private_helper_name>), ensuring start and close remain in the public lifecycle area and the dunder methods appear last.

agents-core/vision_agents/testing/_run_result.py (1)
`220-224`: **`_format_event` truncates `ChatMessageEvent` content without the ellipsis marker.**

Line 223 slices directly (`[:_PREVIEW_MAX_LEN]`), which silently drops the rest of the string with no indication. Line 230 (the `FunctionCallOutputEvent` branch) correctly uses `_truncate()`, which appends `"..."`. Debug output for assistant messages that exceed the limit will look complete even though it was cut, making failure context harder to read.

♻️ Proposed fix — use `_truncate` for consistency
```diff
         if isinstance(event, ChatMessageEvent):
-            preview = event.content[:_PREVIEW_MAX_LEN].replace("\n", "\\n")
+            preview = TestResponse._truncate(event.content).replace("\n", "\\n")
             return f"ChatMessageEvent(role='{event.role}', content='{preview}')"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_run_result.py` around lines 220 - 224, The ChatMessageEvent branch in _format_event currently slices content with event.content[:_PREVIEW_MAX_LEN] which silently drops text; update it to use the existing _truncate helper (same behavior as FunctionCallOutputEvent) so preview = _truncate(event.content, _PREVIEW_MAX_LEN). Keep the replace("\n", "\\n") and the f-string return unchanged, just build preview via _truncate to ensure an ellipsis is appended when truncated.
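For context, the behavior the review attributes to `_truncate` (cut at a preview limit and append `"..."` when content was dropped) amounts to something like the following sketch; the function name and the exact limit are illustrative, not taken from the actual module:

```python
_PREVIEW_MAX_LEN = 80  # assumed preview limit, for illustration only


def truncate(text: str, max_len: int = _PREVIEW_MAX_LEN) -> str:
    """Return text unchanged if it fits, otherwise cut it and append an ellipsis marker."""
    if len(text) <= max_len:
        return text
    return text[:max_len] + "..."
```

The point of the review comment is simply that both `_format_event` branches should go through one helper like this, so truncation is always visible in debug output.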
```python
if arguments is not None:
    for key, value in arguments.items():
        actual = event.arguments.get(key)
        if actual != value:
            self._raise_with_debug_info(
                f"For argument '{key}', expected {value!r}, got {actual!r}"
            )
```
Partial-match silently passes when the expected value is None and the key is absent.
event.arguments.get(key) returns None for a missing key. If a test asserts arguments={"city": None} but the actual call doesn't include "city" at all, actual == None == value and no failure is raised — masking a genuine argument mismatch.
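The masking is easy to reproduce in isolation. A minimal sketch of the two comparison strategies (function names here are illustrative, not from the source):

```python
def mismatches_lenient(actual: dict[str, object], expected: dict[str, object]) -> list[str]:
    # Buggy strategy: .get() maps a missing key to None, so an expected value
    # of None can never be distinguished from an absent argument.
    return [key for key, value in expected.items() if actual.get(key) != value]


def mismatches_strict(actual: dict[str, object], expected: dict[str, object]) -> list[str]:
    # Fixed strategy: key presence is checked before the value comparison.
    return [
        key
        for key, value in expected.items()
        if key not in actual or actual[key] != value
    ]
```

With `actual = {"units": "metric"}` and `expected = {"city": None}`, the lenient check reports no mismatch, while the strict one flags `"city"`.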
🐛 Proposed fix — check key presence separately
```diff
     for key, value in arguments.items():
-        actual = event.arguments.get(key)
-        if actual != value:
+        if key not in event.arguments:
+            self._raise_with_debug_info(
+                f"Argument '{key}' not present in actual call arguments {list(event.arguments.keys())}"
+            )
+        actual = event.arguments[key]
+        if actual != value:
             self._raise_with_debug_info(
                 f"For argument '{key}', expected {value!r}, got {actual!r}"
             )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if arguments is not None:
    for key, value in arguments.items():
        if key not in event.arguments:
            self._raise_with_debug_info(
                f"Argument '{key}' not present in actual call arguments {list(event.arguments.keys())}"
            )
        actual = event.arguments[key]
        if actual != value:
            self._raise_with_debug_info(
                f"For argument '{key}', expected {value!r}, got {actual!r}"
            )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents-core/vision_agents/testing/_run_result.py` around lines 97 - 103, The
current argument comparison uses event.arguments.get(key) which returns None for
missing keys, so arguments={"city": None} will falsely pass when "city" is
absent; update the check in the arguments loop (the block that currently calls
event.arguments.get(key) and then compares actual != value) to first test key
presence (e.g., if key not in event.arguments) and raise via
_raise_with_debug_info for missing keys, otherwise compare event.arguments[key]
to the expected value and call _raise_with_debug_info on mismatch; keep using
the existing _raise_with_debug_info to surface failures.
```python
self._captured_events.clear()
self._capturing = True

if self._conversation is not None:
    await self._conversation.send_message(
        role="user",
        user_id="test-user",
        content=text,
    )

try:
    response = await self._llm.simple_response(text=text)

    if self._event_manager is not None:
        await self._event_manager.wait(timeout=5.0)

finally:
    self._capturing = False
```
_capturing stays True if send_message raises before the try/finally guard.
self._capturing = True is set at line 113, but the finally: self._capturing = False only covers lines 122–129. If await self._conversation.send_message(...) (lines 116–120) raises, _capturing is never reset. Any subsequent reuse of the session (e.g., after catching the exception) would encounter stale _capturing = True state, and tool events from unrelated coroutines could leak into _captured_events.
🐛 Proposed fix — extend the try/finally to cover send_message
```diff
         self._captured_events.clear()
         self._capturing = True
+        try:
             if self._conversation is not None:
                 await self._conversation.send_message(
                     role="user",
                     user_id="test-user",
                     content=text,
                 )
-        try:
             response = await self._llm.simple_response(text=text)

             if self._event_manager is not None:
                 await self._event_manager.wait(timeout=5.0)

         finally:
             self._capturing = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents-core/vision_agents/testing/_session.py` around lines 112 - 129, The
_capturing flag is set before calling self._conversation.send_message but the
try/finally that clears it only surrounds the LLM call, so if send_message
raises _capturing stays True; fix by expanding the try/finally to begin
immediately after self._captured_events.clear() (or by placing send_message
inside the existing try) so that any exception from
self._conversation.send_message or await self._llm.simple_response will always
execute the finally that sets self._capturing = False; keep the existing
behavior of awaiting self._event_manager.wait when present and ensure
_captured_events is still cleared at the start of the operation.
## Motivation
Testing conversational AI agents today requires spinning up audio/video infrastructure, edge connections, and real model calls for every assertion. This makes tests hard to write.
`vision_agents.testing` provides a lightweight, text-only testing layer that lets you verify agent behavior — tool calls, arguments, responses, and intent — using familiar pytest patterns.

## What's included

### Core API

- `TestSession` — async context manager that wraps an LLM for testing. Manages session lifecycle, captures events (tool calls, outputs, and messages), and returns structured results.
- `TestResponse` — returned by `simple_response()`. Carries events, timing, output text, and cursor-based assertion methods:
  - `function_called(name, arguments=)` — assert a tool was called with expected args (partial match)
  - `function_output(output=, is_error=)` — assert tool output
  - `judge(intent=)` — assert the assistant message (optionally evaluated by a judge LLM)
  - `no_more_events()` — assert nothing unexpected happened after your checks
- `mock_tools(llm, {...})` — context manager to temporarily swap tool implementations without changing the schema visible to the LLM

### LLM-as-judge

- `evaluate_intent()` — sends the agent's message plus a target intent to a separate judge LLM and gets a structured PASS/FAIL verdict. Used automatically when you call `response.judge(intent="...")`.

### Event types

- `ChatMessageEvent`, `FunctionCallEvent`, `FunctionCallOutputEvent` — normalized dataclasses representing what happened during a turn
- `RunEvent` — union type of all three

## Usage
Verify a greeting:
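The code block for this example was lost in extraction; based on the API described above, it presumably looked something like the following sketch. How the `llm` instance is built (`make_llm()` here is a hypothetical helper) and whether `judge()` must be awaited are assumptions:

```python
from vision_agents.testing import TestSession


async def test_greeting():
    llm = make_llm()  # hypothetical helper: build the LLM however your suite does
    async with TestSession(llm) as session:
        response = await session.simple_response("Hello!")
        # Evaluated by the judge LLM against the target intent.
        await response.judge(intent="greets the user back")
        response.no_more_events()
```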
Verify tool calls and response:
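A sketch of the tool-call example under the same assumptions; the `get_weather` tool and its `city` argument are illustrative:

```python
from vision_agents.testing import TestSession


async def test_weather_tool():
    llm = make_llm()  # hypothetical helper
    async with TestSession(llm) as session:
        response = await session.simple_response("What's the weather in Boulder?")
        # Partial match: only the listed keys are checked; extra arguments are ignored.
        response.function_called("get_weather", arguments={"city": "Boulder"})
        # function_called() auto-skips the paired FunctionCallOutputEvent.
        await response.judge(intent="reports the weather for Boulder")
        response.no_more_events()
```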
Mock tools for deterministic testing:
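And a mocking sketch: `mock_tools` swaps the implementation while leaving the schema the LLM sees untouched, so the model still decides to call the tool, but the return value is deterministic. Tool name and signature are again illustrative:

```python
from vision_agents.testing import TestSession, mock_tools


async def test_weather_deterministic():
    llm = make_llm()  # hypothetical helper
    # Original implementation is restored when the context manager exits.
    with mock_tools(llm, {"get_weather": lambda city: "Sunny, 22°C"}):
        async with TestSession(llm) as session:
            response = await session.simple_response("Weather in Boulder?")
            response.function_called("get_weather")
            await response.judge(intent="says it is sunny")
```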
## Design decisions

- `function_called()` auto-skips the following `FunctionCallOutputEvent` for convenience. `judge()` works without intent evaluation too.
- `function_called("tool", arguments={"key": "val"})` only checks the specified keys. Extra arguments are ignored.
- No `@pytest.mark.asyncio` needed (`asyncio_mode = auto`); clean tracebacks via `__tracebackhide__`.

## Files
## Test plan

- `uv run py.test tests/test_testing/ -m "not integration" -n auto` — all unit tests pass
- `uv run ruff check .` — no lint issues
- `uv run ruff format --check .` — formatted
- `uv run mypy` — no type errors
- `uv run py.test examples/01_simple_agent_example/ -m integration`

## Summary by CodeRabbit
New Features
Tests