Text-only testing SDK for Vision-Agents agents. Includes TestSession, fluent assertion API (RunResult/RunAssert), LLM-based judge, mock_tools, unit tests, integration test examples for 00_example and 01_simple_agent_example.
Covers all public classes, methods, event types, recommended patterns, and architecture rationale for testing at the LLM level.
Replace cursor-based assertion classes (RunAssert, EventAssert, etc.) with scenario-style methods on TestEval: user_says, agent_calls, agent_responds, no_more_events. ~500 lines removed.
Align with core LLM method naming and clarify intent.
TestEval now only handles lifecycle and LLM communication. TestResponse holds data (output, events, function_calls, duration_ms) and assertion methods (agent_calls, judge, no_more_events).
Reflect current API: TestResponse with function_called/function_output/judge, simple_response returns TestResponse, assertions on response not session.
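The cursor-based design described above can be sketched in miniature. This is an illustrative toy, not the package's actual code — the real TestResponse also carries output, duration_ms, and an async judge() — but it shows how assertion methods consume events by advancing a cursor:

```python
from dataclasses import dataclass, field


@dataclass
class FunctionCallEvent:
    name: str
    arguments: dict


@dataclass
class ChatMessageEvent:
    role: str
    content: str


@dataclass
class MiniResponse:
    """Toy version of TestResponse: captured events plus a cursor
    that assertion methods advance as they consume events."""

    events: list = field(default_factory=list)
    _cursor: int = 0

    def function_called(self, name: str) -> FunctionCallEvent:
        # Skip non-matching events; skipped events cannot be revisited.
        while self._cursor < len(self.events):
            event = self.events[self._cursor]
            self._cursor += 1
            if isinstance(event, FunctionCallEvent) and event.name == name:
                return event
        raise AssertionError(f"no FunctionCallEvent named {name!r} remaining")

    def no_more_events(self) -> None:
        if self._cursor < len(self.events):
            remaining = len(self.events) - self._cursor
            raise AssertionError(f"{remaining} unconsumed events")


resp = MiniResponse(events=[
    ChatMessageEvent("assistant", "checking"),
    FunctionCallEvent("get_weather", {"location": "Boulder"}),
])
call = resp.function_called("get_weather")  # chat event before it is skipped
print(call.arguments["location"])
resp.no_more_events()
```

Assertions live on the response object, matching the "assertions on response not session" split noted above.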
Add table of contents, align tables, consolidate architecture section, add second quick start example with tool calls.
- Remove unused TestSession alias from __init__.py
- Remove duplicate docstring, unused logging/os imports from _session.py
- Deduplicate _evals_verbose (import from _run_result instead of redefining)
- Fix class docstring/__test__ ordering in TestEval and TestResponse
- Add -> None return type to _on_tool_start and _on_tool_end
- Make _advance_to_type generic via TypeVar, remove redundant assert isinstance()
- Remove unreachable RuntimeError, annotate _raise_with_debug_info as NoReturn
- Remove dead else branch in _format_events
- Replace except Exception with specific exceptions in _judge.py
- Remove from __future__ import annotations, use quoted forward refs
- Move evaluate_intent import to module level in _run_result.py
Drop VISION_AGENTS_EVALS_VERBOSE env var, associated print() calls, and _evals_verbose flag — premature for v1, avoids documenting and potentially deprecating a public env variable. Remove from __future__ import annotations from _events.py and _mock_tools.py (no forward references, Python 3.10+).
Replace two identical branches with a _VERDICTS mapping lookup, removing duplicated string slicing logic.
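The mapping-lookup refactor might look like the following sketch. The _VERDICTS name comes from the commit message; the exact reply format ("PASS: reason" / "FAIL: reason") and the helper name parse_verdict are assumptions for illustration:

```python
# Maps the leading verdict token to the boolean result; replaces two
# near-identical if/elif branches that each repeated the string slicing.
_VERDICTS: dict[str, bool] = {"PASS": True, "FAIL": False}


def parse_verdict(reply: str) -> tuple[bool, str]:
    """Return (passed, reason) from a judge reply like 'FAIL: too vague'."""
    text = reply.strip()
    for token, passed in _VERDICTS.items():
        if text.upper().startswith(token):
            # Shared slicing logic: drop the token and any ':' separator.
            reason = text[len(token):].lstrip(" :")
            return passed, reason
    raise ValueError(f"judge reply has no PASS/FAIL verdict: {reply!r}")


print(parse_verdict("PASS: greeting matches intent"))
```

Adding a verdict now means adding one mapping entry instead of another branch.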
Extract _truncate, _format_event, _format_events as static methods on TestResponse — they only serve _raise_with_debug_info. Move magic numbers into documented module-level constants.
Docs live in a separate repo, so the in-package README is redundant. Move essential usage examples and key exports summary into the module docstring where help() and IDE tooltips can surface them.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds a new vision_agents.testing package providing event dataclasses, a TestSession async test harness, TestResponse assertion helpers, an LLM-based intent judge, a mock_tools context manager, updated example wiring, and unit/integration tests for the testing primitives.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant TestSession as TestSession
    participant LLM as LLM
    participant EventMgr as EventManager
    participant Conv as InMemoryConversation
    participant Judge as JudgeLLM
    User->>TestSession: __aenter__/start()
    TestSession->>LLM: apply test instructions
    TestSession->>EventMgr: subscribe ToolStart/ToolEnd
    TestSession->>Conv: attach conversation
    User->>TestSession: simple_response("text")
    TestSession->>LLM: send user message
    LLM->>EventMgr: emit ToolStart / ToolEnd
    EventMgr->>TestSession: _on_tool_start / _on_tool_end
    TestSession->>TestSession: record FunctionCallEvent & FunctionCallOutputEvent
    TestSession->>Conv: append assistant message
    TestSession-->>User: return TestResponse (events, output, duration, judge)
    User->>TestResponse: .function_called / .function_output / .judge
    alt .judge uses JudgeLLM
        TestResponse->>Judge: evaluate_intent(message, intent)
        Judge-->>TestResponse: PASS/FAIL verdict
    end
    User->>TestSession: __aexit__/close()
    TestSession->>EventMgr: unsubscribe events
    TestSession->>LLM: restore original instructions
```
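The start/close bracketing in the diagram — apply test instructions, run, restore on exit — follows a save/restore pattern that can be sketched with a stand-in class (FakeLLM and override_instructions are illustrative names, not the package's API):

```python
from contextlib import contextmanager


class FakeLLM:
    """Stand-in exposing only what the pattern needs."""

    def __init__(self) -> None:
        self._instructions = "production instructions"

    def set_instructions(self, text: str) -> None:
        self._instructions = text


@contextmanager
def override_instructions(llm: FakeLLM, test_instructions: str):
    # Save before mutating so the finally block can always restore,
    # even if the body raises.
    original = llm._instructions
    llm.set_instructions(test_instructions)
    try:
        yield llm
    finally:
        llm.set_instructions(original)


llm = FakeLLM()
with override_instructions(llm, "You are under test."):
    print(llm._instructions)  # You are under test.
print(llm._instructions)  # production instructions
```

The same shape covers both the session lifecycle and the judge's temporary system prompt.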
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Switch test imports from internal modules (_events, _run_result, _mock_tools) to the public vision_agents.testing API. Replace direct _advance_to_type() call with function_output() in test_explicit_output_check.
Actionable comments posted: 2
🧹 Nitpick comments (14)
examples/00_example/agent.py (1)

**1-4: Add module-level logger**

The file is missing a `logging` import and `logger = logging.getLogger(__name__)`, which is required for all Python modules per the coding guidelines.

♻️ Proposed addition

```diff
+import logging
+
 from dotenv import load_dotenv
 from vision_agents.core import Agent, AgentLauncher, User, Runner
 from vision_agents.plugins import getstream, gemini
+
+logger = logging.getLogger(__name__)
```

As per coding guidelines: "Use module-level `logger = logging.getLogger(__name__)`."

examples/01_simple_agent_example/simple_agent_example.py (1)
**2-2: Replace deprecated `Dict` with built-in `dict`**

`Dict` from `typing` is deprecated since Python 3.9. The guideline requires modern generic syntax.

♻️ Proposed fix

```diff
-from typing import Any, Dict
+from typing import Any

-async def get_weather(location: str) -> Dict[str, Any]:
+async def get_weather(location: str) -> dict[str, Any]:
```

As per coding guidelines: "Use modern syntax: ... `dict[str, T]` generics".

Also applies to: 38-38

tests/test_testing/test_eval.py (1)
**168-172: Prefer the public API over direct `_cursor` mutation**

`response._cursor = 1` reaches into the private state of `TestResponse`. Use the public assertion API to advance the cursor instead, which also makes the test's intent clearer.

♻️ Proposed refactor

```diff
-    def test_pass_at_end(self):
+    async def test_pass_at_end(self):
         response = _make_response(_simple_events())
-        response._cursor = 1
+        await response.judge()  # consumes the only event, advances cursor past it
         response.no_more_events()
```

agents-core/vision_agents/testing/_run_result.py (3)
**220-232: `_format_event` for `ChatMessageEvent` truncates without `_truncate()`, losing the `...` suffix.**

Line 223 uses `event.content[:_PREVIEW_MAX_LEN]` directly, while `_format_event` for `FunctionCallOutputEvent` (line 230) delegates to `_truncate()`, which appends `"..."` when the text exceeds the limit. Using `_truncate` consistently avoids confusing debug output where a long chat message is silently clipped without any visual indicator.

♻️ Suggested fix

```diff
         if isinstance(event, ChatMessageEvent):
-            preview = event.content[:_PREVIEW_MAX_LEN].replace("\n", "\\n")
+            preview = TestResponse._truncate(
+                event.content.replace("\n", "\\n")
+            )
             return f"ChatMessageEvent(role='{event.role}', content='{preview}')"
```

**195-206: `_advance_to_type` silently skips non-matching events — document this behavior.**

The while loop on line 198 advances past any event that doesn't match `expected_type`. If a user calls `function_called()` but a `ChatMessageEvent` precedes the `FunctionCallEvent`, it is silently consumed. This is consistent with the cursor-based design, but a brief note in the class docstring (or the method docstring) would help users understand they cannot "go back" to skipped events.

**24-40: `_judge_llm` typed as `Any` — consider `LLM | None` for type safety.**

The field is always either an `LLM` instance or `None`. Typing it as `Any` suppresses type-checker feedback on misuse. If avoiding a circular import is the concern, a `TYPE_CHECKING`-guarded import would let you annotate precisely.

As per coding guidelines, "Use type annotations everywhere. Use modern syntax: `X | Y` unions, `dict[str, T]` generics".

agents-core/vision_agents/testing/_session.py (3)
**1-15: Imports that serve only type annotations could be guarded under `TYPE_CHECKING`.**

`EventManager` (line 5) and `RunEvent` (line 13) are used solely for type hints. Per coding guidelines, use the `TYPE_CHECKING` guard for imports only needed by type annotations. This would require adding `from __future__ import annotations` to defer evaluation.

As per coding guidelines, "Use `TYPE_CHECKING` guard for imports only needed by type annotations".

**125-126: Hardcoded 5-second event wait timeout.**

`await self._event_manager.wait(timeout=5.0)` uses a fixed timeout that may be too tight for slow LLM providers or unnecessarily long for fast mocks. Consider making it configurable via `__init__` or a `simple_response` parameter with a sensible default.

**76-85: `close()` does not reset `_conversation` or `_event_manager`, preventing clean restart.**

After `close()`, `_started` is `False`, so `start()` can run again. But `start()` only runs when `not self._started`, and it unconditionally reassigns `_conversation` and `_event_manager`. So re-entry works — no bug here. However, holding references to stale objects after close may keep resources alive longer than necessary.

♻️ Optional cleanup

```diff
     async def close(self) -> None:
         """Clean up resources."""
         if not self._started:
             return
         if self._event_manager is not None:
             self._event_manager.unsubscribe(self._on_tool_start)
             self._event_manager.unsubscribe(self._on_tool_end)
+        self._event_manager = None
+        self._conversation = None
+        self._captured_events.clear()
         self._started = False
```

examples/00_example/test_agent.py (4)
**30-44: Repetitive LLM/judge instantiation across tests — consider a fixture.**

Every test creates `gemini.LLM(MODEL)` for the agent and again for the judge, identically. A function-scoped fixture would DRY this up:

♻️ Sketch

```python
@pytest.fixture
def llm():
    return gemini.LLM(MODEL)


@pytest.fixture
def judge_llm():
    return gemini.LLM(MODEL)
```

Also applies to: 47-63, 66-82, 85-105, 108-139, 142-160

**129-131: The generator-throw lambda is fragile and hard to read.**

`lambda location: (_ for _ in ()).throw(RuntimeError(...))` is a clever trick but obscure. A plain `async def` is clearer and also matches the async signature of the original tool, avoiding any sync/async mismatch:

♻️ Suggested alternative

```diff
-    with mock_tools(
-        llm,
-        {
-            "get_weather": lambda location: (_ for _ in ()).throw(
-                RuntimeError("Service unavailable")
-            )
-        },
-    ):
+    async def _failing_weather(location: str) -> dict[str, str]:
+        raise RuntimeError("Service unavailable")
+
+    with mock_tools(llm, {"get_weather": _failing_weather}):
```

**25-27: Consider using a pytest fixture or `skipUnless` for the API key check.**

Every test calls `_skip_if_no_key()` manually. A session-scoped fixture or `pytest.mark.skipif` at module level would reduce repetition and ensure the skip can't be accidentally omitted in a new test.

♻️ Suggested alternative

```diff
+@pytest.fixture(autouse=True)
+def _require_google_api_key():
+    if not os.getenv("GOOGLE_API_KEY"):
+        pytest.skip("GOOGLE_API_KEY not set")
+
-def _skip_if_no_key():
-    if not os.getenv("GOOGLE_API_KEY"):
-        pytest.skip("GOOGLE_API_KEY not set")
```

Then remove the `_skip_if_no_key()` calls from each test body.

**93-94: Use parameterized generic `dict` in return type annotations.**

The coding guidelines require modern type syntax with full generics. `dict` should be `dict[str, Any]` (or more specific) for the return types.

As per coding guidelines, "Use modern syntax: `X | Y` unions, `dict[str, T]` generics, full `Callable` signatures".

Also applies to: 118-119

agents-core/vision_agents/testing/_judge.py (1)
**59-60: Add public getter for `instructions` to the `LLM` class to avoid accessing private `_instructions` directly.**

Line 59 reads `llm._instructions` (a private attribute) and stores it to restore later. The `LLM` class provides `set_instructions()` as a public method but has no getter — requiring this code to reach into internal state. Add a read-only property `instructions` to the `LLM` base class:

```python
@property
def instructions(self) -> str:
    """Get the current instructions."""
    return self._instructions
```

Then replace line 59 with `original_instructions = llm.instructions`. This maintains the same semantics while respecting encapsulation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agents-core/vision_agents/testing/_mock_tools.py`:
- Around line 44-57: The loop that validates and swaps functions mutates
registry._functions before entering the try/finally, so a KeyError halfway
leaves some tools swapped; change mock_tools to first validate all tool names
against registry._functions (using func_def = registry._functions.get(tool_name)
and raising KeyError if any missing) and only after successful validation
populate originals and perform the swaps (assign func_def.function = mock_fn)
inside the try block so the finally restoration loop (iterating originals and
resetting func_def.function = original_fn) always runs; ensure you still use the
same identifiers (registry._functions, originals, func_def.function, mocks) so
existing logic and the restoration in the finally block remain unchanged.
In `@tests/test_testing/test_mock_tools.py`:
- Around line 9-13: The test stub _FakeLLM subclasses LLM but doesn't implement
all abstract methods and simple_response lacks type annotations; update it so it
either (A) becomes a plain helper function (e.g., fake_simple_response(text: str
= "", **kwargs) -> LLMResponseEvent) used by tests instead of subclassing LLM,
or (B) fully implements all abstract LLM methods and add proper type annotations
to simple_response as async def simple_response(self, text: str = "", **kwargs)
-> LLMResponseEvent so the class can be instantiated; reference _FakeLLM, LLM,
simple_response, and LLMResponseEvent when making the change.
---
Nitpick comments:
In `@agents-core/vision_agents/testing/_judge.py`:
- Around line 59-60: Add a read-only public property instructions to the LLM
class that returns the internal _instructions (i.e., implement property
instructions -> return self._instructions) and then update the caller in
_judge.py to use llm.instructions instead of accessing llm._instructions
directly; keep existing set_instructions(...) behavior intact so callers like
set_instructions(_JUDGE_SYSTEM_PROMPT) still work and original_instructions is
obtained via the new instructions property.
In `@agents-core/vision_agents/testing/_run_result.py`:
- Around line 220-232: The ChatMessageEvent branch in TestResponse._format_event
currently slices event.content directly (event.content[:_PREVIEW_MAX_LEN]) which
silently truncates without the "..." suffix; change it to call
TestResponse._truncate(event.content) (and still replace "\n" with "\\n" on the
truncated result) so that long chat messages get the same "..." indicator as
FunctionCallOutputEvent; update the ChatMessageEvent handling in _format_event
to use TestResponse._truncate and preserve event.role and newline escaping.
- Around line 195-206: The _advance_to_type method silently consumes
non-matching events while advancing the internal cursor, which can cause earlier
events (e.g., a ChatMessageEvent) to be skipped before a later
FunctionCallEvent; update the documentation to make this behavior explicit by
adding a sentence to the class docstring or the _advance_to_type docstring that
states the method advances the cursor forward, skips any events that don't match
expected_type, and that skipped events cannot be revisited (no backtracking),
and reference the method name _advance_to_type and the cursor semantics
(self._cursor) so callers know to check event order before calling helpers like
function_called().
- Around line 24-40: The _judge_llm field is currently typed as Any—change it to
a precise LLM | None annotation to restore type-safety: add a
TYPE_CHECKING-guarded import (from typing import TYPE_CHECKING; if
TYPE_CHECKING: from <appropriate_module> import LLM) to avoid circular imports,
then update the TestResponse field declaration from "_judge_llm: Any =
field(default=None, repr=False)" to "_judge_llm: LLM | None =
field(default=None, repr=False)" (keeping default and repr settings intact) so
type checkers see the correct type while runtime imports remain safe.
In `@agents-core/vision_agents/testing/_session.py`:
- Around line 1-15: Imports used only for type annotations (EventManager and
RunEvent) should be guarded by TYPE_CHECKING and defer evaluation: add "from
__future__ import annotations" at the top, import "from typing import
TYPE_CHECKING", then move the EventManager and RunEvent imports into an "if
TYPE_CHECKING:" block; update any references to those symbols (EventManager,
RunEvent) so they remain as forward-referenced types. Ensure runtime behavior is
unchanged and only annotation-only imports are moved.
- Around line 125-126: The hardcoded 5.0s in await
self._event_manager.wait(timeout=5.0) should be made configurable: add an event
wait timeout parameter (e.g., event_wait_timeout: float = 5.0) to the class
__init__ (or to the simple_response entry point if more appropriate), store it
as self._event_wait_timeout, and replace the literal 5.0 in the
_event_manager.wait call with self._event_wait_timeout; update any
instantiations/tests that rely on the previous behavior to pass a shorter
timeout for mocks or leave the default for real providers.
- Around line 76-85: The close() method unsubscribes handlers and flips _started
but leaves references to _conversation and _event_manager, which can keep
resources alive; update close() to also set self._conversation = None and
self._event_manager = None (after unsubscribing _on_tool_start/_on_tool_end) so
the session releases stale objects and can fully clean up before a restart;
retain existing unsubscribe logic for _on_tool_start and _on_tool_end and only
nullify after those calls.
In `@examples/00_example/agent.py`:
- Around line 1-4: Add a module-level logger by importing the logging module and
creating logger = logging.getLogger(__name__) at the top of the module (near the
other imports) so this file (which defines/uses Agent, AgentLauncher, User,
Runner and imports getstream/gemini) follows the project coding guidelines for
module-level logging.
In `@examples/00_example/test_agent.py`:
- Around line 30-44: Multiple tests repeatedly instantiate gemini.LLM(MODEL);
introduce pytest fixtures to DRY this up by creating a function-scoped fixture
(e.g., llm) that returns gemini.LLM(MODEL) and another fixture (e.g., judge_llm)
for the judge LLM, then update tests like test_greeting to accept llm and
judge_llm as parameters and pass those into TestEval instead of constructing
gemini.LLM(MODEL) inline; ensure fixtures are imported/defined in the test
module so all tests that use TestEval (and other tests noted) reuse the
fixtures.
- Around line 129-131: Replace the obscure generator-throw lambda used for
"get_weather" with an async function that simply raises RuntimeError to match
the original tool's async signature; locate the "get_weather" entry in the test
agent setup (the lambda: (_ for _ in ()).throw(RuntimeError("Service
unavailable"))) and convert it to an async def get_weather(...) that raises
RuntimeError("Service unavailable") so the stub is readable and has the correct
async behavior.
- Around line 25-27: Replace the manual _skip_if_no_key() calls with a
centralized pytest skip so tests can't forget to check the env var: either add a
module-level pytestmark = pytest.mark.skipif(not os.getenv("GOOGLE_API_KEY"),
reason="GOOGLE_API_KEY not set") or create a session-scoped fixture (e.g.,
require_google_api_key) that checks os.getenv("GOOGLE_API_KEY") and calls
pytest.skip(...) if missing, then remove all calls to _skip_if_no_key() from
individual test functions; keep references to the existing helper name
_skip_if_no_key in case you want to deprecate/redirect it to the fixture for
backwards compatibility.
- Around line 93-94: The return type for get_weather uses an unparameterized
dict; update its annotation to a parameterized generic such as dict[str, Any]
(or a more specific mapping like dict[str, int | str]) and import Any from
typing; also find any other functions in this file that currently return bare
dict (e.g., the later async function returning a dict) and update their return
annotations similarly to use dict[str, Any] or a more specific type to follow
the modern typing guidelines.
In `@examples/01_simple_agent_example/simple_agent_example.py`:
- Line 2: Replace the deprecated typing.Dict usage with the built-in generic
dict: update the import line to remove Dict (keep Any if used) and change type
annotations like Dict[...] to dict[...] (e.g., in simple_agent_example.py
replace any Dict[str, Any] or Dict[...] at the import and at the usage on line
38 with dict[str, Any] or the appropriate dict[...] form); ensure all
occurrences of Dict are removed from imports and replaced in annotations while
preserving Any and other types.
In `@tests/test_testing/test_eval.py`:
- Around line 168-172: Replace the direct private-state mutation
response._cursor = 1 with the public API that advances or asserts consumption on
TestResponse; instead of setting _cursor directly in test_pass_at_end, call the
appropriate public method on response (for example response.consume_event() or
response.assert_events_consumed(1) / response.advance(n) depending on the
available API) so the test uses _make_response/_simple_events and
response.no_more_events() without touching private attributes.
_FakeLLM is a fake (working substitute with simplified logic), not a mock (call recorder with verification). The "never mock" guideline refers to unittest.mock / mock.patch, not test fakes. No change needed there, but the missing type annotations were a valid gap.
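The fake/mock distinction can be made concrete. This sketch is illustrative — the FakeLLM and simple_response names mirror the discussion above but are not the repository's classes:

```python
import asyncio
from unittest.mock import Mock


class FakeLLM:
    """A fake: a working substitute with real but simplified behavior."""

    async def simple_response(self, text: str = "") -> str:
        # Behaves like the real method, just trivially: echo a canned reply.
        return f"echo: {text}"


# A mock, by contrast, records calls so a test can verify interactions.
mock_llm = Mock()
mock_llm.simple_response("hi")
mock_llm.simple_response.assert_called_once_with("hi")

print(asyncio.run(FakeLLM().simple_response("hi")))  # echo: hi
```

A "never mock" guideline targets the second style (patching and call verification), not working fakes like the first.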
When mocks contained a valid tool followed by an unregistered one, the valid tool was already replaced before the KeyError was raised. Since the error happened before the try block, finally never ran and the LLM was left in a permanently mutated state. Fix: validate all tool names before swapping any implementations.
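The validate-then-swap fix can be sketched with a plain dict standing in for the real function registry (the registry shape and mock_tools signature here are simplified for illustration):

```python
from contextlib import contextmanager
from typing import Any, Callable


@contextmanager
def mock_tools(registry: dict[str, Callable[..., Any]],
               mocks: dict[str, Callable[..., Any]]):
    # Phase 1: validate every name before touching the registry, so a bad
    # name cannot leave it half-swapped.
    missing = [name for name in mocks if name not in registry]
    if missing:
        raise KeyError(f"unregistered tools: {missing}")
    originals = {name: registry[name] for name in mocks}
    try:
        # Phase 2: swap only after validation succeeded.
        registry.update(mocks)
        yield
    finally:
        registry.update(originals)


registry = {"get_weather": lambda loc: "sunny"}
try:
    with mock_tools(registry, {"get_weather": lambda loc: "rain",
                               "nope": lambda: None}):
        pass
except KeyError:
    pass
print(registry["get_weather"]("x"))  # sunny — untouched after failed validation
```

Because validation raises before any mutation, the finally-based restore never has partial state to repair.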
🧹 Nitpick comments (1)
agents-core/vision_agents/testing/_mock_tools.py (1)
47-55: Consider moving the swap loop inside thetryblock for belt-and-suspenders safety.If an unforeseen error were raised mid-swap (line 52), the
finallyblock on line 56 would never execute, leaving already-swapped tools unreachable. Moving the swap loop insidetrymeans any partially-populatedoriginalsdict would still be cleaned up byfinally. The practical risk is near-zero, but the fix is trivial and makes the guarantee airtight.♻️ Proposed restructure
```diff
     originals: dict[str, Callable[..., Any]] = {}
-    for tool_name, mock_fn in mocks.items():
-        func_def = registry._functions[tool_name]
-        originals[tool_name] = func_def.function
-        func_def.function = mock_fn
     try:
+        for tool_name, mock_fn in mocks.items():
+            func_def = registry._functions[tool_name]
+            originals[tool_name] = func_def.function
+            func_def.function = mock_fn
         yield
     finally:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_mock_tools.py` around lines 47 - 55, The swap loop that replaces registry._functions[tool_name].function with mock_fns should be moved inside the try block so that the finally cleanup always runs even if an exception occurs mid-swap; specifically, in the context manager where originals: dict[str, Callable[..., Any]] = {} and you iterate over mocks.items(), perform the lookup of func_def, store originals[tool_name] = func_def.function and assign func_def.function = mock_fn inside the try before yielding, and keep the existing finally block to restore originals from originals[tool_name] — this ensures partially-swapped entries are tracked and will be restored.
🧹 Nitpick comments (2)
agents-core/vision_agents/testing/_mock_tools.py (2)
`49-56`: **Move the swap loop inside the `try` block for complete restoration coverage.**

The swap loop runs before `try`, so an unexpected exception during `func_def.function = mock_fn` (e.g., a read-only property on an unusual `func_def` type) would leave already-swapped tools unrestored. In practice, attribute assignment on a mutable object is infallible, but moving the loop inside the `try` block eliminates the residual theoretical gap at no cost.

♻️ Proposed refactor
```diff
     originals: dict[str, Callable[..., Any]] = {}
+    try:
         for tool_name, mock_fn in mocks.items():
             func_def = registry._functions[tool_name]
             originals[tool_name] = func_def.function
             func_def.function = mock_fn
-    try:
         yield
     finally:
         for tool_name, original_fn in originals.items():
             registry._functions[tool_name].function = original_fn
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_mock_tools.py` around lines 49 - 56, Move the swap loop that iterates over mocks.items() into the try block so that originals are recorded and replacements applied inside the try scope (using registry._functions, func_def.function, and originals dict) and the finally block can reliably restore originals even if an exception occurs during assignment; specifically, perform the for tool_name, mock_fn in mocks.items(): lookup func_def = registry._functions[tool_name], set originals[tool_name] = func_def.function, then set func_def.function = mock_fn all inside the try, leaving the finally to iterate originals and restore func_def.function.
`41-52`: **Use `registry.get_function(name)` instead of directly accessing `registry._functions`.**

Lines 44, 50, and 58 reach into the private `_functions` dict. The `FunctionRegistry` class provides a public `get_function(name)` method that should be used instead. While `.function` is a public mutable attribute on `FunctionDefinition` and `_ExplicitSchemaFunction`, the indirect private dict access creates coupling: if `FunctionRegistry` ever restructures its internal storage, this breaks silently.

For a more robust design, consider adding public methods like `swap_implementation(name, fn)` and `restore_implementation(name, fn)` to `FunctionRegistry` to decouple mocking logic from internal implementation details.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_mock_tools.py` around lines 41 - 52, The code currently reaches into registry._functions to look up and replace FunctionDefinition implementations; replace those direct accesses with the public API by calling registry.get_function(tool_name) to retrieve the FunctionDefinition (or _ExplicitSchemaFunction) and check for None, then read/assign the .function attribute as before; update the lookup in the validation loop and the swap loop (where originals: dict[...] is populated and func_def.function = mock_fn occurs). Optionally, if FunctionRegistry supports or you add methods like swap_implementation(name, fn) and restore_implementation(name, fn), prefer calling those instead of mutating .function directly to decouple mocking logic from internal storage.
TestEval → TestSession: the class manages a session, not an evaluation. Fix _FakeLLM.simple_response to match the base LLM signature. Add "never use from __future__ import annotations" rule to CLAUDE.md.
Actionable comments posted: 2
🧹 Nitpick comments (2)
agents-core/vision_agents/testing/_session.py (1)
`50-55`: **`__aenter__`/`__aexit__` belong after private helpers per the project's method-order guideline.**

The guideline specifies: `__init__` → public lifecycle → properties → public feature methods → private helpers → dunder methods. Currently `__aenter__`/`__aexit__` are placed between `__init__` and `start`/`close`, which inverts that order.

♻️ Proposed reordering
```diff
     def __init__(self, ...) -> None: ...

-    async def __aenter__(self) -> "TestSession":
-        await self.start()
-        return self
-
-    async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
-        await self.close()
-
     async def start(self) -> None: ...
     async def close(self) -> None: ...

     @property
     def llm(self) -> LLM: ...

     async def simple_response(self, text: str) -> TestResponse: ...

     async def _on_tool_start(self, event: ToolStartEvent) -> None: ...
     async def _on_tool_end(self, event: ToolEndEvent) -> None: ...

+    async def __aenter__(self) -> "TestSession":
+        await self.start()
+        return self
+
+    async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
+        await self.close()
```

As per coding guidelines: "Order class methods as: `__init__`, public lifecycle methods, properties, public feature methods, private helpers, dunder methods."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_session.py` around lines 50 - 55, Move the async dunder context managers __aenter__ and __aexit__ to after the class's private helper methods so they follow the project's method-order guideline; locate the TestSession class methods __aenter__ and __aexit__ and cut them from their current position (now before start and close) and paste them below the private helpers section (after any methods like _<private_helper_name>), ensuring start and close remain in the public lifecycle area and the dunder methods appear last.

agents-core/vision_agents/testing/_run_result.py (1)
`220-224`: **`_format_event` truncates `ChatMessageEvent` content without the ellipsis marker.**

Line 223 slices directly (`[:_PREVIEW_MAX_LEN]`), which silently drops the rest of the string with no indication. Line 230 (the `FunctionCallOutputEvent` branch) correctly uses `_truncate()`, which appends `"..."`. Debug output for assistant messages that exceed the limit will look complete even though it was cut, making failure context harder to read.

♻️ Proposed fix — use `_truncate` for consistency
```diff
         if isinstance(event, ChatMessageEvent):
-            preview = event.content[:_PREVIEW_MAX_LEN].replace("\n", "\\n")
+            preview = TestResponse._truncate(event.content).replace("\n", "\\n")
             return f"ChatMessageEvent(role='{event.role}', content='{preview}')"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_run_result.py` around lines 220 - 224, The ChatMessageEvent branch in _format_event currently slices content with event.content[:_PREVIEW_MAX_LEN] which silently drops text; update it to use the existing _truncate helper (same behavior as FunctionCallOutputEvent) so preview = _truncate(event.content, _PREVIEW_MAX_LEN). Keep the replace("\n", "\\n") and the f-string return unchanged, just build preview via _truncate to ensure an ellipsis is appended when truncated.
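For context, the behavior the review attributes to `_truncate` (cut at a preview limit and append `"..."` when content was dropped) amounts to something like the following sketch; the function name and the exact limit are illustrative, not taken from the actual module:

```python
_PREVIEW_MAX_LEN = 80  # assumed preview limit, for illustration only


def truncate(text: str, max_len: int = _PREVIEW_MAX_LEN) -> str:
    """Return text unchanged if it fits, otherwise cut it and append an ellipsis marker."""
    if len(text) <= max_len:
        return text
    return text[:max_len] + "..."
```

The point of the review comment is simply that both `_format_event` branches should go through one helper like this, so truncation is always visible in debug output.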
```python
if arguments is not None:
    for key, value in arguments.items():
        actual = event.arguments.get(key)
        if actual != value:
            self._raise_with_debug_info(
                f"For argument '{key}', expected {value!r}, got {actual!r}"
            )
```
Partial-match silently passes when the expected value is None and the key is absent.
event.arguments.get(key) returns None for a missing key. If a test asserts arguments={"city": None} but the actual call doesn't include "city" at all, actual == None == value and no failure is raised — masking a genuine argument mismatch.
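The masking is easy to reproduce in isolation. A minimal sketch of the two comparison strategies (function names here are illustrative, not from the source):

```python
def mismatches_lenient(actual: dict[str, object], expected: dict[str, object]) -> list[str]:
    # Buggy strategy: .get() maps a missing key to None, so an expected value
    # of None can never be distinguished from an absent argument.
    return [key for key, value in expected.items() if actual.get(key) != value]


def mismatches_strict(actual: dict[str, object], expected: dict[str, object]) -> list[str]:
    # Fixed strategy: key presence is checked before the value comparison.
    return [
        key
        for key, value in expected.items()
        if key not in actual or actual[key] != value
    ]
```

With `actual = {"units": "metric"}` and `expected = {"city": None}`, the lenient check reports no mismatch, while the strict one flags `"city"`.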
🐛 Proposed fix — check key presence separately
```diff
     for key, value in arguments.items():
-        actual = event.arguments.get(key)
-        if actual != value:
+        if key not in event.arguments:
+            self._raise_with_debug_info(
+                f"Argument '{key}' not present in actual call arguments {list(event.arguments.keys())}"
+            )
+        actual = event.arguments[key]
+        if actual != value:
             self._raise_with_debug_info(
                 f"For argument '{key}', expected {value!r}, got {actual!r}"
             )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if arguments is not None:
    for key, value in arguments.items():
        if key not in event.arguments:
            self._raise_with_debug_info(
                f"Argument '{key}' not present in actual call arguments {list(event.arguments.keys())}"
            )
        actual = event.arguments[key]
        if actual != value:
            self._raise_with_debug_info(
                f"For argument '{key}', expected {value!r}, got {actual!r}"
            )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents-core/vision_agents/testing/_run_result.py` around lines 97 - 103, The
current argument comparison uses event.arguments.get(key) which returns None for
missing keys, so arguments={"city": None} will falsely pass when "city" is
absent; update the check in the arguments loop (the block that currently calls
event.arguments.get(key) and then compares actual != value) to first test key
presence (e.g., if key not in event.arguments) and raise via
_raise_with_debug_info for missing keys, otherwise compare event.arguments[key]
to the expected value and call _raise_with_debug_info on mismatch; keep using
the existing _raise_with_debug_info to surface failures.
```python
self._captured_events.clear()
self._capturing = True

if self._conversation is not None:
    await self._conversation.send_message(
        role="user",
        user_id="test-user",
        content=text,
    )

try:
    response = await self._llm.simple_response(text=text)

    if self._event_manager is not None:
        await self._event_manager.wait(timeout=5.0)

finally:
    self._capturing = False
```
_capturing stays True if send_message raises before the try/finally guard.
self._capturing = True is set at line 113, but the finally: self._capturing = False only covers lines 122–129. If await self._conversation.send_message(...) (lines 116–120) raises, _capturing is never reset. Any subsequent reuse of the session (e.g., after catching the exception) would encounter stale _capturing = True state, and tool events from unrelated coroutines could leak into _captured_events.
🐛 Proposed fix — extend the try/finally to cover send_message
```diff
         self._captured_events.clear()
         self._capturing = True
+        try:
             if self._conversation is not None:
                 await self._conversation.send_message(
                     role="user",
                     user_id="test-user",
                     content=text,
                 )
-        try:
             response = await self._llm.simple_response(text=text)

             if self._event_manager is not None:
                 await self._event_manager.wait(timeout=5.0)

         finally:
             self._capturing = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents-core/vision_agents/testing/_session.py` around lines 112 - 129, The
_capturing flag is set before calling self._conversation.send_message but the
try/finally that clears it only surrounds the LLM call, so if send_message
raises _capturing stays True; fix by expanding the try/finally to begin
immediately after self._captured_events.clear() (or by placing send_message
inside the existing try) so that any exception from
self._conversation.send_message or await self._llm.simple_response will always
execute the finally that sets self._capturing = False; keep the existing
behavior of awaiting self._event_manager.wait when present and ensure
_captured_events is still cleared at the start of the operation.
## Motivation
Testing conversational AI agents today requires spinning up audio/video infrastructure, edge connections, and real model calls for every assertion. This makes tests hard to write.
`vision_agents.testing` provides a lightweight, text-only testing layer that lets you verify agent behavior — tool calls, arguments, responses, and intent — using familiar pytest patterns.

## What's included

### Core API

- `TestSession` — async context manager that wraps an LLM for testing. Manages session lifecycle, captures events (tool calls, outputs, and messages), and returns structured results.
- `TestResponse` — returned by `simple_response()`. Carries events, timing, output text, and cursor-based assertion methods:
  - `function_called(name, arguments=)` — assert a tool was called with expected args (partial match)
  - `function_output(output=, is_error=)` — assert tool output
  - `judge(intent=)` — assert the assistant message (optionally evaluated by a judge LLM)
  - `no_more_events()` — assert nothing unexpected happened after your checks
- `mock_tools(llm, {...})` — context manager to temporarily swap tool implementations without changing the schema visible to the LLM

### LLM-as-judge

- `evaluate_intent()` — sends the agent's message plus a target intent to a separate judge LLM and gets a structured PASS/FAIL verdict. Used automatically when you call `response.judge(intent="...")`.

### Event types

- `ChatMessageEvent`, `FunctionCallEvent`, `FunctionCallOutputEvent` — normalized dataclasses representing what happened during a turn
- `RunEvent` — union type of all three

## Usage
Verify a greeting:
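The code block for this example was lost in extraction; based on the API described above, it presumably looked something like the following sketch. How the `llm` instance is built (`make_llm()` here is a hypothetical helper) and whether `judge()` must be awaited are assumptions:

```python
from vision_agents.testing import TestSession


async def test_greeting():
    llm = make_llm()  # hypothetical helper: build the LLM however your suite does
    async with TestSession(llm) as session:
        response = await session.simple_response("Hello!")
        # Evaluated by the judge LLM against the target intent.
        await response.judge(intent="greets the user back")
        response.no_more_events()
```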
Verify tool calls and response:
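A sketch of the tool-call example under the same assumptions; the `get_weather` tool and its `city` argument are illustrative:

```python
from vision_agents.testing import TestSession


async def test_weather_tool():
    llm = make_llm()  # hypothetical helper
    async with TestSession(llm) as session:
        response = await session.simple_response("What's the weather in Boulder?")
        # Partial match: only the listed keys are checked; extra arguments are ignored.
        response.function_called("get_weather", arguments={"city": "Boulder"})
        # function_called() auto-skips the paired FunctionCallOutputEvent.
        await response.judge(intent="reports the weather for Boulder")
        response.no_more_events()
```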
Mock tools for deterministic testing:
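And a mocking sketch: `mock_tools` swaps the implementation while leaving the schema the LLM sees untouched, so the model still decides to call the tool, but the return value is deterministic. Tool name and signature are again illustrative:

```python
from vision_agents.testing import TestSession, mock_tools


async def test_weather_deterministic():
    llm = make_llm()  # hypothetical helper
    # Original implementation is restored when the context manager exits.
    with mock_tools(llm, {"get_weather": lambda city: "Sunny, 22°C"}):
        async with TestSession(llm) as session:
            response = await session.simple_response("Weather in Boulder?")
            response.function_called("get_weather")
            await response.judge(intent="says it is sunny")
```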
## Design decisions

- `function_called()` auto-skips the following `FunctionCallOutputEvent` for convenience. `judge()` works without intent evaluation too.
- `function_called("tool", arguments={"key": "val"})` only checks the specified keys. Extra arguments are ignored.
- No `@pytest.mark.asyncio` needed (`asyncio_mode = auto`); clean tracebacks via `__tracebackhide__`.

## Files
## Test plan

- `uv run py.test tests/test_testing/ -m "not integration" -n auto` — all unit tests pass
- `uv run ruff check .` — no lint issues
- `uv run ruff format --check .` — formatted
- `uv run mypy` — no type errors
- `uv run py.test examples/01_simple_agent_example/ -m integration`

## Summary by CodeRabbit
New Features
Tests