Unified agent loop for Responses API: design and implementation plan #1316

Description

Summary

Replace the three surface-specific Responses agent loops (OpenAI non-streaming + streaming, gRPC regular, gRPC harmony) with one shared agent loop used by every surface. Every Responses request enters the same loop, which decides the next action from state:

  • CallLlm
  • ExecuteTools (gateway-owned tools: MCP, builtins)
  • InterruptForApproval
  • Finish

A request with no MCP tools still enters the loop; it simply never produces an ExecuteTools.

Problem

Agentic control flow in the Responses routers is currently spread across layers:

  1. Loop entry is decided outside the loop. Each surface asks "does this request carry MCP tools?" and chooses between a one-shot path and a tool-loop path. Two paths exist per surface, so history loading, request rebuilding, final response assembly, persistence, and streaming completion are duplicated 3×.
  2. Surface logic and loop logic are intermixed. "What happens next?" is spread across route handlers, history loaders, streaming interceptors, and MCP execution helpers. Adding a behavior (approval continuation, max_tool_calls accounting, visibility rules) creates another loop-external patch point instead of extending one explicit state machine.
  3. Three representations interleave without explicit boundaries. The client ResponsesRequest/ResponsesResponse, the upstream model payload, and the stored response chain used by previous_response_id all flow through the same router code without being clearly separated. This already causes observable parity drift vs OpenAI — e.g. before #1315 (refactor(responses): split load/prepare for canonical agent-loop input) landed, previous_response_id replay silently dropped mcp_list_tools/mcp_call items and re-executed the tool.

Goals

  • Every Responses surface enters the same agent loop.
  • The loop decides the next action from state; router code contains no loop-entry branching.
  • Representation boundaries are explicit:
    • client transcript — ResponsesRequest / ResponsesResponse (API + storage contract)
    • canonical loop transcript — Vec<ResponseInputOutputItem> (shared execution model across surfaces)
    • upstream payload — provider-specific (Responses JSON, ChatCompletionRequest, Harmony pipeline input)
  • Streaming and non-streaming share the driver; surface adapters own only parser-local logic and wire-level event translation.
  • max_tool_calls, approval interrupts, mcp_list_tools dedupe, and hidden-MCP visibility are properties of loop state, not scattered across router files.
  • Landing happens in small, independently reviewable PRs.

Non-goals

  • Rewriting the MCP crate or approval subsystem from scratch.
  • Changing storage schema.
  • Unifying every provider-specific stream parser.
  • Expanding approval workflow behavior (denial policy, etc.) as part of the architecture PRs.
  • Changing public API contracts for Responses requests or responses.

Prior validation

The full plan was implemented end-to-end once as a working prototype, with contract-level OpenAI parity checks for non-streaming and streaming MCP flows, including approval interrupts. A local smg instance ran against real OpenAI using the same test-plan style as #1174 and matched OpenAI's behavior at the contract level. This umbrella re-lands that proven design on main in reviewable pieces.

Target end-to-end flow

Responses request
  |
  +-- validate / select worker
  |
  +-- load history (previous_response_id, conversation, stitched input)
  |
  +-- prepare_agent_loop_input  → canonicalize transcript (single boundary)
  |
  +-- create AgentLoopContext
  |
  \-- run_agent_loop
         |
         +-- NextAction::CallLlm
         +-- NextAction::ExecuteTools
         +-- NextAction::InterruptForApproval
         \-- NextAction::Finish
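
A minimal sketch of the canonicalization boundary follows. Only upstream_input, store, previous_response_id, and conversation are named elsewhere in this plan; every other identifier is an illustrative assumption, not the final API.

```rust
// Illustrative sketch; shapes are assumptions except where this plan
// names them (upstream_input, store, previous_response_id, conversation).

/// Canonical transcript item shared across surfaces. The real
/// ResponseInputOutputItem lives in the protocol crate; only the two
/// variants the invariants below rely on are shown.
#[derive(Clone)]
pub enum ResponseInputOutputItem {
    FunctionCall { id: String, name: String, arguments: String },
    FunctionCallOutput { call_id: String, output: String },
    // ... message / reasoning / other variants elided
}

/// Output of prepare_agent_loop_input: the single boundary where client
/// input plus loaded history become the canonical loop transcript.
pub struct PreparedLoopInput {
    /// Canonical transcript every surface executes against.
    pub upstream_input: Vec<ResponseInputOutputItem>,
    /// Client-intent values restored onto the RequestContext so
    /// persistence and response-metadata patching stay correct.
    pub store: Option<bool>,
    pub previous_response_id: Option<String>,
    pub conversation: Option<String>,
}
```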

NextAction semantics

  • CallLlm — build the next upstream request from the canonical transcript and run one model turn.
  • ExecuteTools(Vec<PlannedToolExecution>) — execute gateway-owned tools only (MCP, builtins). User-defined function tools are not gateway-executed; they remain in the final response.
  • InterruptForApproval(PendingToolExecution) — render an approval interrupt response and return. A first-class loop action, not a special path outside the loop.
  • Finish — render the final response: restore client-facing tool view, inject visible MCP metadata, apply hidden-MCP filtering, patch previous_response_id.
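
Expressed as code, the action type could look like the sketch below; the payload type names come from the list above, while their fields are assumptions for illustration.

```rust
/// Gateway-owned tool call the loop has planned to run (fields assumed).
pub struct PlannedToolExecution {
    pub call_id: String,
    pub server_label: String,
    pub tool_name: String,
    pub arguments: String,
}

/// Tool call parked on a pending user approval (fields assumed).
pub struct PendingToolExecution {
    pub approval_request_id: String,
    pub planned: PlannedToolExecution,
}

/// The loop's next step, decided purely from loop state.
pub enum NextAction {
    /// Build the next upstream request from the canonical transcript
    /// and run one model turn.
    CallLlm,
    /// Execute gateway-owned tools only (MCP, builtins); user-defined
    /// function tools are never gateway-executed.
    ExecuteTools(Vec<PlannedToolExecution>),
    /// Render an approval interrupt response and return to the client.
    InterruptForApproval(PendingToolExecution),
    /// Render the final response and stop.
    Finish,
}
```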

Proposed shared module layout (lands with PR6)

model_gateway/src/routers/common/agent_loop/
  mod.rs
  prepared.rs     # PreparedLoopInput and history-facing types
  state.rs        # AgentLoopState, LoopModelTurn, LoopToolCall, NextAction
  driver.rs       # run_agent_loop() and decide_next_action()
  events.rs       # semantic streaming events for adapters
  tooling.rs      # MCP execution planning and approval continuation helpers

Each surface keeps a thin adapter next to its router:

model_gateway/src/routers/openai/responses/agent_loop_adapter.rs
model_gateway/src/routers/grpc/harmony/responses/agent_loop_adapter.rs
model_gateway/src/routers/grpc/regular/responses/agent_loop_adapter.rs
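
To make the division of labor concrete, here is a hypothetical skeleton of driver.rs. run_agent_loop, decide_next_action, NextAction, AgentLoopContext, and AgentLoopState are named in this plan; LoopOutcome, LoopError, and the helper methods are placeholder assumptions.

```rust
// Hypothetical skeleton; helper methods and outcome types are assumed
// placeholders, not the final API.
pub async fn run_agent_loop(
    ctx: &mut AgentLoopContext,
    mut state: AgentLoopState,
) -> Result<LoopOutcome, LoopError> {
    loop {
        match decide_next_action(&state) {
            NextAction::CallLlm => {
                // One upstream model turn built from the canonical transcript.
                let turn = ctx.call_llm(&state).await?;
                state.record_model_turn(turn);
            }
            NextAction::ExecuteTools(planned) => {
                // Gateway-owned tools only; outputs re-enter the transcript.
                let outputs = ctx.execute_tools(planned).await?;
                state.record_tool_outputs(outputs);
            }
            NextAction::InterruptForApproval(pending) => {
                return Ok(LoopOutcome::Interrupted(pending));
            }
            NextAction::Finish => {
                return Ok(LoopOutcome::Finished(state.into_final_response()));
            }
        }
    }
}

/// Pure decision function: the single place "what happens next" lives.
/// A request with no gateway-owned tools alternates CallLlm -> Finish
/// and never yields ExecuteTools.
pub fn decide_next_action(state: &AgentLoopState) -> NextAction {
    if let Some(pending) = state.pending_approval() {
        return NextAction::InterruptForApproval(pending);
    }
    if state.tool_calls_used >= state.effective_limit {
        // Terminates with incomplete_details.reason = "max_tool_calls".
        return NextAction::Finish;
    }
    let planned = state.unexecuted_gateway_tool_calls();
    if !planned.is_empty() {
        return NextAction::ExecuteTools(planned);
    }
    if state.awaiting_model_turn() {
        return NextAction::CallLlm;
    }
    NextAction::Finish
}
```

Note what is absent: stream sinks and routers never appear in the driver; adapters only translate the semantic events emitted from inside call_llm and execute_tools into wire-level events.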

Key implementation invariants

  • mcp_call always normalizes into function_call + function_call_output before upstream. When a replayed mcp_call carries an error, the error string must be surfaced via function_call_output.output rather than dropped (see the sketch after this list).
  • mcp_list_tools dedupe is keyed on server_label. If an item was already emitted during streaming, the final response.completed payload must reuse the same id.
  • Upstream replay payloads never reintroduce client-visible-only control items (mcp_list_tools, mcp_approval_request).
  • effective_limit = min(user_max_tool_calls, DEFAULT_MAX_ITERATIONS) is a public behavior contract. Approved continuations obey the same budget (also sketched after this list).
  • Incomplete termination (tool-call limit) returns status=completed + incomplete_details.reason="max_tool_calls" for both streaming and non-streaming; streaming terminates with response.completed + [DONE], not a generic error event.
  • Stream sink does event translation only. It does not own loop-control decisions (when to call the model, when to execute tools, when a continuation is valid, when to interrupt).
  • Until shared extraction lands, surface routers must feed the normalized PreparedLoopInput.upstream_input into their RequestContext and restore store, previous_response_id, and conversation on top so persistence and response-metadata patching keep client-intent values.
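
Two of these invariants lend themselves to small sketches, reusing the assumed ResponseInputOutputItem from the flow section above; the McpCallItem shape and the DEFAULT_MAX_ITERATIONS value are illustrative assumptions.

```rust
/// Assumed stored shape of a replayed mcp_call item.
pub struct McpCallItem {
    pub id: String,
    pub name: String,
    pub arguments: String,
    pub output: Option<String>,
    pub error: Option<String>,
}

/// mcp_call always normalizes into function_call + function_call_output
/// before upstream; a carried error is surfaced through the output
/// string rather than dropped.
pub fn normalize_mcp_call(call: &McpCallItem) -> [ResponseInputOutputItem; 2] {
    let output = match &call.error {
        Some(err) => err.clone(), // surface the error, never drop it
        None => call.output.clone().unwrap_or_default(),
    };
    [
        ResponseInputOutputItem::FunctionCall {
            id: call.id.clone(),
            name: call.name.clone(),
            arguments: call.arguments.clone(),
        },
        ResponseInputOutputItem::FunctionCallOutput {
            call_id: call.id.clone(),
            output,
        },
    ]
}

// Placeholder value; the real constant lives in the gateway crate.
const DEFAULT_MAX_ITERATIONS: u32 = 10;

/// effective_limit = min(user_max_tool_calls, DEFAULT_MAX_ITERATIONS).
/// Approved continuations draw from this same budget.
pub fn effective_limit(user_max_tool_calls: Option<u32>) -> u32 {
    user_max_tool_calls
        .map(|n| n.min(DEFAULT_MAX_ITERATIONS))
        .unwrap_or(DEFAULT_MAX_ITERATIONS)
}
```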

Shipping slices

Slice A — OpenAI reference implementation

Slice B — Shared abstractions

Slice C — Other surfaces

Hardening

Validation gate (per PR)

  • cargo test -p openai-protocol --tests
  • cargo test --lib --package smg routers::openai::responses::
  • cargo test --test api_tests -- responses
  • pre-commit run --all-files
  • Manual parity against real OpenAI in the style of #1174's test plan (feat(responses): interrupt approval-required MCP tool calls): start local smg with --enable-igw --port 9999, register OpenAI as an external worker, compare contract-level output (item types, streaming event families, interrupt boundaries, error shapes) between SMG and direct OpenAI.

Implementation issues

Child issues will be linked into the shipping slices above as they are filed. Each PR body should reference Parent: #<this> and Closes #<UAL-PR-NN>.
