Unified agent loop for Responses API: design and implementation plan #1316

Description

Summary

Replace the three surface-specific Responses agent loops (OpenAI non-streaming + streaming, gRPC regular, gRPC harmony) with one shared agent loop used by every surface. Every Responses request enters the same loop, which decides the next action from state:

  • CallLlm
  • ExecuteTools (gateway-owned tools: MCP, builtins)
  • InterruptForApproval
  • Finish

A request with no MCP tools still enters the loop; it simply never produces an ExecuteTools.

Problem

Agentic control flow in the Responses routers is currently spread across layers:

  1. Loop entry is decided outside the loop. Each surface asks "does this request carry MCP tools?" and chooses between a one-shot path and a tool-loop path. Two paths exist per surface, so history loading, request rebuilding, final response assembly, persistence, and streaming completion are duplicated 3×.
  2. Surface logic and loop logic are intermixed. "What happens next?" is spread across route handlers, history loaders, streaming interceptors, and MCP execution helpers. Adding a behavior (approval continuation, max_tool_calls accounting, visibility rules) creates another loop-external patch point instead of extending one explicit state machine.
  3. Three representations interleave without explicit boundaries. The client ResponsesRequest/ResponsesResponse, the upstream model payload, and the stored response chain used by previous_response_id all flow through the same router code without being clearly separated. This already causes observable parity drift vs OpenAI — e.g. before #1315 (refactor(responses): split load/prepare for canonical agent-loop input) landed, previous_response_id replay silently dropped mcp_list_tools/mcp_call items and re-executed the tool.

Goals

  • Every Responses surface enters the same agent loop.
  • The loop decides the next action from state; router code contains no loop-entry branching.
  • Representation boundaries are explicit:
    • client transcript — ResponsesRequest / ResponsesResponse (API + storage contract)
    • canonical loop transcript — Vec<ResponseInputOutputItem> (shared execution model across surfaces)
    • upstream payload — provider-specific (Responses JSON, ChatCompletionRequest, Harmony pipeline input)
  • Streaming and non-streaming share the driver; surface adapters own only parser-local logic and wire-level event translation.
  • max_tool_calls, approval interrupts, mcp_list_tools dedupe, and hidden-MCP visibility are properties of loop state, not scattered across router files.
  • Landing happens in small, independently reviewable PRs.

Non-goals

  • Rewriting the MCP crate or approval subsystem from scratch.
  • Changing storage schema.
  • Unifying every provider-specific stream parser.
  • Expanding approval workflow behavior (denial policy, etc.) as part of the architecture PRs.
  • Changing public API contracts for Responses requests or responses.

Prior validation

The full plan was implemented end-to-end once as a working prototype, with contract-level OpenAI parity checks for non-streaming and streaming MCP flows, including approval interrupts. A local smg instance ran against real OpenAI using the same test-plan style as #1174 and matched OpenAI's behavior at the contract level. This umbrella re-lands that proven design on main in reviewable pieces.

Target end-to-end flow

Responses request
  |
  +-- validate / select worker
  |
  +-- load history (previous_response_id, conversation, stitched input)
  |
  +-- prepare_agent_loop_input  → canonicalize transcript (single boundary)
  |
  +-- create AgentLoopContext
  |
  \-- run_agent_loop
         |
         +-- NextAction::CallLlm
         +-- NextAction::ExecuteTools
         +-- NextAction::InterruptForApproval
         \-- NextAction::Finish
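
A minimal sketch of the canonicalization boundary follows. Only upstream_input, store, previous_response_id, and conversation are named elsewhere in this plan; every other identifier is an illustrative assumption, not the final API.

```rust
// Illustrative sketch; shapes are assumptions except where this plan
// names them (upstream_input, store, previous_response_id, conversation).

/// Canonical transcript item shared across surfaces. The real
/// ResponseInputOutputItem lives in the protocol crate; only the two
/// variants the invariants below rely on are shown.
#[derive(Clone)]
pub enum ResponseInputOutputItem {
    FunctionCall { id: String, name: String, arguments: String },
    FunctionCallOutput { call_id: String, output: String },
    // ... message / reasoning / other variants elided
}

/// Output of prepare_agent_loop_input: the single boundary where client
/// input plus loaded history become the canonical loop transcript.
pub struct PreparedLoopInput {
    /// Canonical transcript every surface executes against.
    pub upstream_input: Vec<ResponseInputOutputItem>,
    /// Client-intent values restored onto the RequestContext so
    /// persistence and response-metadata patching stay correct.
    pub store: Option<bool>,
    pub previous_response_id: Option<String>,
    pub conversation: Option<String>,
}
```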

NextAction semantics

  • CallLlm — build the next upstream request from the canonical transcript and run one model turn.
  • ExecuteTools(Vec<PlannedToolExecution>) — execute gateway-owned tools only (MCP, builtins). User-defined function tools are not gateway-executed; they remain in the final response.
  • InterruptForApproval(PendingToolExecution) — render an approval interrupt response and return. A first-class loop action, not a special path outside the loop.
  • Finish — render the final response: restore client-facing tool view, inject visible MCP metadata, apply hidden-MCP filtering, patch previous_response_id.
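
Expressed as code, the action type could look like the sketch below; the payload type names come from the list above, while their fields are assumptions for illustration.

```rust
/// Gateway-owned tool call the loop has planned to run (fields assumed).
pub struct PlannedToolExecution {
    pub call_id: String,
    pub server_label: String,
    pub tool_name: String,
    pub arguments: String,
}

/// Tool call parked on a pending user approval (fields assumed).
pub struct PendingToolExecution {
    pub approval_request_id: String,
    pub planned: PlannedToolExecution,
}

/// The loop's next step, decided purely from loop state.
pub enum NextAction {
    /// Build the next upstream request from the canonical transcript
    /// and run one model turn.
    CallLlm,
    /// Execute gateway-owned tools only (MCP, builtins); user-defined
    /// function tools are never gateway-executed.
    ExecuteTools(Vec<PlannedToolExecution>),
    /// Render an approval interrupt response and return to the client.
    InterruptForApproval(PendingToolExecution),
    /// Render the final response and stop.
    Finish,
}
```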

Proposed shared module layout (lands with PR6)

model_gateway/src/routers/common/agent_loop/
  mod.rs
  prepared.rs     # PreparedLoopInput and history-facing types
  state.rs        # AgentLoopState, LoopModelTurn, LoopToolCall, NextAction
  driver.rs       # run_agent_loop() and decide_next_action()
  events.rs       # semantic streaming events for adapters
  tooling.rs      # MCP execution planning and approval continuation helpers

Each surface keeps a thin adapter next to its router:

model_gateway/src/routers/openai/responses/agent_loop_adapter.rs
model_gateway/src/routers/grpc/harmony/responses/agent_loop_adapter.rs
model_gateway/src/routers/grpc/regular/responses/agent_loop_adapter.rs
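
To make the division of labor concrete, here is a hypothetical skeleton of driver.rs. run_agent_loop, decide_next_action, NextAction, AgentLoopContext, and AgentLoopState are named in this plan; LoopOutcome, LoopError, and the helper methods are placeholder assumptions.

```rust
// Hypothetical skeleton; helper methods and outcome types are assumed
// placeholders, not the final API.
pub async fn run_agent_loop(
    ctx: &mut AgentLoopContext,
    mut state: AgentLoopState,
) -> Result<LoopOutcome, LoopError> {
    loop {
        match decide_next_action(&state) {
            NextAction::CallLlm => {
                // One upstream model turn built from the canonical transcript.
                let turn = ctx.call_llm(&state).await?;
                state.record_model_turn(turn);
            }
            NextAction::ExecuteTools(planned) => {
                // Gateway-owned tools only; outputs re-enter the transcript.
                let outputs = ctx.execute_tools(planned).await?;
                state.record_tool_outputs(outputs);
            }
            NextAction::InterruptForApproval(pending) => {
                return Ok(LoopOutcome::Interrupted(pending));
            }
            NextAction::Finish => {
                return Ok(LoopOutcome::Finished(state.into_final_response()));
            }
        }
    }
}

/// Pure decision function: the single place "what happens next" lives.
/// A request with no gateway-owned tools alternates CallLlm -> Finish
/// and never yields ExecuteTools.
pub fn decide_next_action(state: &AgentLoopState) -> NextAction {
    if let Some(pending) = state.pending_approval() {
        return NextAction::InterruptForApproval(pending);
    }
    if state.tool_calls_used >= state.effective_limit {
        // Terminates with incomplete_details.reason = "max_tool_calls".
        return NextAction::Finish;
    }
    let planned = state.unexecuted_gateway_tool_calls();
    if !planned.is_empty() {
        return NextAction::ExecuteTools(planned);
    }
    if state.awaiting_model_turn() {
        return NextAction::CallLlm;
    }
    NextAction::Finish
}
```

Note what is absent: stream sinks and routers never appear in the driver; adapters only translate the semantic events emitted from inside call_llm and execute_tools into wire-level events.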

Key implementation invariants

  • mcp_call always normalizes into function_call + function_call_output before upstream. When a replayed mcp_call carries an error, the error string must be surfaced via function_call_output.output rather than dropped (see the sketch after this list).
  • mcp_list_tools dedupe is keyed on server_label. If an item was already emitted during streaming, the final response.completed payload must reuse the same id.
  • Upstream replay payloads never reintroduce client-visible-only control items (mcp_list_tools, mcp_approval_request).
  • effective_limit = min(user_max_tool_calls, DEFAULT_MAX_ITERATIONS) is a public behavior contract. Approved continuations obey the same budget (also sketched after this list).
  • Incomplete termination (tool-call limit) returns status=completed + incomplete_details.reason="max_tool_calls" for both streaming and non-streaming; streaming terminates with response.completed + [DONE], not a generic error event.
  • Stream sink does event translation only. It does not own loop-control decisions (when to call the model, when to execute tools, when a continuation is valid, when to interrupt).
  • Until shared extraction lands, surface routers must feed the normalized PreparedLoopInput.upstream_input into their RequestContext and restore store, previous_response_id, and conversation on top so persistence and response-metadata patching keep client-intent values.
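
Two of these invariants lend themselves to small sketches, reusing the assumed ResponseInputOutputItem from the flow section above; the McpCallItem shape and the DEFAULT_MAX_ITERATIONS value are illustrative assumptions.

```rust
/// Assumed stored shape of a replayed mcp_call item.
pub struct McpCallItem {
    pub id: String,
    pub name: String,
    pub arguments: String,
    pub output: Option<String>,
    pub error: Option<String>,
}

/// mcp_call always normalizes into function_call + function_call_output
/// before upstream; a carried error is surfaced through the output
/// string rather than dropped.
pub fn normalize_mcp_call(call: &McpCallItem) -> [ResponseInputOutputItem; 2] {
    let output = match &call.error {
        Some(err) => err.clone(), // surface the error, never drop it
        None => call.output.clone().unwrap_or_default(),
    };
    [
        ResponseInputOutputItem::FunctionCall {
            id: call.id.clone(),
            name: call.name.clone(),
            arguments: call.arguments.clone(),
        },
        ResponseInputOutputItem::FunctionCallOutput {
            call_id: call.id.clone(),
            output,
        },
    ]
}

// Placeholder value; the real constant lives in the gateway crate.
const DEFAULT_MAX_ITERATIONS: u32 = 10;

/// effective_limit = min(user_max_tool_calls, DEFAULT_MAX_ITERATIONS).
/// Approved continuations draw from this same budget.
pub fn effective_limit(user_max_tool_calls: Option<u32>) -> u32 {
    user_max_tool_calls
        .map(|n| n.min(DEFAULT_MAX_ITERATIONS))
        .unwrap_or(DEFAULT_MAX_ITERATIONS)
}
```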

Shipping slices

Slice A — OpenAI reference implementation

Slice B — Shared abstractions

Slice C — Other surfaces

Hardening

Validation gate (per PR)

  • cargo test -p openai-protocol --tests
  • cargo test --lib --package smg routers::openai::responses::
  • cargo test --test api_tests -- responses
  • pre-commit run --all-files
  • Manual parity against real OpenAI in the style of #1174's test plan (feat(responses): interrupt approval-required MCP tool calls): start local smg with --enable-igw --port 9999, register OpenAI as an external worker, compare contract-level output (item types, streaming event families, interrupt boundaries, error shapes) between SMG and direct OpenAI.

Implementation issues

Child issues will be linked into the shipping slices above as they are filed. Each PR body should reference Parent: #<this> and Closes #<UAL-PR-NN>.
