FEAT: Realtime streaming session support and server-side barge-in attack #1766
adrian-gavrila wants to merge 25 commits into
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t API
…ardown
…nc rename, Optional→union
…tion)
…me-server-vad # Conflicts: # pyrit/prompt_target/openai/openai_realtime_target.py
…imitive
… inline drive_response
    raw_buffer: bytearray = field(default_factory=bytearray)
    turn_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    last_assistant_message: Message | None = None
    executed_turns: int = 0
I think `last_assistant_message` and `executed_turns` should be in the context instead, so the state only contains variables related to the async functionality.
    #: Maximum time to wait after the chunk source exhausts for any in-flight VAD-committed
    #: turn to finish (commit → convert → response.create → response.done → persist). Acts as
    #: a safety cap; the attack returns as soon as the last turn actually completes.
    _MAX_POST_STREAM_WAIT_SECONDS = 30.0
Is it valuable to make this configurable?
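For illustration only, one way a configurable cap could look: the current 30-second value stays as the default and callers may override it. The function name `wait_for_last_turn` and the parameter name `max_post_stream_wait_seconds` are hypothetical, and an `asyncio.Event` stands in for whatever signal marks the last turn as complete.

```python
import asyncio

_DEFAULT_MAX_POST_STREAM_WAIT_SECONDS = 30.0


async def wait_for_last_turn(
    turn_done: asyncio.Event,
    max_post_stream_wait_seconds: float = _DEFAULT_MAX_POST_STREAM_WAIT_SECONDS,
) -> bool:
    """Return True if the in-flight turn finished before the cap, else False."""
    try:
        # Returns as soon as the event is set; the timeout is only a safety cap.
        await asyncio.wait_for(turn_done.wait(), timeout=max_post_stream_wait_seconds)
        return True
    except asyncio.TimeoutError:
        return False
```

Tests could then pass a very small cap instead of patching a module-level constant.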
    logger = logging.getLogger(__name__)

    _REALTIME_SAMPLE_RATE_HZ = 24000
This is specific to the OpenAI realtime target, so we should probably make it configurable.
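A rough sketch of what a configurable rate might look like, assuming a hypothetical `AudioFormat` settings object that each target could supply. The 24000 Hz default mirrors the current constant; the 16-bit mono PCM layout is an assumption made for the example, not something stated in the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    """Hypothetical per-target audio settings (not a PyRIT type)."""

    sample_rate_hz: int = 24000  # mirrors _REALTIME_SAMPLE_RATE_HZ
    sample_width_bytes: int = 2  # assumption: 16-bit PCM
    channels: int = 1  # assumption: mono

    def bytes_for(self, duration_ms: int) -> int:
        """Number of PCM bytes covering duration_ms of audio."""
        frames = self.sample_rate_hz * duration_ms // 1000
        return frames * self.sample_width_bytes * self.channels
```

A target with a different native rate would then pass, for example, `AudioFormat(sample_rate_hz=16000)` instead of relying on a module constant.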
I'm a little concerned with how this attack deviates from the other attacks in that it doesn't use the `send_prompt_async` workflow. The attack does a lot of plumbing that other attacks delegate to `send_async`: audio normalization (which bypasses the pipeline), the swap + response trigger, message construction, and memory persistence. That makes it inconsistent with the standard "build prompt → send_prompt_async → response" workflow and tightly couples it to `RealtimeTarget` internals. Could we push the streaming session work into `RealtimeTarget.send_async` so the attack just does: Connect + subscribe
Description
Adds persistent streaming session support to `OpenAIRealtimeTarget` and introduces `BargeInAttack`, a streaming attack that leverages server-side VAD to detect and exploit barge-in (interruption) behavior. Previously the target only supported single-turn fire-and-forget audio exchanges; this PR adds the transport primitives needed for multi-turn streaming sessions with incremental audio push, event subscription, and mid-session response requests.

When the server detects new user speech while the assistant is still responding, the in-flight response is automatically interrupted and the conversation history is truncated to match what was actually delivered.
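To make the truncation behavior concrete, here is a local illustration only. In the real session the server performs the interruption and truncation; `AssistantTurn` and `truncate_on_barge_in` are hypothetical names invented for this sketch and are not part of the PR.

```python
from dataclasses import dataclass


@dataclass
class AssistantTurn:
    """Hypothetical record of one assistant response (not a PyRIT type)."""

    audio: bytes
    interrupted: bool = False


def truncate_on_barge_in(turn: AssistantTurn, delivered_bytes: int) -> AssistantTurn:
    """Keep only the audio that was delivered before the user started speaking."""
    if delivered_bytes >= len(turn.audio):
        # The response finished before the interruption; nothing to truncate.
        return turn
    return AssistantTurn(audio=turn.audio[:delivered_bytes], interrupted=True)
```

The point of the sketch is that the stored history reflects what the user actually heard, with the interruption recorded alongside it.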
Key additions:
- `OpenAIRealtimeTarget` streaming primitives: `connect_async`, `push_audio_chunk_async`, `insert_user_audio_async`, `subscribe_events_async`, `request_response_async`, `send_streaming_session_config_async`. These expose transport-level operations over a persistent WebSocket connection.
- `_RealtimeEventDispatcher`: ABC that owns a realtime connection's event stream, routes provider-specific events to the active turn, and fires an `on_user_audio_committed` callback when server VAD finalizes a turn. Provider-specific routing is isolated to the `_route_event` / `_cancel` abstract methods.
- `BargeInAttack`: streaming attack that pushes audio chunks into a persistent session, applies configured converters on each server-committed turn (convert-on-commit), requests responses, and tracks interruptions. Per-turn `Message` pairs are persisted to `CentralMemory` with `prompt_metadata["interrupted"] = True` on interrupted turns.
- `ServerVadConfig` / `RealtimeTargetResult`: shared types for configuring server VAD and representing turn results (audio, transcripts, interruption flag).
- `PromptNormalizer.convert_audio_async`: applies audio converter configurations to raw PCM bytes for streaming attacks that hold audio mid-turn rather than a `Message`.

The target exposes only transport primitives; all attack logic (buffering, convert-on-commit dance, interruption signaling) lives in `BargeInAttack`.

Tests and Documentation
- `realtime_audio.py`, 72% on `openai_realtime_target.py` (uncovered lines are pre-existing code paths, not new additions).
- `doc/code/executor/attack/barge_in_attack.py` (jupytext py:percent format) demonstrates the attack against a live OpenAI Realtime API endpoint with server VAD. Ran successfully against `gpt-4o-realtime-preview`; outputs cleared for CI (requires live credentials).
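As a companion to the `_RealtimeEventDispatcher` entry under Key additions, the following sketch shows the general pattern of an abstract dispatcher with provider-specific routing. It is not the PR's class: `EventDispatcherSketch`, `OpenAIStyleDispatcher`, and `dispatch` are hypothetical, `_route_event` is given a simplified boolean signature, and treating `input_audio_buffer.committed` as the commit signal is an assumption about which server event the real code keys on.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Awaitable, Callable


class EventDispatcherSketch(ABC):
    """Hypothetical stand-in for the dispatcher idea; not the PR's class."""

    def __init__(self, on_user_audio_committed: Callable[[], Awaitable[None]]) -> None:
        self._on_user_audio_committed = on_user_audio_committed

    async def dispatch(self, event: dict) -> None:
        # The subclass decides whether this event means a user turn was committed.
        if self._route_event(event):
            await self._on_user_audio_committed()

    @abstractmethod
    def _route_event(self, event: dict) -> bool:
        """Return True when the event means server VAD committed a user turn."""


class OpenAIStyleDispatcher(EventDispatcherSketch):
    def _route_event(self, event: dict) -> bool:
        # Assumption: the commit is signaled by this server event type.
        return event.get("type") == "input_audio_buffer.committed"
```

Keeping the provider check in one overridable method means the commit callback, which is where an attack would run convert-on-commit, never needs to know the wire format.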