
InterruptibleTTSService: _bot_speaking guard causes interruption to fail when TTS audio hasn't reached output transport yet #3986

@yuki901

Description


InterruptibleTTSService._handle_interruption() disconnects and reconnects the WebSocket only when _bot_speaking=True. However, an interruption can arrive before BotStartedSpeakingFrame is emitted, i.e. while TTS audio is still being synthesized or is in transit to the output transport. In that case _bot_speaking is False, the WebSocket is never disconnected, and the TTS server keeps sending audio that gets played after the interruption.

This is a regression from the fix in #950 / PR #1272, which was later generalized into InterruptibleTTSService. The _bot_speaking guard was added as an optimization but introduces a race condition.

Root Cause

In tts_service.py lines 1429-1433:

async def _handle_interruption(self, frame: InterruptionFrame, direction: FrameDirection):
    await super()._handle_interruption(frame, direction)
    if self._bot_speaking:          # <-- This guard is the problem
        await self._disconnect()
        await self._connect()

_bot_speaking is set to True only when BotStartedSpeakingFrame is received (line 1444), which happens when audio reaches the output transport and starts playing. If the user interrupts before that point, the guard fails and the WebSocket stays connected.
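The race is easy to reproduce in isolation. The sketch below mocks just the guard logic; GuardedTTS and FakeWebSocket are illustrative stand-ins, not pipecat classes. An interruption that arrives while _bot_speaking is still False leaves the original socket untouched, so audio already queued server-side keeps arriving:

```python
import asyncio

class FakeWebSocket:
    """Stand-in for the provider connection; tracks whether it is open."""
    def __init__(self):
        self.connected = True

class GuardedTTS:
    """Mock of the guard logic only (not pipecat's InterruptibleTTSService)."""
    def __init__(self):
        # Becomes True only once BotStartedSpeakingFrame is received.
        self._bot_speaking = False
        self._ws = FakeWebSocket()

    async def _disconnect(self):
        self._ws.connected = False

    async def _connect(self):
        self._ws = FakeWebSocket()

    async def handle_interruption(self):
        if self._bot_speaking:          # the guard under discussion
            await self._disconnect()
            await self._connect()

async def main():
    tts = GuardedTTS()
    old_ws = tts._ws
    # User interrupts BEFORE any BotStartedSpeakingFrame was seen:
    await tts.handle_interruption()
    # The original socket was never torn down, so in-flight audio still plays.
    print(old_ws is tts._ws and old_ws.connected)   # → True

asyncio.run(main())
```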

Affected Services

All TTS services inheriting from InterruptibleTTSService that don't override on_audio_context_interrupted():

  • Fish Audio (fish/tts.py)
  • LMNT (lmnt/tts.py)
  • Neuphonic (neuphonic/tts.py)
  • Sarvam (sarvam/tts.py)

Services NOT Affected

These services correctly override on_audio_context_interrupted() to cancel server-side synthesis regardless of _bot_speaking state:

  • ElevenLabs — sends close_context message via WebSocket
  • Rime — sends clear message to server
  • Deepgram — sends {"type": "Clear"} to server
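The pattern these three services share can be sketched with a mock. ClearingTTS and FakeWebSocket below are illustrative stand-ins; only the {"type": "Clear"} payload is taken from the Deepgram example above. The point is that the cancel message is sent without consulting _bot_speaking:

```python
import asyncio
import json

class FakeWebSocket:
    """Stand-in for a provider WebSocket; records outbound messages."""
    def __init__(self):
        self.sent = []

    async def send(self, message: str):
        self.sent.append(message)

class ClearingTTS:
    """Illustrative service that cancels server-side synthesis on interruption,
    mirroring the Deepgram-style Clear message. Not pipecat code."""
    def __init__(self):
        self._bot_speaking = False
        self._ws = FakeWebSocket()

    async def on_audio_context_interrupted(self):
        # Runs unconditionally, so synthesis is cancelled even when the
        # interruption lands before any audio has reached the transport.
        await self._ws.send(json.dumps({"type": "Clear"}))

async def main():
    tts = ClearingTTS()
    await tts.on_audio_context_interrupted()   # interrupt pre-playback
    print(tts._ws.sent)                        # → ['{"type": "Clear"}']

asyncio.run(main())
```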

Steps to Reproduce

  1. Configure a pipeline with any affected TTS service (e.g., Fish Audio) and allow_interruptions=True
  2. Send a prompt that triggers a multi-sentence LLM response
  3. Speak (trigger UserStartedSpeakingFrame) before the first sentence's TTS audio starts playing (before BotStartedSpeakingFrame)
  4. Observe that InterruptionFrame fires, but the TTS audio still arrives and plays afterward

Log Evidence

49.216  [OUTPUT] UserStartedSpeakingFrame DOWNSTREAM    ← User interrupts
49.237  [OUTPUT] InterruptionFrame DOWNSTREAM            ← Interruption fires
49.239  on_assistant_turn_stopped                        ← LLM response complete
49.419  [OUTPUT] BotStartedSpeakingFrame DOWNSTREAM      ← Audio plays AFTER interruption (bug)

Note: BotStartedSpeakingFrame (49.419) occurs after InterruptionFrame (49.237), meaning _bot_speaking was False when _handle_interruption ran.

Suggested Fix

Option A: Remove the _bot_speaking guard so the service always disconnects and reconnects on interruption:

async def _handle_interruption(self, frame: InterruptionFrame, direction: FrameDirection):
    await super()._handle_interruption(frame, direction)
    await self._disconnect()
    await self._connect()

Option B: Override on_audio_context_interrupted() in each affected service (similar to ElevenLabs/Rime/Deepgram) to cancel server-side synthesis. This is called from super()._handle_interruption() regardless of _bot_speaking.
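A sketch of where the Option B hook sits, assuming the call order described above (the base invokes the hook regardless of _bot_speaking). FakeBase and PatchedFishAudioLikeTTS are hypothetical stand-ins, and the actual Fish Audio cancel message would need to be confirmed against its wire protocol:

```python
import asyncio

class FakeBase:
    """Mimics the relevant call order in the interruption path; not pipecat code."""
    def __init__(self):
        self._bot_speaking = False
        self.events = []

    async def _handle_interruption(self):
        # The hook fires unconditionally, before any guarded reconnect logic.
        await self.on_audio_context_interrupted()
        if self._bot_speaking:
            self.events.append("disconnect/reconnect")

    async def on_audio_context_interrupted(self):
        pass  # default: no server-side cancellation

class PatchedFishAudioLikeTTS(FakeBase):
    """Option B applied to an affected service (hypothetical override)."""
    async def on_audio_context_interrupted(self):
        # Placeholder for the provider-specific cancel/clear message.
        self.events.append("sent cancel message")

async def main():
    tts = PatchedFishAudioLikeTTS()
    await tts._handle_interruption()   # interruption with _bot_speaking=False
    print(tts.events)                  # → ['sent cancel message']

asyncio.run(main())
```

Because the override runs even when _bot_speaking is False, this closes the race without the cost of an unconditional disconnect/reconnect.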

Environment

  • Pipecat version: 0.0.dev8273
  • TTS: Fish Audio (FishAudioTTSService)
  • Transport: Twilio Media Streams (WebSocket) via Daily
  • Python: 3.12
