TurnAnalyzerUserTurnStopStrategy ignores InterimTranscriptionFrame, causing deadlock when STT delays finalized transcriptions #3988

@OlliTapio

Summary

TurnAnalyzerUserTurnStopStrategy only processes TranscriptionFrame (finalized) to set _text, but ignores InterimTranscriptionFrame. When an STT service delays finalized transcriptions for short utterances until more speech arrives, the stop strategy can never trigger — _turn_complete gets set by VAD stop, but _text stays empty.

This causes a deadlock where the bot goes silent for as long as the user stays quiet.

Reproduction scenario

  1. User says "I need help with my order" → turn completes → LLM triggers
  2. User says "Billing." (short word) while LLM is generating → InterimTranscriptionFrame("Billing") arrives
  3. TranscriptionUserTurnStartStrategy fires (from the interim) → new turn starts, LLM interrupted
  4. VAD stop fires ~0.4s later → TurnAnalyzerUserTurnStopStrategy sets _turn_complete = True
  5. But _text is still "" because process_frame only routes TranscriptionFrame to _handle_transcription, which is the only place _text gets set:
# turn_analyzer_user_turn_stop_strategy.py line 106
elif isinstance(frame, TranscriptionFrame):
    await self._handle_transcription(frame)
  6. _maybe_trigger_user_turn_stopped() returns early:
# line 213
if not self._text or not self._turn_complete:
    return
  7. The STT holds the finalized transcription for "Billing." until the next sentence arrives, which could be 15+ seconds later
  8. Bot sits silent until the user speaks again
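
The sequence above can be replayed against a toy model of the stop strategy. This is only a sketch: the frame classes and ToyStopStrategy below are stand-ins that reuse pipecat's names for readability, they are not the real pipecat classes, and the model keeps nothing but the two flags and the early return quoted in the steps.

```python
import asyncio
from dataclasses import dataclass


# Stand-in frame types (assumption: toy classes, not imported from pipecat).
@dataclass
class TextFrame:
    text: str = ""


class TranscriptionFrame(TextFrame):
    pass


class InterimTranscriptionFrame(TextFrame):
    pass


class VADUserStoppedSpeakingFrame:
    pass


class ToyStopStrategy:
    """Keeps only the state that matters for this issue."""

    def __init__(self):
        self._text = ""
        self._turn_complete = False
        self.stopped = False

    async def process_frame(self, frame):
        if isinstance(frame, VADUserStoppedSpeakingFrame):
            self._turn_complete = True
            await self._maybe_trigger_user_turn_stopped()
        elif isinstance(frame, TranscriptionFrame):
            # Only finalized transcriptions populate _text.
            self._text = frame.text.strip()
            await self._maybe_trigger_user_turn_stopped()
        # InterimTranscriptionFrame falls through and is ignored.

    async def _maybe_trigger_user_turn_stopped(self):
        if not self._text or not self._turn_complete:
            return
        self.stopped = True


async def replay(frames):
    strategy = ToyStopStrategy()
    for frame in frames:
        await strategy.process_frame(frame)
    return strategy


# Steps 2 to 6: an interim arrives, VAD stops, no finalized transcription yet.
stuck = asyncio.run(
    replay([InterimTranscriptionFrame("Billing"), VADUserStoppedSpeakingFrame()])
)
print(stuck.stopped)  # False

# Steps 7 and 8: the turn only ends once the late finalized transcription arrives.
late = asyncio.run(
    replay(
        [
            InterimTranscriptionFrame("Billing"),
            VADUserStoppedSpeakingFrame(),
            TranscriptionFrame("Billing. Hello?"),
        ]
    )
)
print(late.stopped)  # True
```

In the first replay the strategy ends with _turn_complete set and _text empty, which is exactly the state the early return never gets out of.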

Root cause

InterimTranscriptionFrame and TranscriptionFrame are separate classes (both inherit from TextFrame), so isinstance(frame, TranscriptionFrame) doesn't match interims. The stop strategy has no way to populate _text from interim transcriptions.
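
To illustrate the isinstance behavior in isolation, here is a minimal sketch with stand-in classes that mirror the hierarchy described above (assumption: these are empty toy classes, not the pipecat definitions):

```python
# Stand-ins that copy only the inheritance shape: two siblings under TextFrame.
class TextFrame:
    pass


class TranscriptionFrame(TextFrame):
    pass


class InterimTranscriptionFrame(TextFrame):
    pass


interim = InterimTranscriptionFrame()

# The interim is a TextFrame, but it is not a TranscriptionFrame,
# so a branch that checks only for TranscriptionFrame never sees it.
print(isinstance(interim, TextFrame))           # True
print(isinstance(interim, TranscriptionFrame))  # False
```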

STT behavior context

Some STT services don't finalize short utterances immediately. A single word like "Billing." may only produce an InterimTranscriptionFrame, with the finalized TranscriptionFrame arriving only when the user speaks again (e.g., "Billing. Hello?" as a single finalized transcription 15+ seconds later).

This may also be an STT configuration issue on our side — any guidance on expected STT finalization behavior would be helpful.

Suggested fix

Have TurnAnalyzerUserTurnStopStrategy.process_frame also handle InterimTranscriptionFrame to set _text, similar to how it handles TranscriptionFrame:

elif isinstance(frame, InterimTranscriptionFrame):
    text = frame.text.strip()
    if text:
        self._text = text
        await self._maybe_trigger_user_turn_stopped()
        # Fallback path for no-VAD-stop scenario
        if not self._vad_user_speaking and self._vad_stopped_time is None:
            self._turn_complete = True
            ...
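
As a rough check of the idea, the toy model below compares the current behavior with a variant that also accepts interims. It is a sketch under assumptions: the classes are stand-ins rather than pipecat's, the use_interim flag is hypothetical and exists only for the comparison, and the no-VAD-stop fallback from the snippet above is left out because its remainder is elided.

```python
import asyncio
from dataclasses import dataclass


# Stand-in frame types (assumption: toy classes, not imported from pipecat).
@dataclass
class TextFrame:
    text: str = ""


class TranscriptionFrame(TextFrame):
    pass


class InterimTranscriptionFrame(TextFrame):
    pass


class VADUserStoppedSpeakingFrame:
    pass


class ToyStopStrategy:
    def __init__(self, use_interim: bool):
        self._use_interim = use_interim  # hypothetical switch for this demo
        self._text = ""
        self._turn_complete = False
        self.stopped_with = None  # text the turn ended with, if it ended

    async def process_frame(self, frame):
        if isinstance(frame, VADUserStoppedSpeakingFrame):
            self._turn_complete = True
        elif isinstance(frame, TranscriptionFrame):
            self._text = frame.text.strip()
        elif self._use_interim and isinstance(frame, InterimTranscriptionFrame):
            text = frame.text.strip()
            if text:
                self._text = text
        else:
            return
        await self._maybe_trigger_user_turn_stopped()

    async def _maybe_trigger_user_turn_stopped(self):
        if not self._text or not self._turn_complete:
            return
        if self.stopped_with is None:
            self.stopped_with = self._text


async def simulate(use_interim: bool):
    strategy = ToyStopStrategy(use_interim)
    # Interim for the short utterance, then VAD stop, and no finalized frame.
    for frame in (InterimTranscriptionFrame("Billing"), VADUserStoppedSpeakingFrame()):
        await strategy.process_frame(frame)
    return strategy


print(asyncio.run(simulate(use_interim=False)).stopped_with)  # None
print(asyncio.run(simulate(use_interim=True)).stopped_with)   # Billing
```

One trade-off this sketch does not address: an interim can differ from the finalized text that arrives later, so a real change would need to decide what happens when the late TranscriptionFrame shows up after the turn has already been stopped.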

Workaround

We currently work around this by setting TranscriptionUserTurnStartStrategy(enable_interruptions=False) and increasing VAD stop_secs to give users more time to finish multi-word answers within a single turn.

Environment

  • pipecat 0.0.104
  • Turn analyzer: LocalSmartTurnAnalyzerV3 with SmartTurnParams(stop_secs=2)
