Description
Summary
TurnAnalyzerUserTurnStopStrategy only processes TranscriptionFrame (finalized) to set _text, but ignores InterimTranscriptionFrame. When an STT service delays finalized transcriptions for short utterances until more speech arrives, the stop strategy can never trigger — _turn_complete gets set by VAD stop, but _text stays empty.
This causes a deadlock where the bot goes silent for as long as the user stays quiet.
Reproduction scenario
- User says "I need help with my order" → turn completes → LLM triggers
- User says "Billing." (short word) while LLM is generating → InterimTranscriptionFrame("Billing") arrives
- TranscriptionUserTurnStartStrategy fires (from the interim) → new turn starts, LLM interrupted
- VAD stop fires ~0.4s later → TurnAnalyzerUserTurnStopStrategy sets _turn_complete = True
- But _text is still "" because _handle_transcription only processes TranscriptionFrame:

    # turn_analyzer_user_turn_stop_strategy.py line 106
    elif isinstance(frame, TranscriptionFrame):
        await self._handle_transcription(frame)

- _maybe_trigger_user_turn_stopped() returns early:

    # line 213
    if not self._text or not self._turn_complete:
        return

- The STT holds the finalized transcription for "Billing." until the next sentence arrives, which could be 15+ seconds later
- Bot sits silent until the user speaks again
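The sequence above can be reproduced with a toy model of the stop strategy. This is a minimal sketch, not pipecat's code: the frame classes and `ToyStopStrategy` are hypothetical stand-ins that copy only the two checks quoted above (the `isinstance` dispatch and the early return), and `on_vad_stop` is an assumed name for whatever handles the VAD stop event.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the pipecat frame classes (not the real ones).
@dataclass
class TextFrame:
    text: str


@dataclass
class TranscriptionFrame(TextFrame):
    pass


@dataclass
class InterimTranscriptionFrame(TextFrame):
    pass


class ToyStopStrategy:
    """Toy model of the two checks quoted in the steps above."""

    def __init__(self) -> None:
        self._text = ""
        self._turn_complete = False
        self.turn_stopped = False

    def process_frame(self, frame: TextFrame) -> None:
        # Only finalized transcriptions populate _text; interims fall through.
        if isinstance(frame, TranscriptionFrame):
            self._text = frame.text
        self._maybe_trigger_user_turn_stopped()

    def on_vad_stop(self) -> None:
        # Assumed hook name: marks the turn complete when VAD reports silence.
        self._turn_complete = True
        self._maybe_trigger_user_turn_stopped()

    def _maybe_trigger_user_turn_stopped(self) -> None:
        # Same guard as the quoted early return: both conditions are required.
        if not self._text or not self._turn_complete:
            return
        self.turn_stopped = True


strategy = ToyStopStrategy()
strategy.process_frame(InterimTranscriptionFrame("Billing"))  # interim only
strategy.on_vad_stop()  # VAD stop arrives shortly after
stuck = not strategy.turn_stopped  # turn is complete, but _text is empty

# The finalized frame only arrives once the user speaks again.
strategy.process_frame(TranscriptionFrame("Billing. Hello?"))
print(stuck, strategy.turn_stopped)
```

In this model the turn only stops once the delayed finalized frame shows up, which matches the silence we observe in production.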
Root cause
InterimTranscriptionFrame and TranscriptionFrame are separate classes (both inherit from TextFrame), so isinstance(frame, TranscriptionFrame) doesn't match interims. The stop strategy has no way to populate _text from interim transcriptions.
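To illustrate the class relationship, here is a small standalone sketch. The three classes are hypothetical stand-ins that mirror only the inheritance described above; they are not imported from pipecat.

```python
# Hypothetical stand-ins mirroring the inheritance described above.
class TextFrame:
    def __init__(self, text: str) -> None:
        self.text = text


class TranscriptionFrame(TextFrame):
    """Stand-in for a finalized transcription."""


class InterimTranscriptionFrame(TextFrame):
    """Stand-in for an interim transcription."""


interim = InterimTranscriptionFrame("Billing")

# Sibling classes do not match each other, only their shared base.
matches_final = isinstance(interim, TranscriptionFrame)
matches_base = isinstance(interim, TextFrame)

# A tuple of types matches if the object is an instance of any of them.
matches_either = isinstance(
    interim, (TranscriptionFrame, InterimTranscriptionFrame)
)

print(matches_final, matches_base, matches_either)
```

Because the two frame types are siblings, a check against `TranscriptionFrame` alone skips every interim frame.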
STT behavior context
Some STT services don't finalize short utterances immediately. A single word like "Billing." may only produce an InterimTranscriptionFrame, with the finalized TranscriptionFrame arriving only when the user speaks again (e.g., "Billing. Hello?" as a single finalized transcription 15+ seconds later).
This may also be an STT configuration issue on our side — any guidance on expected STT finalization behavior would be helpful.
Suggested fix
Have TurnAnalyzerUserTurnStopStrategy.process_frame also handle InterimTranscriptionFrame to set _text, similar to how it handles TranscriptionFrame:
    elif isinstance(frame, InterimTranscriptionFrame):
        text = frame.text.strip()
        if text:
            self._text = text
            await self._maybe_trigger_user_turn_stopped()
            # Fallback path for no-VAD-stop scenario
            if not self._vad_user_speaking and self._vad_stopped_time is None:
                self._turn_complete = True
                ...

Workaround
We currently work around this by setting TranscriptionUserTurnStartStrategy(enable_interruptions=False) and increasing VAD stop_secs to give users more time to finish multi-word answers within a single turn.
Environment
- pipecat 0.0.104
- Turn analyzer: LocalSmartTurnAnalyzerV3 with SmartTurnParams(stop_secs=2)