Real-time speech-to-text WebSocket server with pluggable ASR backends.

    WebSocket Client
            │
            ▼
    ┌─────────────────────────────────────┐
    │  FastAPI (async)                    │
    │                                     │
    │  _recv_pump ──► in_queue            │
    │                    │                │
    │           run_worker (thread)       │
    │                    │                │
    │  _send_pump ◄── out_queue           │
    └─────────────────────────────────────┘
            │
            ▼
    ASR Backend (Qwen3 / Mock)

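
The data flow in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses only the standard library (`queue.Queue` plus `loop.call_soon_threadsafe`) as a stand-in for the janus queues, the queues are left unbounded, and `handle_connection`, the `_STOP` sentinel, and the placeholder "inference" result are all hypothetical.

```python
import asyncio
import queue
import threading

_STOP = object()  # hypothetical sentinel telling the worker to exit


def run_worker(in_q: queue.Queue, out_q: asyncio.Queue,
               loop: asyncio.AbstractEventLoop) -> None:
    """Blocking loop that runs in a plain thread (stands in for ASR inference)."""
    while True:
        frame = in_q.get()  # blocks the worker thread, never the event loop
        if frame is _STOP:
            loop.call_soon_threadsafe(out_q.put_nowait, _STOP)
            return
        result = {"type": "partial", "bytes": len(frame)}  # placeholder result
        # asyncio.Queue is not thread-safe, so hand the result to the loop thread.
        loop.call_soon_threadsafe(out_q.put_nowait, result)


async def handle_connection(frames):
    """Wire a receive pump, a worker thread, and a send pump together."""
    loop = asyncio.get_running_loop()
    in_q: queue.Queue = queue.Queue()      # unbounded here; overflow policy omitted
    out_q: asyncio.Queue = asyncio.Queue()
    worker = threading.Thread(target=run_worker, args=(in_q, out_q, loop), daemon=True)
    worker.start()
    sent = []

    async def recv_pump():
        for frame in frames:               # stands in for reading from the WebSocket
            in_q.put_nowait(frame)
            await asyncio.sleep(0)         # yield so the send pump can run
        in_q.put_nowait(_STOP)

    async def send_pump():
        while True:
            msg = await out_q.get()
            if msg is _STOP:
                return
            sent.append(msg)               # stands in for writing to the WebSocket

    await asyncio.gather(recv_pump(), send_pump())
    worker.join()
    return sent
```

Calling `asyncio.run(handle_connection([b"\x00" * 640]))` pushes one frame through the thread and back. The point of the split is that blocking inference never runs on the event loop thread.
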
- Async I/O for WebSocket recv/send via FastAPI
- Sync worker thread for inference (bridged with janus queues)
- Pluggable backends via the `ASRBackend` protocol
- Energy-based VAD for utterance endpoint detection (numpy-based RMS)
- Prometheus metrics at `/metrics` and health check at `/health`
- Adaptive polling (50 ms active, 200 ms idle) for CPU efficiency
- Partial backpressure (drops partial results when the output queue is more than 80% full)
- Drop-oldest handling for queue overflow protection
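
The two queue policies in the list above might look something like this. It is a sketch under assumptions: `DropOldestQueue` and `should_send` are hypothetical names, the `"type": "partial"` message field is assumed rather than taken from the server's schema, and the real implementation may count drops or compare against the 80% mark differently.

```python
from collections import deque


class DropOldestQueue:
    """Bounded FIFO that evicts the oldest item instead of blocking when full."""

    def __init__(self, maxsize: int) -> None:
        self.maxsize = maxsize
        self.drops = 0            # running count of evicted items
        self._items = deque()

    def put(self, item) -> None:
        if len(self._items) >= self.maxsize:
            self._items.popleft()  # discard the stalest item to make room
            self.drops += 1
        self._items.append(item)

    def get(self):
        return self._items.popleft()

    def __len__(self) -> int:
        return len(self._items)


def should_send(msg: dict, queued: int, capacity: int, high_water: float = 0.8) -> bool:
    """Skip partial results while the output queue is above the high-water mark.

    Anything that is not a partial (for example a final result) always passes.
    """
    if msg.get("type") == "partial" and queued > capacity * high_water:
        return False
    return True
```

With a capacity of 50, partials are skipped once more than 40 messages are queued. Dropping the oldest audio keeps the stream close to real time at the cost of losing some input.
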

| Backend | Description | Requires |
|---|---|---|
| `qwen3` | Qwen3-ASR-1.7B via embedded vLLM (in-process GPU) | CUDA + `vllm[audio]` |
| `mock` | Fixture-based replay for testing | Nothing extra |

Select the backend with the `STT_BACKEND` environment variable (default: `qwen3`).
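
To illustrate how a pluggable backend could be selected from `STT_BACKEND`, here is a small sketch. The names `ASRBackend` and `ASRResult` appear in this README, but the fields, the `transcribe` signature, the `_REGISTRY` dict, and `create_backend` are assumptions made for the example. Only a mock backend is registered, so the `qwen3` default fails here by design.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Protocol


@dataclass
class ASRResult:
    text: str       # assumed field
    is_final: bool  # assumed field


class ASRBackend(Protocol):
    """Structural interface: any class with a matching transcribe() qualifies."""

    def transcribe(self, pcm: bytes, sample_rate: int) -> ASRResult: ...


class MockBackend:
    """Returns a fixed transcript, standing in for fixture-based replay."""

    def __init__(self, transcript: str = "hello world") -> None:
        self._transcript = transcript

    def transcribe(self, pcm: bytes, sample_rate: int) -> ASRResult:
        return ASRResult(text=self._transcript, is_final=True)


# Hypothetical registry mapping backend names to zero-argument factories.
_REGISTRY: Dict[str, Callable[[], ASRBackend]] = {"mock": MockBackend}


def create_backend(name: Optional[str] = None) -> ASRBackend:
    """Resolve a backend by name, falling back to STT_BACKEND (default: qwen3)."""
    name = name or os.environ.get("STT_BACKEND", "qwen3")
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown or unavailable backend: {name!r}") from None
    return factory()
```

Because `ASRBackend` is a `Protocol`, a new backend does not need to inherit from anything. It only has to provide the same method shape and be added to the registry.
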

| Phase | Backend | Model | Status |
|---|---|---|---|
| Phase 1 | `qwen3` | Qwen3-ASR-1.7B (vLLM) | Done |
| Phase 2 | `whisper` | Whisper | Planned |
| Phase 3 | `nemo` | NVIDIA NeMo | Planned |

    # Install core + dev dependencies
    uv sync --group dev

    # With Qwen3 backend (requires CUDA)
    uv sync --group dev --extra qwen3

    # Run with mock backend (no GPU needed)
    STT_BACKEND=mock uvicorn app:app --host 0.0.0.0 --port 8000

    # Run with Qwen3 backend (requires CUDA)
    STT_BACKEND=qwen3 uvicorn app:app --host 0.0.0.0 --port 8000

Connect to `ws://host:port/v1/stt/ws` with query parameters:

| Parameter | Required | Default | Description |
|---|---|---|---|
| `call_id` | yes | - | Unique call identifier |
| `sample_rate` | yes | - | Must be 16000 |
| `codec` | no | `pcm_s16le` | Audio codec |
| `channels` | no | `1` | Must be mono |
| `frame_ms` | no | `20` | Frame duration in ms |
| `partial_ms` | no | `500` | Partial result interval in ms |
| `language` | no | `auto` | BCP-47 code (e.g. `zh`, `en`, `ja`) |
| `hotwords` | no | - | Comma-separated hotwords |

Send: binary PCM16LE audio frames, or JSON {"type": "commit"} to force-finalize.
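
On the client side, preparing the connection URL and the audio frames comes down to a little arithmetic: at 16000 Hz, mono, 2 bytes per sample, a 20 ms frame is 640 bytes. The helpers below are a hypothetical sketch (none of these function names come from the project), and whether the server accepts a shorter trailing frame is not stated here.

```python
import json
from urllib.parse import urlencode

SAMPLE_RATE = 16000    # the only sample rate the endpoint accepts
BYTES_PER_SAMPLE = 2   # pcm_s16le
CHANNELS = 1           # mono


def build_ws_url(host: str, port: int, call_id: str, **optional) -> str:
    """Assemble the WebSocket URL with required and optional query parameters."""
    params = {"call_id": call_id, "sample_rate": SAMPLE_RATE, **optional}
    return f"ws://{host}:{port}/v1/stt/ws?{urlencode(params)}"


def frame_size_bytes(frame_ms: int = 20) -> int:
    """Bytes per frame: samples per frame times bytes per sample times channels."""
    return SAMPLE_RATE * frame_ms // 1000 * BYTES_PER_SAMPLE * CHANNELS


def split_frames(pcm: bytes, frame_ms: int = 20) -> list:
    """Cut raw PCM into frame-sized chunks; the last chunk may be shorter."""
    size = frame_size_bytes(frame_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


# Text message that asks the server to force-finalize the current utterance.
COMMIT_MESSAGE = json.dumps({"type": "commit"})
```

A client would send each chunk from `split_frames` as a binary message, then send `COMMIT_MESSAGE` as text when it wants a final result without waiting for silence.
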
Receive: JSON messages:
For batch transcription, use `POST /v1/stt/transcribe`:

    # Upload a WAV file for transcription
    # (language is passed in the query string; the file goes in the multipart body)
    curl -X POST "http://localhost:8000/v1/stt/transcribe?language=zh" \
      -F "file=@audio.wav"

    # Response:
    # {
    #   "text": "转写文本内容",
    #   "duration_ms": 3200.0,
    #   "latency_ms": 450.3
    # }

Supported audio formats: WAV (PCM 16/32-bit, IEEE Float 32/64-bit). Audio is automatically converted to 16 kHz mono.

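
If no recording is at hand, a 16-bit PCM WAV file in a supported format can be generated with Python's standard library and used to exercise the upload path. This is only a convenience sketch: `write_test_tone` is a hypothetical helper, and a pure sine tone will not yield a meaningful transcript.

```python
import math
import struct
import wave


def write_test_tone(path: str, seconds: float = 1.0, freq_hz: float = 440.0,
                    sample_rate: int = 16000) -> int:
    """Write a mono 16-bit PCM sine tone to `path` and return the frame count."""
    n_frames = int(seconds * sample_rate)
    amplitude = 0.3 * 32767  # stay well below full scale to avoid clipping
    samples = (
        int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        # Pack each sample as little-endian signed 16-bit.
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return n_frames
```

Running `write_test_tone("audio.wav")` produces a one-second file that can be passed to the upload request shown earlier.
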
All settings use the STT_ env prefix:

| Variable | Default | Description |
|---|---|---|
| `STT_BACKEND` | `qwen3` | Backend name (`qwen3` or `mock`) |
| `STT_HOST` | `0.0.0.0` | Server bind host |
| `STT_PORT` | `8000` | Server bind port |
| `STT_IN_QUEUE_SIZE` | `200` | Input queue capacity |
| `STT_IN_QUEUE_MAX_DROPS` | `500` | Max input queue drops before error |
| `STT_OUT_QUEUE_SIZE` | `50` | Output queue capacity |
| `STT_MAX_UPLOAD_BYTES` | `104857600` | Max upload size (100 MB) |
| `STT_VLLM_MODEL` | `Qwen/Qwen3-ASR-1.7B` | vLLM model name |
| `STT_VLLM_GPU_MEMORY` | `0.7` | GPU memory utilization (0-1) |
| `STT_VLLM_MAX_TOKENS` | `512` | Max output tokens |
| `STT_VLLM_MAX_INFERENCE_BATCH_SIZE` | `32` | Max batch size for inference |
| `STT_VAD_THRESHOLD` | `300` | RMS energy threshold for speech |
| `STT_VAD_SILENCE_MS` | `500` | Silence duration to trigger endpoint |
| `STT_STREAMING_CHUNK_SIZE_SEC` | `0.5` | Streaming chunk size in seconds |

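
To make the two VAD settings concrete, here is a rough sketch of energy-based endpoint detection using the defaults from the table (threshold 300, 500 ms of silence) and 20 ms frames. The server is described as using numpy for the RMS; this version uses only `struct` and `math` so it runs without dependencies. The `EnergyVAD` class, the `>=` comparison, and the rule that an endpoint fires only after speech has been heard are assumptions.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square energy of a little-endian 16-bit PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class EnergyVAD:
    """Signals an utterance endpoint after enough consecutive quiet frames."""

    def __init__(self, threshold: float = 300.0, silence_ms: int = 500,
                 frame_ms: int = 20) -> None:
        self.threshold = threshold
        # 500 ms of silence at 20 ms per frame is 25 consecutive quiet frames.
        self.frames_needed = math.ceil(silence_ms / frame_ms)
        self._silent_frames = 0
        self._in_speech = False

    def feed(self, frame: bytes) -> bool:
        """Return True when this frame completes an utterance endpoint."""
        if rms(frame) >= self.threshold:
            self._in_speech = True
            self._silent_frames = 0   # any loud frame resets the silence run
            return False
        if not self._in_speech:
            return False              # leading silence never triggers an endpoint
        self._silent_frames += 1
        if self._silent_frames >= self.frames_needed:
            self._in_speech = False
            self._silent_frames = 0
            return True
        return False
```

Raising `STT_VAD_THRESHOLD` makes the detector less sensitive to background noise, while a larger `STT_VAD_SILENCE_MS` tolerates longer pauses before an utterance is finalized.
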
    # Run mock tests (no GPU needed)
    STT_BACKEND=mock pytest tests/mock/ -v

    # Run qwen3 tests (requires CUDA)
    STT_BACKEND=qwen3 pytest tests/qwen3/ -v

Project layout:

    ├── app.py                 # FastAPI entry point
    ├── config.py              # Pydantic settings
    ├── backends/
    │   ├── base.py            # ASRBackend protocol + ASRResult
    │   ├── registry.py        # Backend factory
    │   ├── mock.py            # Fixture-based mock backend
    │   └── qwen3.py           # vLLM Qwen3 backend
    ├── inference/
    │   └── worker.py          # Sync inference worker thread
    ├── protocol/
    │   └── schema.py          # Message schemas & validation
    ├── session/
    │   └── state.py           # Per-connection session state
    ├── transport/
    │   └── ws.py              # WebSocket router & pumps
    ├── observability/
    │   ├── logging.py         # Structured logging (structlog)
    │   └── metrics.py         # Prometheus metrics
    ├── references/            # Test fixtures (audio + text)
    └── tests/                 # Unit & integration tests