This document describes the WebSocket and REST API endpoints for the Janus backend.
Endpoint: ws://localhost:8000/ws/janus
The WebSocket connection provides bidirectional communication between the frontend and backend. The backend sends transcript and packet summary events, while the frontend sends control messages to update engine state.
Sent from the frontend to update the engine control state. Any field set to null or omitted indicates "no change" to that field.
Type: "control"
Fields:
- type (string, required): Always "control"
- is_streaming (boolean, optional): Enable/disable streaming mode (VAD-based processing)
- is_recording (boolean, optional): Enable/disable recording mode (hold-to-record)
- mode (string, optional): Transmission mode ("semantic", "text_only", or "morse")
- emotion_override (string, optional): Emotion override ("auto", "relaxed", or "panicked")
Example:
{
"type": "control",
"is_recording": true,
"mode": "semantic",
"emotion_override": "auto"
}
Sent from the backend when a speech segment is transcribed.
Type: "transcript"
Fields:
- type (string, required): Always "transcript"
- text (string, required): Transcribed text content
- start_ms (integer, optional): Start timestamp in milliseconds
- end_ms (integer, optional): End timestamp in milliseconds
- avg_pitch_hz (float, optional): Average pitch in Hz (F0)
- avg_energy (float, optional): Average energy level
Example:
{
"type": "transcript",
"text": "Hello, this is a test message",
"start_ms": 1000,
"end_ms": 3500,
"avg_pitch_hz": 180.5,
"avg_energy": 0.75
}
Sent from the backend when a packet is transmitted, providing metadata for bandwidth visualization.
Type: "packet_summary"
Fields:
- type (string, required): Always "packet_summary"
- bytes (integer, required): Packet size in bytes
- mode (string, required): Transmission mode ("semantic", "text_only", or "morse")
- created_at_ms (integer, required): Packet creation timestamp in milliseconds
Example:
{
"type": "packet_summary",
"bytes": 142,
"mode": "semantic",
"created_at_ms": 1699123456789
}
Transmission modes for Janus packets:
- "semantic": Full semantic transmission (text + prosody data)
- "text_only": Text-only transmission (no prosody)
- "morse": Morse code transmission
Emotion override options:
- "auto": Use prosody-extracted emotion (default)
- "relaxed": Force relaxed emotion
- "panicked": Force panicked emotion
- Connection: Frontend establishes WebSocket connection to /ws/janus
- Control Updates: Frontend sends ControlMessage to update engine state
- Event Streaming: Backend continuously sends TranscriptMessage and PacketSummaryMessage events
- Disconnection: Either side can close the connection; backend handles cleanup gracefully
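The client side of this lifecycle can be sketched in a transport-agnostic way: build control messages as JSON, and route each incoming text frame by its "type" field. The helper names dispatch_event and build_control are hypothetical and not part of Janus; the sketch assumes only the message shapes documented above, and leaves the actual WebSocket library out.

```python
import json
from typing import Callable, Dict

# Hypothetical handler registry keyed by the "type" field of each message.
Handler = Callable[[dict], None]


def dispatch_event(raw: str, handlers: Dict[str, Handler]) -> bool:
    """Parse one JSON text frame and route it by its "type" field.

    Returns True if a handler ran, False if the frame was ignored.
    """
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not JSON: ignore rather than crash the receive loop
    if not isinstance(message, dict):
        return False
    msg_type = message.get("type")
    if not isinstance(msg_type, str):
        return False  # missing or malformed type field
    handler = handlers.get(msg_type)
    if handler is None:
        return False  # unknown message type
    handler(message)
    return True


def build_control(**fields) -> str:
    """Serialize a control message, leaving out fields that are None.

    Omitted fields mean "no change" on the backend.
    """
    payload = {"type": "control"}
    payload.update({k: v for k, v in fields.items() if v is not None})
    return json.dumps(payload)
```

For example, build_control(is_recording=True, mode=None) produces a message containing only "type" and "is_recording", so the backend leaves the mode untouched.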
- Invalid message format: Backend logs error and continues processing
- WebSocket disconnection: Backend cancels tasks and cleans up resources
- Connection errors: Frontend should implement reconnection logic
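The reconnection logic is left to the client. One common approach, shown here only as a sketch, is exponential backoff with a cap and optional jitter. The function name backoff_delays and its default values are assumptions, not part of the Janus frontend, and the same idea ports directly to the frontend's own language.

```python
import random
from typing import Iterator


def backoff_delays(base: float = 0.5, cap: float = 30.0, jitter: float = 0.0) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... up to cap.

    A random amount in [0, jitter] is added to each delay so that many
    clients do not all retry at the same instant.
    """
    delay = base
    while True:
        yield min(cap, delay) + random.uniform(0.0, jitter)
        delay = min(cap, delay * 2)
```

A client would wait for the next delay from the generator after each failed connection attempt and create a fresh generator once a connection succeeds.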
Endpoint: GET /api/health
Returns the health status of the backend server.
Response:
{
"status": "ok"
}
Status Codes:
200: Server is healthy
Endpoint: POST /api/voice/verify
Verifies and saves reference audio for voice cloning. Accepts an audio file, transcribes it, and verifies it matches the verification phrase ("The quick brown fox jumps over the lazy dog.").
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: Audio file (supports WAV, WebM, etc.)
Response (Success):
{
"status": "verified"
}
Response (Failure):
{
"status": "failed",
"transcript": "actual transcribed text"
}
Status Codes:
200: Request processed (check status field for verification result)
Verification:
- The audio file is transcribed using faster-whisper
- Transcript is compared to verification phrase using sequence matching
- Similarity threshold: 80% (0.8)
- If verified, audio is saved as backend/reference_audio.wav
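The comparison step can be illustrated with Python's difflib.SequenceMatcher. This is a sketch under assumptions: the document only says "sequence matching", so the use of difflib, the normalization (lowercasing and stripping punctuation), and the inclusive >= comparison are guesses rather than the backend's confirmed behavior. The helper names normalize and is_verified are hypothetical.

```python
import re
from difflib import SequenceMatcher

VERIFICATION_PHRASE = "The quick brown fox jumps over the lazy dog."
SIMILARITY_THRESHOLD = 0.8  # 80%, as documented


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed normalization)."""
    words = re.sub(r"[^a-z0-9\s]", "", text.lower()).split()
    return " ".join(words)


def is_verified(transcript: str) -> bool:
    """Return True when the transcript is close enough to the phrase."""
    ratio = SequenceMatcher(
        None, normalize(transcript), normalize(VERIFICATION_PHRASE)
    ).ratio()
    # Assumption: the threshold is inclusive; the backend may use a strict ">".
    return ratio >= SIMILARITY_THRESHOLD
```

With this normalization, a transcript that differs from the phrase only in casing or punctuation scores 1.0, while unrelated speech falls well below the threshold.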
All WebSocket messages are JSON strings. The backend uses Pydantic models for validation, ensuring type safety and proper serialization.
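To make the validation rules concrete, here is a standard-library stand-in for what a Pydantic model of the control message would enforce. It is not the backend's actual model: parse_control and the two ALLOWED_* sets are hypothetical names, and the checks are derived only from the field descriptions in this document.

```python
import json

ALLOWED_MODES = {"semantic", "text_only", "morse"}
ALLOWED_EMOTIONS = {"auto", "relaxed", "panicked"}


def parse_control(raw: str) -> dict:
    """Validate a control message; raise ValueError on malformed input.

    Returns the four optional fields, with None meaning "no change".
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or data.get("type") != "control":
        raise ValueError('expected an object with "type": "control"')

    for flag in ("is_streaming", "is_recording"):
        if data.get(flag) is not None and not isinstance(data[flag], bool):
            raise ValueError(f"{flag} must be a boolean or null")

    for name, allowed in (("mode", ALLOWED_MODES), ("emotion_override", ALLOWED_EMOTIONS)):
        value = data.get(name)
        if value is not None and (not isinstance(value, str) or value not in allowed):
            raise ValueError(f"{name} must be one of {sorted(allowed)} or null")

    return {
        "is_streaming": data.get("is_streaming"),
        "is_recording": data.get("is_recording"),
        "mode": data.get("mode"),
        "emotion_override": data.get("emotion_override"),
    }
```

A caller that follows the documented error handling would catch the ValueError, log it, and keep processing subsequent messages.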
- Frontend connects and sends initial control state
- Backend processes audio based on control state
- Backend sends transcript events as speech is detected and transcribed
- Backend sends packet summary events when packets are transmitted
- Frontend updates control state as user interacts with UI
- Backend responds to control updates by changing engine behavior
Control state is managed in backend/common/engine_state.py:
- ControlState object holds current mode, flags, and emotion override
- Updated atomically when ControlMessage is received
- Read by smart_ear_loop to determine processing behavior
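One way to get the "null or omitted means no change" semantics together with an atomic update is to compute the changes first and then apply them under a lock. The class below is an illustrative stand-in, not the contents of engine_state.py: the default values (other than "auto"), the apply and snapshot method names, and the use of threading.Lock are all assumptions.

```python
import threading
from dataclasses import dataclass, field

# The four fields a control message may update.
FIELDS = ("is_streaming", "is_recording", "mode", "emotion_override")


@dataclass
class ControlState:
    is_streaming: bool = False      # assumed default
    is_recording: bool = False      # assumed default
    mode: str = "semantic"          # assumed default
    emotion_override: str = "auto"  # documented default
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )

    def apply(self, message: dict) -> None:
        """Apply every non-null field of a control message in one step."""
        updates = {k: message[k] for k in FIELDS if message.get(k) is not None}
        with self._lock:
            for name, value in updates.items():
                setattr(self, name, value)

    def snapshot(self) -> dict:
        """Return a consistent copy of the state for the processing loop."""
        with self._lock:
            return {name: getattr(self, name) for name in FIELDS}
```

A reader such as the processing loop would call snapshot() once per iteration so that it never observes a half-applied update.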
Event queues (transcript_queue, packet_queue) are used to decouple engine processing from WebSocket communication:
- Engine pushes events to queues
- WebSocket manager drains queues and forwards to frontend
- Async queues ensure non-blocking operation
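The decoupling described above can be sketched with asyncio.Queue. Everything here is illustrative: engine, forward, and demo are hypothetical names, the send callback stands in for the WebSocket send call, and the None sentinel is a simplification (the backend is described as cancelling tasks on disconnect instead).

```python
import asyncio
import json


async def engine(transcript_queue: asyncio.Queue) -> None:
    """Producer: push events without knowing who consumes them."""
    for text in ("hello", "world"):
        await transcript_queue.put({"type": "transcript", "text": text})
    await transcript_queue.put(None)  # sentinel: no more events (assumption)


async def forward(transcript_queue: asyncio.Queue, send) -> None:
    """Consumer: drain the queue and forward each event as a JSON string."""
    while True:
        event = await transcript_queue.get()
        if event is None:
            break
        await send(json.dumps(event))


async def demo() -> list:
    """Run producer and consumer concurrently against a fake send function."""
    sent = []

    async def fake_send(frame: str) -> None:
        sent.append(frame)

    queue: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(engine(queue), forward(queue, fake_send))
    return sent
```

Because the consumer awaits the queue rather than polling it, neither side blocks the event loop while the other is idle.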