Janus is a real-time semantic audio codec system designed to optimize bandwidth by transmitting semantic meaning rather than raw audio waveforms. Instead of sending compressed audio data, Janus converts speech to text, extracts prosodic metadata (pitch and energy), and reconstructs the voice on the receiver side using generative text-to-speech (TTS) synthesis. This approach enables high-quality voice communication over extremely constrained network connections (as low as 300 bps).
Janus implements concepts introduced in SemantiCodec (Liu et al., 2024), a semantic codec that demonstrated the viability of sub-kbps speech transmission. While the original paper focuses on diffusion-based reconstruction, Janus adapts the idea for real-time use by replacing the diffusion decoder with a faster LLM-based instruction pipeline.
Janus extends this approach into a real-time system using an end-to-end STT → semantic-packet → TTS pipeline:
- STT Layer (Faster-Whisper): Transcribes incoming speech to text.
- Prosody Layer (Aubio): Captures pitch and energy to preserve vocal tone.
- Compression Layer (MessagePack): Packages text and prosody into ~300 bps payloads.
- Reconstruction Layer (FishAudio TTS): Generates natural speech from the received semantic instructions.
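To make the semantic-packet idea concrete, here is a minimal sketch of what a payload carrying text plus prosody might look like. It is not the Janus wire format: the `SemanticPacket` fields, the header layout, and the 0-255 energy scale are assumptions for illustration, and it uses Python's standard `struct` module as a stand-in for MessagePack so that it runs without extra dependencies.

```python
import struct
from dataclasses import dataclass


@dataclass
class SemanticPacket:
    """Hypothetical packet layout: transcript plus coarse prosody."""
    text: str      # transcript produced by the STT layer
    pitch_hz: int  # mean pitch of the utterance, rounded to whole Hz
    energy: int    # loudness on an assumed 0-255 scale


# Assumed header: uint16 pitch, uint8 energy, uint16 text length (big-endian).
_HEADER = struct.Struct(">HBH")


def encode(packet: SemanticPacket) -> bytes:
    """Serialize the packet as a fixed header followed by UTF-8 text."""
    body = packet.text.encode("utf-8")
    return _HEADER.pack(packet.pitch_hz, packet.energy, len(body)) + body


def decode(payload: bytes) -> SemanticPacket:
    """Inverse of encode(): read the header, then the text it describes."""
    pitch_hz, energy, length = _HEADER.unpack_from(payload)
    start = _HEADER.size
    text = payload[start:start + length].decode("utf-8")
    return SemanticPacket(text=text, pitch_hz=pitch_hz, energy=energy)


def bitrate_bps(payload: bytes, speech_seconds: float) -> float:
    """Average bits per second needed to deliver the payload in real time."""
    return len(payload) * 8 / speech_seconds


packet = SemanticPacket(text="Evacuate sector four now", pitch_hz=180, energy=200)
payload = encode(packet)
assert decode(payload) == packet
print(len(payload), "bytes;", bitrate_bps(payload, speech_seconds=2.0), "bps")
```

The point of the sketch is that the payload size tracks the length of the transcript, not the duration or fidelity of the audio, which is why the bitrate can stay in the hundreds of bits per second.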
Traditional codecs bind audio quality to bandwidth: less data means worse sound. Janus sidesteps this by transmitting text and prosody instructions instead of audio. Because the receiver synthesizes the voice locally, the reconstructed speech stays clear at 300 bps, provided the link can carry the payload.
- Operating Bitrate: 300 bits per second (bps)
- Comparison to VoIP: ~20x more efficient than standard VoIP codecs like Opus (which requires a minimum of ~6 kbps for robust operation)
- Comparison to SOTA Codecs: 5-10x more efficient than state-of-the-art neural waveform codecs (Lyra/EnCodec, which reach a practical compression floor at ~1.5-3 kbps)
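The efficiency figures above follow directly from dividing each reference bitrate by the 300 bps operating point. A small sketch of that arithmetic, using only the numbers quoted in this section (the dictionary labels are just illustrative names):

```python
JANUS_BPS = 300  # operating bitrate quoted above

# Reference bitrates quoted above, in bits per second.
reference_bps = {
    "Opus (minimum for robust operation)": 6_000,
    "Neural codec floor (low end)": 1_500,
    "Neural codec floor (high end)": 3_000,
}


def efficiency_ratio(reference: int, janus: int = JANUS_BPS) -> float:
    """How many times more bits the reference codec needs per second."""
    return reference / janus


ratios = {name: efficiency_ratio(bps) for name, bps in reference_bps.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}x")
```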
Pricing Comparison: Janus achieves a roughly 158x cost reduction for critical satellite communication:
- Standard Satellite Voice (Iridium Land): ~$0.89 per minute
- Janus Semantic Voice (Iridium Certus Data): ~$0.0056 per event
Operational Impact: For industrial users operating remote fleets, this nearly eliminates voice communication expenses:
- Standard Voice OPEX: $13,350/month for a single fleet
- Semantic Voice OPEX: $84/month for the same fleet
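The pricing and OPEX figures are consistent with each other, which can be checked with a few lines of arithmetic. This sketch assumes both rates apply to the same unit of usage (the figures above quote one per minute and the other per event), and the 15,000-unit monthly volume is not stated anywhere in the source; it is simply what the $13,350 figure implies at $0.89 per unit.

```python
from decimal import Decimal

# Rates and monthly cost quoted above (USD).
standard_rate = Decimal("0.89")    # standard satellite voice
semantic_rate = Decimal("0.0056")  # Janus semantic voice
standard_opex = Decimal("13350")   # monthly cost for one fleet

# Monthly usage implied by the standard-voice figures (an inference).
implied_units = standard_opex / standard_rate

# Cost of the same usage at the semantic rate, and the resulting ratio.
semantic_opex = implied_units * semantic_rate
cost_ratio = standard_rate / semantic_rate

print(f"implied usage: {implied_units} units/month")
print(f"semantic OPEX: ${semantic_opex:.2f}/month")
print(f"cost ratio: {cost_ratio:.1f}x")
```

Under that assumption the ratio comes out just under 159, in line with the 158x headline figure, and the semantic OPEX matches the $84/month line.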
Public Safety and Disaster Relief
- Reliable communication when infrastructure fails during mass casualty events (Maui wildfires, Hurricane Helene)
- Crystal-clear synthesized audio reduces cognitive load on first responders
Global South and Rural Connectivity
- Voice over ultra-low-power networks (LoRaWAN, LPWAN) where high-bandwidth links are not viable
- Addresses the digital divide in underserved regions
Maritime Communications
- Primary/backup voice over expensive L-band satellites (Iridium/Inmarsat)
- Removes the cost barrier that discourages detailed voice exchanges at sea
Smart Mining Operations
- Coordinates supervisors in remote surface operations
- Maintains communication in subterranean GPS-denied environments
Low-Power/Off-Grid IoT
- Voice commands on battery-powered devices and sensor networks
- Fits within strict regulatory duty-cycle limits (1% in Europe) that rule out continuous voice streams
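The duty-cycle point can be illustrated with a rough airtime budget. The sketch below assumes a hypothetical 5 kbps radio link and ignores packet headers, retransmissions, and per-band rules, so the results are illustrative rather than a statement about any specific network; `speech_seconds_per_hour` is a name introduced here, not part of Janus.

```python
def speech_seconds_per_hour(
    link_bps: float,
    codec_bps: float,
    duty_cycle: float = 0.01,
) -> float:
    """Seconds of speech that fit in one hour of duty-cycle-limited airtime.

    link_bps   -- raw radio data rate (assumed value, varies by network)
    codec_bps  -- bitrate of the voice payload being carried
    duty_cycle -- fraction of each hour the device may transmit
    """
    airtime_s = 3600 * duty_cycle      # allowed transmit time per hour
    bits_budget = airtime_s * link_bps  # bits that fit in that airtime
    return bits_budget / codec_bps


LINK_BPS = 5_000  # hypothetical link rate for illustration

semantic = speech_seconds_per_hour(LINK_BPS, codec_bps=300)
waveform = speech_seconds_per_hour(LINK_BPS, codec_bps=6_000)

print(f"300 bps semantic payload: {semantic:.0f} s of speech per hour")
print(f"6 kbps waveform codec:    {waveform:.0f} s of speech per hour")
```

With these assumed numbers, a 1% duty cycle leaves 36 seconds of airtime per hour, which carries about ten minutes of speech at 300 bps but only about half a minute at 6 kbps.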
- SETUP.md: Environment setup, installation, and start instructions
- ARCHITECTURE.md: Architecture, tech stack, and design decisions
- API.md: WebSocket and REST API reference
- TESTING.md: Testing guidelines
- STYLE.md: Coding standards
See LICENSE file for details.



