Doppelvoice

Your voice, in any language. Real-time speech-to-speech translation with zero-shot voice cloning across 9 languages (Chinese / English / Japanese / Indonesian / Spanish / Portuguese / German / French + bilingual ZH⇄EN auto). The other party hears the target language in your own voice through any meeting app — Zoom, Teams, WeChat, Google Meet, OBS, anything that takes a microphone.

Powered by ByteDance Doubao Seed LiveInterpret 2.0.

中文 · Architecture · Setup · Troubleshooting



What it does

You speak <source lang>  ─►  Doppelvoice  ─►  Peer hears <target lang> (in your voice)
   ┌──────────────────┐       ┌─────────┐       ┌──────────────────────────────┐
   │     your mic     │ ────► │ Doubao  │ ────► │ virtual mic → Zoom / Teams … │
   └──────────────────┘       │ AST 2.0 │       └──────────────────────────────┘
                              └─────────┘

Pick any of 9 source/target language codes (zh / en / ja / id / es / pt / de / fr) or use zhen on both sides for bilingual ZH⇄EN auto-detection.

End-to-end latency ≈ 2.5–3 s. Subtitles stream token-by-token; voice is cloned zero-shot from your speech as you talk.

Features

  • 🎙 End-to-end speech-to-speech — no separate STT / MT / TTS plumbing
  • 🗣 Zero-shot voice cloning — model captures your voice on the fly; explicit denoise=false to retain breath / resonance details
  • 🌐 9 languages — zh / en / ja / id / es / pt / de / fr / zhen (the last is the bilingual ZH⇄EN auto mode)
  • ~2.5 s latency — production-grade real-time
  • 🪟 Native Windows GUI (PySide6) with live bilingual subtitles
  • 🔌 Universal compatibility — anything that accepts a microphone works
  • 🔁 Automatic reconnect with exponential backoff and fatal-error classification
  • 🔒 Privacy-first defaults — translated audio and subtitles never persist to disk unless you opt in; logs auto-redact API keys and bearer tokens
  • 🧹 Clean device picker — one entry per physical device (host-API duplicates collapsed; MME 31-char name truncation handled)
  • 🛠 Configurable — sample rate, jitter buffer, RMS gate, denoise toggle, speaker_id, all tweakable
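The automatic-reconnect behavior above boils down to exponential backoff with jitter, retrying transient failures but bailing out immediately on fatal ones. A minimal sketch of that pattern — the `FATAL` codes and `connect_with_backoff` name are illustrative assumptions, not the project's actual API:

```python
import asyncio
import random

# Hypothetical error codes that should never be retried (e.g. bad credentials).
FATAL = {"auth_failed", "invalid_resource"}

async def connect_with_backoff(connect, base=0.5, cap=30.0, max_tries=6):
    """Retry `connect()` with exponential backoff + full jitter; re-raise fatal errors."""
    for attempt in range(max_tries):
        try:
            return await connect()
        except ConnectionError as exc:
            if str(exc) in FATAL:
                raise  # fatal-error classification: do not retry
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            await asyncio.sleep(delay)
    raise ConnectionError("gave up after retries")
```

Full jitter keeps many clients that disconnected simultaneously from reconnecting in lockstep, which matters when the server drops a whole region of sessions at once.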

Demo

Doppelvoice GUI

Quick start

Two ways to install. Option A is the fastest (no Python needed).

Option A — Pre-built Windows binary (recommended)

  1. Install VB-Audio Virtual Cable → run installer as admin → reboot.
  2. Download the latest Doppelvoice-vX.Y.Z-win64.zip from the Releases page.
  3. Unzip anywhere, then inside the folder run copy .env.example .env and fill in DOUBAO_APP_KEY / DOUBAO_ACCESS_KEY (get them from the Volcengine Console).
  4. Double-click Doppelvoice.exe. The GUI opens.
  5. In your meeting app, set the microphone to CABLE Output (VB-Audio Virtual Cable).

Option B — From source (for developers)

git clone https://github.com/TianqBu/Doppelvoice.git
cd Doppelvoice
python -m venv .venv
.venv\Scripts\pip install -e .       :: installs from pyproject.toml
:: or: .venv\Scripts\pip install -r requirements.txt

copy .env.example .env
notepad .env       :: fill in DOUBAO_APP_KEY / DOUBAO_ACCESS_KEY

check.bat          :: verifies devices + API connectivity + StartSession
gui.bat            :: launches the GUI
run.bat            :: CLI mode

In your meeting app: pick CABLE Output (VB-Audio Virtual Cable) as the microphone.

CLI

run.bat                              :: start translation (CLI)
run.bat --gui                        :: launch GUI
run.bat --check                      :: self-check
run.bat --list-devices               :: list audio devices
run.bat --source en --target zh      :: reverse direction
run.bat --jitter-ms 80               :: lower latency (more underrun risk)
run.bat --log-level DEBUG            :: verbose logs

Configuration

All settings have sensible defaults. Override via .env or CLI flags.

| Variable | Default | Notes |
| --- | --- | --- |
| `DOUBAO_APP_KEY` / `DOUBAO_ACCESS_KEY` | required | from the Volcengine console |
| `DOUBAO_RESOURCE_ID` | `volc.service_type.10053` | AST 2.0 resource ID |
| `SOURCE_LANG` / `TARGET_LANG` | `zh` / `en` | one of zh / en / ja / id / es / pt / de / fr / zhen; use `zhen` on both sides for bilingual ZH⇄EN auto mode |
| `MODE` | `s2s` | `s2s` (speech→speech) or `s2t` (speech→text) |
| `DENOISE` | `0` | `1` = server-side denoise on (cleaner input but flatter voice clone); `0` keeps breath / resonance for better cloning |
| `SPEAKER_ID` | empty | Doubao `ReqParams.speaker_id` — empty = clone the speaker; set a preset like `zh_female_vv_uranus_bigtts` to use a stock voice instead |
| `INPUT_DEVICE` / `OUTPUT_DEVICE` | auto | substring of device name (host API hidden; one entry per physical device) |
| `LOG_LEVEL` | `INFO` | `DEBUG` for verbose output |
| `DUMP_AUDIO` | `false` | persist per-sentence ogg blobs (debug only) |
| `LOG_SUBTITLE` | `false` | persist subtitle text in logs (debug only) |
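Putting the settings above together, a typical .env for Chinese→English translation might look like this (key values are placeholders; only the two credentials are mandatory):

```ini
# Required — from the Volcengine console
DOUBAO_APP_KEY=your_app_key_here
DOUBAO_ACCESS_KEY=your_access_key_here

# Direction: Chinese in, English out (use zhen/zhen for bilingual auto mode)
SOURCE_LANG=zh
TARGET_LANG=en

# Keep denoise off to preserve voice-clone detail
MODE=s2s
DENOISE=0
```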

Architecture

src/doppelvoice/
├── engine/        # Doubao AST 2.0 protobuf WebSocket client
├── audio/         # PortAudio (sounddevice) capture + playback + ogg/opus decoder
├── pipeline/      # asyncio orchestration: capture → ws → decode → playback
├── gui/           # PySide6 + qasync
├── cli.py
└── config.py

See docs/en/ARCHITECTURE.md for the full protocol details.
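The pipeline/ layer's capture → ws → decode → playback orchestration is essentially three asyncio tasks joined by bounded queues. A toy sketch of that shape, with fake frames standing in for real PCM and the WebSocket round-trip (all names here are illustrative, not the project's actual functions):

```python
import asyncio

async def capture(mic_q: asyncio.Queue) -> None:
    # Real code reads PCM frames from sounddevice; here we fake three frames.
    for i in range(3):
        await mic_q.put(f"frame-{i}".encode())
    await mic_q.put(None)  # end-of-stream sentinel

async def translate(mic_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    # Stands in for the Doubao WebSocket send/receive + ogg/opus decode.
    while (frame := await mic_q.get()) is not None:
        await out_q.put(frame + b"-translated")
    await out_q.put(None)

async def playback(out_q: asyncio.Queue) -> list:
    # Real code writes to the virtual-cable output device; we just collect.
    played = []
    while (chunk := await out_q.get()) is not None:
        played.append(chunk)
    return played

async def main() -> list:
    # Bounded queues give natural backpressure if one stage stalls.
    mic_q, out_q = asyncio.Queue(maxsize=8), asyncio.Queue(maxsize=8)
    results = await asyncio.gather(capture(mic_q), translate(mic_q, out_q), playback(out_q))
    return results[2]

played = asyncio.run(main())
```

The bounded maxsize is the design point: if playback underruns or the network stalls, capture blocks instead of buffering audio without limit, which keeps latency from silently growing.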

Tested with

  • Windows 10 / 11 x64
  • Python 3.10–3.12
  • VB-Audio Virtual Cable 1.0.4 (Driver Pack 43)
  • Zoom, Tencent Meeting (腾讯会议), WeChat calls (微信电话), Google Meet (Chrome), OBS

Known limitations

  1. Voice cloning quality varies with mic and clarity. AirPods over Bluetooth HFP (16 kHz narrowband phone mode) gives mediocre results — a wired/USB mic or laptop built-in mic is recommended. The default denoise=false already tells the server to keep your voice's unique characteristics; toggling it on in Settings would flatten the clone further.
  2. End-to-end latency floor ≈ 2.5 s is the model's hard limit per the Seed LiveInterpret 2.0 paper; local processing adds <500 ms.
  3. Voice expressiveness of the public AST API is good but not as lively as the Volcengine Console demo (which goes through a different BFF endpoint).
  4. Per-sentence audio decoding (ogg_opus) adds ~500 ms latency vs raw PCM (which the API does not currently honor).
  5. Use headphones, not speakers. With external speakers the meeting audio gets re-captured by your mic, re-translated, and sent back to the peer as their own translated voice — a textbook acoustic feedback loop. See Troubleshooting.
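Headphones are the real fix for the feedback loop, but the RMS gate mentioned under Features also helps by muting frames below a loudness threshold so quiet speaker bleed never reaches the API. A minimal sketch of such a gate over 16-bit little-endian PCM — the function name and threshold are illustrative assumptions, not the project's implementation:

```python
import math
import struct

def rms_gate(frame: bytes, threshold: float = 500.0) -> bytes:
    """Replace a 16-bit PCM frame with silence if its RMS is below threshold."""
    n = len(frame) // 2
    if n == 0:
        return frame
    samples = struct.unpack(f"<{n}h", frame)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    return frame if rms >= threshold else b"\x00" * len(frame)
```

A gate only suppresses quiet bleed; once speaker output is loud enough to cross the threshold it will still loop, so this complements headphones rather than replacing them.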

Privacy

  • API keys live only in .env (gitignored).
  • Translated audio and subtitle text are not persisted to disk by default.
  • Set DUMP_AUDIO=1 / LOG_SUBTITLE=1 for debugging only.
  • All audio is sent through ByteDance's Doubao API. Review their Terms of Service before use with sensitive content.

Contributing

PRs welcome. See CONTRIBUTING.md.

License

MIT.

Acknowledgements
