
FlashTTS

Powered by state-of-the-art models such as SparkTTS, OrpheusTTS, and MegaTTS 3, FlashTTS delivers high-quality Mandarin speech synthesis and zero-shot voice cloning. With a clean and intuitive web interface, you can quickly generate natural, lifelike voices for dubbing, narration, accessibility, virtual characters, and more.

If you find FlashTTS helpful, please leave us a ⭐ Star!

✨ Highlights

| Feature | Description |
| --- | --- |
| 🚀 Multi-backend Acceleration | Supports high-performance inference engines such as vllm, sglang, llama-cpp, mlx-lm, and tensorrt-llm |
| 🎯 High Concurrency | Dynamic batching and asynchronous queues handle heavy traffic with ease |
| 🎛️ Full Parameter Control | Adjust pitch, speaking rate, temperature, emotion tags, and more |
| 📱 Lightweight Deployment | Built on FastAPI; start with a single command, minimal dependencies |
| 🔊 Long-form Synthesis | Handles very long texts while maintaining consistent voice quality |
| 🔄 Streaming TTS | Generates and plays audio in real time, reducing wait time and enhancing interactivity |
| 🎭 Multi-character Dialog | Synthesizes multiple roles within the same text, ideal for script dubbing |
| 🎨 Modern Frontend | Web-ready, responsive interface |
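The "dynamic batching and asynchronous queues" idea above can be illustrated with a minimal asyncio sketch. This is a toy, not FlashTTS's actual scheduler: `batch_worker`, `synthesize`, and the batch-size/wait-time limits are all hypothetical names and values chosen for the example.

```python
import asyncio

async def batch_worker(queue, process_batch, max_batch=4, max_wait=0.01):
    """Drain the queue into batches and resolve each request's future."""
    while True:
        item = await queue.get()
        batch = [item]
        # Collect more requests until the batch is full or the wait expires.
        try:
            while len(batch) < max_batch:
                batch.append(await asyncio.wait_for(queue.get(), max_wait))
        except asyncio.TimeoutError:
            pass
        texts = [text for text, _ in batch]
        results = process_batch(texts)  # one batched inference call
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def synthesize(queue, text):
    """Enqueue one request and wait for the batched worker to answer it."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    # Stand-in for the model's batched forward pass.
    fake_tts = lambda texts: [f"audio({t})" for t in texts]
    worker = asyncio.create_task(batch_worker(queue, fake_tts))
    outs = await asyncio.gather(*(synthesize(queue, t) for t in ["hi", "there", "world"]))
    worker.cancel()
    return outs

print(asyncio.run(main()))  # ['audio(hi)', 'audio(there)', 'audio(world)']
```

Because concurrent requests share a single batched model call, throughput rises with load while each caller still gets an ordinary awaitable result.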

🖼️ Frontend Demo

FastTTS.mp4

🔈 Voice Samples

Below are demos showcasing FlashTTS’s cloning capabilities across different models and characters.

SparkTTS Model

Donald Trump (EN)
Listen

Donald Trump (ZH)
Listen

Nezha
Listen

Li Jing
Listen

Yu Chengdong
Listen

Xu Zhisheng
Listen

MegaTTS 3 Model

Cai Xukun
Listen

Taiyi Zhenren
Listen

OrpheusTTS (ZH) Model

Changle
Listen

Baizhi
Listen

Quick Start

It is recommended to install flashtts in a Python 3.8–3.12 environment via pip:

pip install flashtts

For detailed installation steps, please refer to: installation guide

Local inference command:

flashtts infer \
  -i "hello world." \
  -o output.wav \
  -m ./models/your_model \
  -b vllm \
  [other optional parameters]

For detailed usage, please refer to: quick_start.md
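After running the command above, a quick way to sanity-check the generated file is the standard-library `wave` module. The snippet below writes one second of silence to a dummy `example.wav` only so it is self-contained; in practice you would point `wav_duration_seconds` at your real output file.

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Self-contained stand-in: one second of 16-bit mono silence at 16 kHz.
with wave.open("example.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(wav_duration_seconds("example.wav"))  # 1.0
```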

Server deployment:

flashtts serve \
  --model_path Spark-TTS-0.5B \
  --backend vllm \
  --role_dir data/roles \
  --llm_device cuda \
  --tokenizer_device cuda \
  --detokenizer_device cuda \
  --wav2vec_attn_implementation sdpa \
  --llm_attn_implementation sdpa \
  --torch_dtype "bfloat16" \
  --max_length 32768 \
  --llm_gpu_memory_utilization 0.6 \
  --fix_voice \
  --host 0.0.0.0 \
  --port 8000

The --fix_voice flag pins Spark-TTS's built-in female and male timbres.

Web UI: http://localhost:8000

Interactive API docs: http://localhost:8000/docs

For detailed deployment, please refer to: server.md

⚡ Inference Speed

Test environment: A800 GPU · Model: Spark-TTS-0.5B · Test script: speed_test.py

| Scenario | Engine | Device | Audio Length (s) | Inference Time (s) | RTF |
| --- | --- | --- | --- | --- | --- |
| Short | llama-cpp | CPU | 7.48 | 6.81 | 0.91 |
| Short | torch | GPU | 7.18 | 7.68 | 1.07 |
| Short | vllm | GPU | 7.24 | 1.66 | 0.23 |
| Short | sglang | GPU | 7.58 | 1.07 | 0.14 |
| Long | llama-cpp | CPU | 121.98 | 117.83 | 0.97 |
| Long | torch | GPU | 113.70 | 107.17 | 0.94 |
| Long | vllm | GPU | 111.82 | 7.28 | 0.07 |
| Long | sglang | GPU | 117.02 | 4.20 | 0.04 |

RTF < 1 means faster-than-real-time synthesis.
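The real-time factor (RTF) is simply inference time divided by audio length, which you can verify against the table above:

```python
def rtf(inference_s, audio_s):
    """Real-time factor: seconds of compute per second of generated audio."""
    return inference_s / audio_s

# Check two rows from the table above.
print(round(rtf(1.66, 7.24), 2))    # 0.23  (Short, vllm, GPU)
print(round(rtf(4.20, 117.02), 2))  # 0.04  (Long, sglang, GPU)
```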

⚙️ Usage Tips

  1. SparkTTS weights must be bfloat16 or float32; using float16 will cause errors.
  2. If you experience long silent gaps, try increasing repetition_penalty (> 1.0).
  3. OrpheusTTS supports inserting <tag> in text to control emotion. See LANG_MAP in orpheus_engine.py.
  4. For safety reasons, MegaTTS 3 does not publish the WaveVAE encoder. Please follow the official instructions to download it: reference audio.

🤝 Acknowledgments

⚠️ Disclaimer

FlashTTS is provided for academic research, education, and lawful purposes only, such as accessibility assistance and personalized speech synthesis. Do not use it for fraud, impersonation, deepfakes, or other illegal activities. Users are responsible for any misuse.

License

This project follows the same license as Spark-TTS. See LICENSE for details.

Star History

Star History Chart