---
frameworks:
- ""
---

# PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
English | 中文
📑 Paper | 🤗 HuggingFace | 🤖 ModelScope | 🎧 Demos
## News 📝

- **[2025.05]** Released the Pilot-TTS base and instruct model weights.

## Highlight 🔥

**PilotTTS** is an LLM-based text-to-speech (TTS) system that pairs an intentionally simplified architecture, built from fully open-source components, with rigorous data engineering to achieve competitive performance.

### Key Features

- **A fully open-source data processing pipeline:** We design a multi-stage pipeline covering quality assessment and enhancement, annotation, and quality filtering, with every operator implemented using publicly available tools. The pipeline converts large-scale Internet audio into clean, richly annotated training data at substantially lower cost.
- **Content Consistency and Speaker Similarity Control:** On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%).
- **Emotion and Paralinguistic Control:** Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
- **Dialect Control:** Supports 14 Chinese dialects and enables cross-dialect synthesis, with particular strength in synthesizing a target dialect from a Mandarin Chinese prompt.

## Installation ⚙️

### Clone and install

```bash
git clone https://github.com/xxx/pilot-tts.git
cd pilot-tts
```

### Environment setup

```bash
conda create -n pilot-tts python=3.10 -y
conda activate pilot-tts
pip install -r requirements.txt
```

### Model download

#### 1. Pilot-TTS models (our weights)

```python
# Option A: ModelScope
from modelscope import snapshot_download
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')

# Option B: HuggingFace (alternative source; downloading from one is enough)
from huggingface_hub import snapshot_download
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')
```

This includes: `pilot_tts.pt`, `pilot_tts_instruct.pt`, and `tokenizer/`.

#### 2. Third-party open-source models

Download the following dependencies from their respective open-source projects:

```python
from modelscope import snapshot_download

# Qwen3-0.6B (LLM backbone)
snapshot_download('Qwen/Qwen3-0.6B', local_dir='pretrained_models/Qwen3-0.6B')

# CosyVoice3 (flow-matching vocoder, includes campplus.onnx)
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/CosyVoice3-0.5B')
```

```python
from huggingface_hub import snapshot_download

# w2v-bert-2.0 (audio feature extractor)
snapshot_download('facebook/w2v-bert-2.0', local_dir='pretrained_models/w2v-bert-2.0')
```

> Note: `wav2vec2bert_stats.pt` (from [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)) is included in the Pilot-TTS model package.
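
After running the downloads, it can help to confirm that everything landed where inference expects it. The following is a minimal sketch, not part of the repository: the helper name `missing_artifacts` is hypothetical, and the list of expected entries is an assumption drawn from the `local_dir` values and file names in the download snippets above. Adjust it if your layout differs.

```python
from pathlib import Path

# Assumed layout, taken from the download snippets in this README.
EXPECTED = [
    "pilot_tts.pt",            # base model weights
    "pilot_tts_instruct.pt",   # instruct model weights
    "wav2vec2bert_stats.pt",   # feature normalization stats
    "Qwen3-0.6B",              # LLM backbone directory
    "w2v-bert-2.0",            # audio feature extractor directory
    "CosyVoice3-0.5B",         # flow-matching vocoder directory
]


def missing_artifacts(root="pretrained_models"):
    """Return the expected files/directories that do not exist under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```

The check only tests for existence; it does not validate file contents or checksums.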
#### Final directory structure

```
pretrained_models/
├── pilot_tts.pt               # Base model (zero-shot voice cloning)
├── pilot_tts_instruct.pt      # Instruct model (emotion, paralanguage, dialect)
├── Qwen3-0.6B/                # LLM backbone (from Qwen)
├── w2v-bert-2.0/              # Audio feature extractor (from Meta)
├── wav2vec2bert_stats.pt      # Feature normalization stats (from MaskGCT)
└── CosyVoice3-0.5B/           # Flow-matching vocoder (from FunAudioLLM)
```

## Quick Start 📖

Run all inference demos with a single command:

```bash
python demo.py
```

## Inference

### Python API

```python
from demo import load_engine, synthesize

# Zero-shot voice cloning (base model)
engine = load_engine(
    config_path="configs/infer_pilot_tts.yaml",
    checkpoint="pretrained_models/pilot_tts.pt",
)
synthesize(engine, text="你好,世界!", prompt_wav="assert/prompt.wav",
           output_path="output/clone.wav")

# Load instruct model (emotion, paralanguage, dialect)
engine_instruct = load_engine(
    config_path="configs/infer_pilot_tts_instruct.yaml",
    checkpoint="pretrained_models/pilot_tts_instruct.pt",
)

# Emotion synthesis
synthesize(engine_instruct, text="今天天气真好啊!", prompt_wav="assert/prompt.wav",
           emotion="happy", output_path="output/happy.wav")

# Paralanguage
synthesize(engine_instruct, text="这太好笑了<|LAUGH|>停不下来", prompt_wav="assert/prompt.wav",
           output_path="output/laugh.wav")

# Dialect (Henan)
synthesize(engine_instruct, text="中不中啊,咱俩一块儿去吃胡辣汤吧", prompt_wav="assert/prompt.wav",
           language="zh-henan", output_path="output/henan.wav")
```

### Command Line

```bash
# Zero-shot voice cloning (base model)
python inference.py \
    --checkpoint pretrained_models/pilot_tts.pt \
    --prompt-wav assert/prompt.wav \
    --text "需要合成的目标文本" \
    --output output/zeroshot.wav

# Emotion synthesis (instruct model)
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav assert/prompt.wav \
    --text "今天天气真好啊,我们去公园玩吧!" \
    --emotion happy \
    --output output/emotion.wav

# Paralanguage (instruct model)
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav assert/prompt.wav \
    --text "这个笑话太好笑了<|LAUGH|>我真的忍不住" \
    --output output/paralang.wav

# Dialect synthesis (instruct model)
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav assert/prompt.wav \
    --text "中不中啊,咱俩一块儿去吃胡辣汤吧" \
    --language zh-henan \
    --output output/dialect.wav
```

### Supported Controls

| Feature | Usage | Model |
|---------|-------|-------|
| Voice Cloning | Provide prompt audio | Both |
| Emotions | `--emotion