| --- |
| frameworks: |
| - "" |
| --- |
| # PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis |
|
|
| <div align="center"> |
| <img src="assert/Introduction.png" width="600" /> |
| </div> |
|
|
| <p align="center"> |
| English | <a href="README_zh.md">中文</a> |
| </p> |
| |
| <p align="center"> |
| 📑 <a href="#">Paper</a> | 🤗 <a href="https://huggingface.co/AmapVoice/PilotTTS">HuggingFace</a> | 🤖 <a href="https://www.modelscope.cn/models/AmapVoice/PilotTTS">ModelScope</a> | 🎧 <a href="https://amapvoice.github.io/PilotTTS/">Demos</a> |
| </p> |
| |
|
|
| ## News 📝 |
|
|
| - **[2025.05]** Release Pilot-TTS base and instruct model weights |
|
|
| ## Highlight 🔥 |
|
|
| **PilotTTS** is an LLM-based text-to-speech (TTS) system that builds an intentionally simplified architecture with fully open-source components and achieves competitive performance through rigorous data engineering. |
|
|
| ### Key Features |
| - **A fully open-source data processing pipeline:** We design a multi-stage pipeline that incorporates quality assessment and enhancement, annotation, and quality filtering, where all operators are implemented using publicly available tools. This pipeline converts large-scale Internet audio into clean training data with rich annotation, achieving high-quality data generation while substantially reducing costs. |
| - **Content Consistency and Speaker Similarity Control:** On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%). |
| - **Emotion and Paralinguistic Control:** Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH). |
| - **Dialect Control:** Supports 14 Chinese dialects and enables cross-dialect synthesis, with particular strength in synthesizing from Mandarin Chinese to the target dialect. |
|
|
| ## Installation ⚙️ |
|
|
| ### Clone and install |
|
|
| ```bash |
| git clone https://github.com/xxx/pilot-tts.git |
| cd pilot-tts |
| ``` |
|
|
| ### Environment setup |
|
|
| ```bash |
| conda create -n pilot-tts python=3.10 -y |
| conda activate pilot-tts |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Model download |
|
|
| #### 1. Pilot-TTS models (our weights) |
|
|
| ```python |
| # ModelScope |
| from modelscope import snapshot_download |
| snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/') |
| |
| # HuggingFace |
| from huggingface_hub import snapshot_download |
| snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/') |
| ``` |
|
|
| This includes: `pilot_tts.pt`, `pilot_tts_instruct.pt`, and `tokenizer/`. |
|
|
| #### 2. Third-party open-source models |
|
|
| Download the following dependencies from their respective open-source projects: |
|
|
| ```python |
| from modelscope import snapshot_download |
| |
| # Qwen3-0.6B (LLM backbone) |
| snapshot_download('Qwen/Qwen3-0.6B', local_dir='pretrained_models/Qwen3-0.6B') |
| |
| # CosyVoice3 (flow-matching vocoder, includes campplus.onnx) |
| snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/CosyVoice3-0.5B') |
| ``` |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| |
| # w2v-bert-2.0 (audio feature extractor) |
| snapshot_download('facebook/w2v-bert-2.0', local_dir='pretrained_models/w2v-bert-2.0') |
| ``` |
|
|
| > Note: `wav2vec2bert_stats.pt` (from [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)) is included in the Pilot-TTS model package. |
| |
| #### Final directory structure |
| |
| ``` |
| pretrained_models/ |
| ├── pilot_tts.pt # Base model (zero-shot voice cloning) |
| ├── pilot_tts_instruct.pt # Instruct model (emotion, paralanguage, dialect) |
| ├── Qwen3-0.6B/ # LLM backbone (from Qwen) |
| ├── w2v-bert-2.0/ # Audio feature extractor (from Meta) |
| ├── wav2vec2bert_stats.pt # Feature normalization stats (from MaskGCT) |
| └── CosyVoice3-0.5B/ # Flow-matching vocoder (from FunAudioLLM) |
| ``` |
| |
| ## Quick Start 📖 |
| |
| Run all inference demos with a single command: |
| |
| ```bash |
| python demo.py |
| ``` |
| |
| ## Inference |
| |
| ### Python API |
| |
| ```python |
| from demo import load_engine, synthesize |
| |
| # Zero-shot voice cloning (base model) |
| engine = load_engine( |
| config_path="configs/infer_pilot_tts.yaml", |
| checkpoint="pretrained_models/pilot_tts.pt", |
| ) |
| synthesize(engine, text="你好,世界!", |
| prompt_wav="assert/prompt.wav", |
| output_path="output/clone.wav") |
| |
| # Load instruct model (emotion, paralanguage, dialect) |
| engine_instruct = load_engine( |
| config_path="configs/infer_pilot_tts_instruct.yaml", |
| checkpoint="pretrained_models/pilot_tts_instruct.pt", |
| ) |
| |
| # Emotion synthesis |
| synthesize(engine_instruct, text="今天天气真好啊!", |
| prompt_wav="assert/prompt.wav", |
| emotion="happy", output_path="output/happy.wav") |
| |
| # Paralanguage |
| synthesize(engine_instruct, text="这太好笑了<|LAUGH|>停不下来", |
| prompt_wav="assert/prompt.wav", |
| output_path="output/laugh.wav") |
| |
| # Dialect (Henan) |
| synthesize(engine_instruct, text="中不中啊,咱俩一块儿去吃胡辣汤吧", |
| prompt_wav="assert/prompt.wav", |
| language="zh-henan", output_path="output/henan.wav") |
| ``` |
| |
| ### Command Line |
|
|
| ```bash |
| # Zero-shot voice cloning (base model) |
| python inference.py \ |
| --checkpoint pretrained_models/pilot_tts.pt \ |
| --prompt-wav assert/prompt.wav \ |
| --text "需要合成的目标文本" \ |
| --output output/zeroshot.wav |
| |
| # Emotion synthesis (instruct model) |
| python inference.py \ |
| --config configs/infer_pilot_tts_instruct.yaml \ |
| --checkpoint pretrained_models/pilot_tts_instruct.pt \ |
| --prompt-wav assert/prompt.wav \ |
| --text "今天天气真好啊,我们去公园玩吧!" \ |
| --emotion happy \ |
| --output output/emotion.wav |
| |
| # Paralanguage (instruct model) |
| python inference.py \ |
| --config configs/infer_pilot_tts_instruct.yaml \ |
| --checkpoint pretrained_models/pilot_tts_instruct.pt \ |
| --prompt-wav assert/prompt.wav \ |
| --text "这个笑话太好笑了<|LAUGH|>我真的忍不住" \ |
| --output output/paralang.wav |
| |
| # Dialect synthesis (instruct model) |
| python inference.py \ |
| --config configs/infer_pilot_tts_instruct.yaml \ |
| --checkpoint pretrained_models/pilot_tts_instruct.pt \ |
| --prompt-wav assert/prompt.wav \ |
| --text "中不中啊,咱俩一块儿去吃胡辣汤吧" \ |
| --language zh-henan \ |
| --output output/dialect.wav |
| ``` |
|
|
| ### Supported Controls |
|
|
| | Feature | Usage | Model | |
| |---------|-------|-------| |
| | Voice Cloning | Provide prompt audio | Both | |
| | Emotions | `--emotion <tag>` | Instruct | |
| | Paralanguage | Insert tags in text | Instruct | |
| | Dialects | `--language <dialect>` | Instruct | |
|
|
| **Emotions:** |
|
|
| | Tag | 情感 | Tag | 情感 | |
| |-----|------|-----|------| |
| | `happy` | 开心 | `sad` | 悲伤 | |
| | `angry` | 愤怒 | `surprise` | 惊讶 | |
| | `fear` | 恐惧 | `disgust` | 厌恶 | |
| | `serious` | 严肃 | `concern` | 关切 | |
| | `blue` | 忧郁 | `disdain` | 轻蔑 | |
| | `neutral` | 中性/平静 | `psychology` | 心理活动 | |
| | `unknown` | 不指定情感 | | | |
|
|
| **Paralanguage tags:** |
|
|
| | Tag | Description | |
| |-----|-------------| |
| | `<\|LAUGH\|>` | 笑声 | |
| | `<\|BREATH\|>` | 呼吸声 | |
| | `<\|COUGH\|>` | 咳嗽 | |
| | `<\|CRY\|>` | 哭泣声 | |
| | `<\|LAUGH_SPAN\|>...<\|/LAUGH_SPAN\|>` | 包裹笑声文本 | |
|
|
| **Dialects:** |
|
|
| | Tag | 方言 | Tag | 方言 | |
| |-----|------|-----|------| |
| | `zh-dongbei` | 东北话 | `zh-shandong` | 山东话 | |
| | `zh-henan` | 河南话 | `zh-shan1xi` | 山西话 | |
| | `zh-minnan` | 闽南语 | `zh-gansu` | 甘肃话 | |
| | `zh-ningxia` | 宁夏话 | `zh-shanghai` | 上海话 | |
| | `zh-chongqing` | 重庆话 | `zh-hubei` | 湖北话 | |
| | `zh-hunan` | 湖南话 | `zh-jiangxi` | 江西话 | |
| | `zh-guizhou` | 贵州话 | `zh-yunnan` | 云南话 | |
|
|
| ## WebUI |
|
|
| Launch a Gradio-based interactive interface: |
|
|
| ```bash |
| python webui.py --port 9000 |
| ``` |
|
|
| ## Project Structure |
|
|
| ``` |
| pilot-tts/ |
| ├── configs/ # Inference configurations (per checkpoint) |
| ├── demo.py # Complete demo (all inference modes) |
| ├── inference.py # CLI inference entry |
| ├── webui.py # Gradio WebUI |
| ├── asset/ # Example prompt audio |
| ├── pilot_voice/ # Core model code |
| │ ├── engine.py # InferenceEngine pipeline |
| │ ├── model.py # AR model (Qwen3 backbone + audio tokens) |
| │ ├── sampling.py # RAS sampling (from VALL-E 2) |
| │ ├── utils.py # Utilities |
| │ ├── modules/ # Conformer + Perceiver modules |
| │ └── tools/ # Audio & text processing |
| ├── third_party/ |
| │ ├── cosyvoice/ # Flow-matching vocoder |
| │ └── Matcha-TTS/ # Flow matching dependency |
| ├── tokenizer/ # Custom tokenizer with special tokens |
| ├── pretrained_models/ # Model weights (not in git) |
| └── requirements.txt |
| ``` |
|
|
| ## Acknowledgements |
|
|
| - [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) — Flow-matching & Vocoder |
| - [Qwen3](https://github.com/QwenLM/Qwen3) — LLM backbone |
| - [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) — Flow matching framework |
| - [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct) — wav2vec2bert feature statistics |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{pilottts2025, |
| title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis}, |
| author={}, |
| year={2025}, |
| journal={arXiv preprint arXiv:xxxx.xxxxx} |
| } |
| ``` |
|
|
| ## License |
|
|
| Apache-2.0 |
|
|
|
|