Improve model card metadata and content
#1
by nielsr HF Staff - opened
README.md
CHANGED
|
@@ -1,116 +1,45 @@
|
| 1 |
# PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
|
| 2 |
|
| 3 |
<div align="center">
|
| 4 |
<img src="assert/Introduction.png" width="600" />
|
| 5 |
</div>
|
| 6 |
|
| 7 |
-
|
| 8 |
-
English | <a href="README_zh.md">中文</a>
|
| 9 |
-
</p>
|
| 10 |
-
|
| 11 |
-
<p align="center">
|
| 12 |
-
📑 <a href="#">Paper</a> | 🤗 <a href="https://huggingface.co/AmapVoice/PilotTTS">HuggingFace</a> | 🤖 <a href="https://www.modelscope.cn/models/AmapVoice/PilotTTS">ModelScope</a> | 🎧 <a href="https://amapvoice.github.io/PilotTTS/">Demos</a>
|
| 13 |
-
</p>
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
- **
|
| 19 |
|
| 20 |
## Highlight 🔥
|
| 21 |
|
| 22 |
-
**
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
- **
|
| 26 |
-
- **Content Consistency and Speaker Similarity Control:** On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%).
|
| 27 |
-
- **Emotion and Paralinguistic Control:** Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
|
| 28 |
-
- **Dialect Control:** Supports 14 Chinese dialects and enables cross-dialect synthesis, with particular strength in synthesizing from Mandarin Chinese to the target dialect.
|
| 29 |
-
|
| 30 |
-
## Installation ⚙️
|
| 31 |
-
|
| 32 |
-
### Clone and install
|
| 33 |
-
|
| 34 |
-
```bash
|
| 35 |
-
git clone https://github.com/xxx/pilot-tts.git
|
| 36 |
-
cd pilot-tts
|
| 37 |
-
```
|
| 38 |
|
| 39 |
-
##
|
| 40 |
|
| 41 |
```bash
|
| 42 |
conda create -n pilot-tts python=3.10 -y
|
| 43 |
conda activate pilot-tts
|
| 44 |
pip install -r requirements.txt
|
| 45 |
```
|
| 46 |
|
| 47 |
-
##
|
| 48 |
-
|
| 49 |
-
#### 1. Pilot-TTS models (our weights)
|
| 50 |
-
|
| 51 |
-
```python
|
| 52 |
-
# ModelScope
|
| 53 |
-
from modelscope import snapshot_download
|
| 54 |
-
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')
|
| 55 |
-
|
| 56 |
-
# HuggingFace
|
| 57 |
-
from huggingface_hub import snapshot_download
|
| 58 |
-
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')
|
| 59 |
-
```
|
| 60 |
-
|
| 61 |
-
This includes: `pilot_tts.pt`, `pilot_tts_instruct.pt`, and `tokenizer/`.
|
| 62 |
-
|
| 63 |
-
#### 2. Third-party open-source models
|
| 64 |
-
|
| 65 |
-
Download the following dependencies from their respective open-source projects:
|
| 66 |
|
| 67 |
-
|
| 68 |
-
from modelscope import snapshot_download
|
| 69 |
-
|
| 70 |
-
# Qwen3-0.6B (LLM backbone)
|
| 71 |
-
snapshot_download('Qwen/Qwen3-0.6B', local_dir='pretrained_models/Qwen3-0.6B')
|
| 72 |
-
|
| 73 |
-
# CosyVoice3 (flow-matching vocoder, includes campplus.onnx)
|
| 74 |
-
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/CosyVoice3-0.5B')
|
| 75 |
-
```
|
| 76 |
-
|
| 77 |
-
```python
|
| 78 |
-
from huggingface_hub import snapshot_download
|
| 79 |
-
|
| 80 |
-
# w2v-bert-2.0 (audio feature extractor)
|
| 81 |
-
snapshot_download('facebook/w2v-bert-2.0', local_dir='pretrained_models/w2v-bert-2.0')
|
| 82 |
-
```
|
| 83 |
-
|
| 84 |
-
> Note: `wav2vec2bert_stats.pt` (from [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)) is included in the Pilot-TTS model package.
|
| 85 |
-
|
| 86 |
-
#### Final directory structure
|
| 87 |
-
|
| 88 |
-
```
|
| 89 |
-
pretrained_models/
|
| 90 |
-
├── pilot_tts.pt # Base model (zero-shot voice cloning)
|
| 91 |
-
├── pilot_tts_instruct.pt # Instruct model (emotion, paralanguage, dialect)
|
| 92 |
-
├── Qwen3-0.6B/ # LLM backbone (from Qwen)
|
| 93 |
-
├── w2v-bert-2.0/ # Audio feature extractor (from Meta)
|
| 94 |
-
├── wav2vec2bert_stats.pt # Feature normalization stats (from MaskGCT)
|
| 95 |
-
└── CosyVoice3-0.5B/ # Flow-matching vocoder (from FunAudioLLM)
|
| 96 |
-
```
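A minimal sketch for checking the layout above before running inference. The helper name `missing_entries` and the constant `EXPECTED_ENTRIES` are hypothetical and not part of the repository; the entry names are copied from the directory listing.

```python
from pathlib import Path

# Entries copied from the directory listing; a trailing "/" marks a directory.
EXPECTED_ENTRIES = [
    "pilot_tts.pt",
    "pilot_tts_instruct.pt",
    "Qwen3-0.6B/",
    "w2v-bert-2.0/",
    "wav2vec2bert_stats.pt",
    "CosyVoice3-0.5B/",
]


def missing_entries(root):
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    # Prints the missing entries, or a short confirmation when none are missing.
    print(missing_entries("pretrained_models") or "layout looks complete")
```

The check only looks at names and types; it does not verify file contents or checksums.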
|
| 97 |
-
|
| 98 |
-
## Quick Start 📖
|
| 99 |
-
|
| 100 |
-
Run all inference demos with a single command:
|
| 101 |
-
|
| 102 |
-
```bash
|
| 103 |
-
python demo.py
|
| 104 |
-
```
|
| 105 |
-
|
| 106 |
-
## Inference
|
| 107 |
-
|
| 108 |
-
### Python API
|
| 109 |
|
| 110 |
```python
|
| 111 |
from demo import load_engine, synthesize
|
| 112 |
|
| 113 |
-
# Zero-shot voice cloning (base model)
|
| 114 |
engine = load_engine(
|
| 115 |
config_path="configs/infer_pilot_tts.yaml",
|
| 116 |
checkpoint="pretrained_models/pilot_tts.pt",
|
|
@@ -119,158 +48,24 @@ synthesize(engine, text="你好,世界!",
|
|
| 119 |
prompt_wav="assert/prompt.wav",
|
| 120 |
output_path="output/clone.wav")
|
| 121 |
|
| 122 |
-
#
|
| 123 |
engine_instruct = load_engine(
|
| 124 |
config_path="configs/infer_pilot_tts_instruct.yaml",
|
| 125 |
checkpoint="pretrained_models/pilot_tts_instruct.pt",
|
| 126 |
)
|
| 127 |
|
| 128 |
-
# Emotion synthesis
|
| 129 |
synthesize(engine_instruct, text="今天天气真好啊!",
|
| 130 |
prompt_wav="assert/prompt.wav",
|
| 131 |
emotion="happy", output_path="output/happy.wav")
|
| 132 |
-
|
| 133 |
-
# Paralanguage
|
| 134 |
-
synthesize(engine_instruct, text="这太好笑了<|LAUGH|>停不下来",
|
| 135 |
-
prompt_wav="assert/prompt.wav",
|
| 136 |
-
output_path="output/laugh.wav")
|
| 137 |
-
|
| 138 |
-
# Dialect (Henan)
|
| 139 |
-
synthesize(engine_instruct, text="中不中啊,咱俩一块儿去吃胡辣汤吧",
|
| 140 |
-
prompt_wav="assert/prompt.wav",
|
| 141 |
-
language="zh-henan", output_path="output/henan.wav")
|
| 142 |
-
```
|
| 143 |
-
|
| 144 |
-
### Command Line
|
| 145 |
-
|
| 146 |
-
```bash
|
| 147 |
-
# Zero-shot voice cloning (base model)
|
| 148 |
-
python inference.py \
|
| 149 |
-
--checkpoint pretrained_models/pilot_tts.pt \
|
| 150 |
-
--prompt-wav assert/prompt.wav \
|
| 151 |
-
--text "需要合成的目标文本" \
|
| 152 |
-
--output output/zeroshot.wav
|
| 153 |
-
|
| 154 |
-
# Emotion synthesis (instruct model)
|
| 155 |
-
python inference.py \
|
| 156 |
-
--config configs/infer_pilot_tts_instruct.yaml \
|
| 157 |
-
--checkpoint pretrained_models/pilot_tts_instruct.pt \
|
| 158 |
-
--prompt-wav assert/prompt.wav \
|
| 159 |
-
--text "今天天气真好啊,我们去公园玩吧!" \
|
| 160 |
-
--emotion happy \
|
| 161 |
-
--output output/emotion.wav
|
| 162 |
-
|
| 163 |
-
# Paralanguage (instruct model)
|
| 164 |
-
python inference.py \
|
| 165 |
-
--config configs/infer_pilot_tts_instruct.yaml \
|
| 166 |
-
--checkpoint pretrained_models/pilot_tts_instruct.pt \
|
| 167 |
-
--prompt-wav assert/prompt.wav \
|
| 168 |
-
--text "这个笑话太好笑了<|LAUGH|>我真的忍不住" \
|
| 169 |
-
--output output/paralang.wav
|
| 170 |
-
|
| 171 |
-
# Dialect synthesis (instruct model)
|
| 172 |
-
python inference.py \
|
| 173 |
-
--config configs/infer_pilot_tts_instruct.yaml \
|
| 174 |
-
--checkpoint pretrained_models/pilot_tts_instruct.pt \
|
| 175 |
-
--prompt-wav assert/prompt.wav \
|
| 176 |
-
--text "中不中啊,咱俩一块儿去吃胡辣汤吧" \
|
| 177 |
-
--language zh-henan \
|
| 178 |
-
--output output/dialect.wav
|
| 179 |
```
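The command-line calls above share one shape, so they could be assembled programmatically. The sketch below is an assumption-laden illustration: `build_inference_command` is a hypothetical helper, and it only uses the flags that appear in the examples above (`--config`, `--checkpoint`, `--prompt-wav`, `--text`, `--emotion`, `--language`, `--output`). It builds the argument list without running anything.

```python
import shlex


def build_inference_command(checkpoint, prompt_wav, text, output,
                            config=None, emotion=None, language=None):
    """Assemble an argv list for inference.py (hypothetical helper)."""
    cmd = ["python", "inference.py"]
    if config is not None:
        cmd += ["--config", config]
    cmd += ["--checkpoint", checkpoint,
            "--prompt-wav", prompt_wav,
            "--text", text]
    # Optional controls; the README lists these for the instruct model.
    if emotion is not None:
        cmd += ["--emotion", emotion]
    if language is not None:
        cmd += ["--language", language]
    cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command(
        checkpoint="pretrained_models/pilot_tts_instruct.pt",
        prompt_wav="assert/prompt.wav",
        text="今天天气真好啊,我们去公园玩吧!",
        output="output/emotion.wav",
        config="configs/infer_pilot_tts_instruct.yaml",
        emotion="happy",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` avoids manual shell quoting of the text argument.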
|
| 180 |
|
| 181 |
-
### Supported Controls
|
| 182 |
-
|
| 183 |
-
| Feature | Usage | Model |
|
| 184 |
-
|---------|-------|-------|
|
| 185 |
-
| Voice Cloning | Provide prompt audio | Both |
|
| 186 |
-
| Emotions | `--emotion <tag>` | Instruct |
|
| 187 |
-
| Paralanguage | Insert tags in text | Instruct |
|
| 188 |
-
| Dialects | `--language <dialect>` | Instruct |
|
| 189 |
-
|
| 190 |
-
**Emotions:**
|
| 191 |
-
|
| 192 |
-
| Tag | Emotion | Tag | Emotion |
|
| 193 |
-
|-----|------|-----|------|
|
| 194 |
-
| `happy` | Happy | `sad` | Sad |
|
| 195 |
-
| `angry` | Angry | `surprise` | Surprised |
|
| 196 |
-
| `fear` | Fear | `disgust` | Disgust |
|
| 197 |
-
| `serious` | Serious | `concern` | Concern |
|
| 198 |
-
| `blue` | Melancholy | `disdain` | Contempt |
|
| 199 |
-
| `neutral` | Neutral / calm | `psychology` | Inner monologue |
|
| 200 |
-
| `unknown` | No emotion specified | | |
|
| 201 |
-
|
| 202 |
-
**Paralanguage tags:**
|
| 203 |
-
|
| 204 |
-
| Tag | Description |
|
| 205 |
-
|-----|-------------|
|
| 206 |
-
| `<\|LAUGH\|>` | Laughter |
|
| 207 |
-
| `<\|BREATH\|>` | Breathing |
|
| 208 |
-
| `<\|COUGH\|>` | Cough |
|
| 209 |
-
| `<\|CRY\|>` | Crying |
|
| 210 |
-
| `<\|LAUGH_SPAN\|>...<\|/LAUGH_SPAN\|>` | Wraps text spoken while laughing |
|
| 211 |
-
|
| 212 |
-
**Dialects:**
|
| 213 |
-
|
| 214 |
-
| Tag | Dialect | Tag | Dialect |
|
| 215 |
-
|-----|------|-----|------|
|
| 216 |
-
| `zh-dongbei` | Northeastern Mandarin (Dongbei) | `zh-shandong` | Shandong dialect |
|
| 217 |
-
| `zh-henan` | Henan dialect | `zh-shan1xi` | Shanxi dialect |
|
| 218 |
-
| `zh-minnan` | Southern Min (Minnan) | `zh-gansu` | Gansu dialect |
|
| 219 |
-
| `zh-ningxia` | Ningxia dialect | `zh-shanghai` | Shanghainese |
|
| 220 |
-
| `zh-chongqing` | Chongqing dialect | `zh-hubei` | Hubei dialect |
|
| 221 |
-
| `zh-hunan` | Hunan dialect | `zh-jiangxi` | Jiangxi dialect |
|
| 222 |
-
| `zh-guizhou` | Guizhou dialect | `zh-yunnan` | Yunnan dialect |
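
The control tags in the tables above can be checked before a request reaches the model. The following is a sketch under stated assumptions: `check_controls` and the three constants are hypothetical names, the tag sets are copied from the tables, and the regular expression assumes every paralanguage tag has the `<|NAME|>` form shown there.

```python
import re

# Tag sets copied from the Emotions, Paralanguage, and Dialects tables.
EMOTION_TAGS = {
    "happy", "sad", "angry", "surprise", "fear", "disgust", "serious",
    "concern", "blue", "disdain", "neutral", "psychology", "unknown",
}
DIALECT_TAGS = {
    "zh-dongbei", "zh-shandong", "zh-henan", "zh-shan1xi", "zh-minnan",
    "zh-gansu", "zh-ningxia", "zh-shanghai", "zh-chongqing", "zh-hubei",
    "zh-hunan", "zh-jiangxi", "zh-guizhou", "zh-yunnan",
}
PARALANGUAGE_TOKENS = {
    "LAUGH", "BREATH", "COUGH", "CRY", "LAUGH_SPAN", "/LAUGH_SPAN",
}

# Matches <|NAME|> and <|/NAME|> and captures the name part.
_TAG_PATTERN = re.compile(r"<\|(/?[A-Z_]+)\|>")


def check_controls(text, emotion=None, language=None):
    """Return a list of human-readable problems; empty means all tags are known."""
    problems = []
    if emotion is not None and emotion not in EMOTION_TAGS:
        problems.append(f"unknown emotion tag: {emotion}")
    if language is not None and language not in DIALECT_TAGS:
        problems.append(f"unknown dialect tag: {language}")
    for token in _TAG_PATTERN.findall(text):
        if token not in PARALANGUAGE_TOKENS:
            problems.append(f"unknown paralanguage tag: <|{token}|>")
    return problems


if __name__ == "__main__":
    print(check_controls("这太好笑了<|LAUGH|>停不下来", emotion="happy"))
```

The function reports problems instead of raising, so a caller can decide whether an unknown tag is fatal.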
|
| 223 |
-
|
| 224 |
-
## WebUI
|
| 225 |
-
|
| 226 |
-
Launch a Gradio-based interactive interface:
|
| 227 |
-
|
| 228 |
-
```bash
|
| 229 |
-
python webui.py --port 9000
|
| 230 |
-
```
|
| 231 |
-
|
| 232 |
-
## Project Structure
|
| 233 |
-
|
| 234 |
-
```
|
| 235 |
-
pilot-tts/
|
| 236 |
-
├── configs/ # Inference configurations (per checkpoint)
|
| 237 |
-
├── demo.py # Complete demo (all inference modes)
|
| 238 |
-
├── inference.py # CLI inference entry
|
| 239 |
-
├── webui.py # Gradio WebUI
|
| 240 |
-
├── asset/ # Example prompt audio
|
| 241 |
-
├── pilot_voice/ # Core model code
|
| 242 |
-
│ ├── engine.py # InferenceEngine pipeline
|
| 243 |
-
│ ├── model.py # AR model (Qwen3 backbone + audio tokens)
|
| 244 |
-
│ ├── sampling.py # RAS sampling (from VALL-E 2)
|
| 245 |
-
│ ├── utils.py # Utilities
|
| 246 |
-
│ ├── modules/ # Conformer + Perceiver modules
|
| 247 |
-
│ └── tools/ # Audio & text processing
|
| 248 |
-
├── third_party/
|
| 249 |
-
│ ├── cosyvoice/ # Flow-matching vocoder
|
| 250 |
-
│ └── Matcha-TTS/ # Flow matching dependency
|
| 251 |
-
├── tokenizer/ # Custom tokenizer with special tokens
|
| 252 |
-
├── pretrained_models/ # Model weights (not in git)
|
| 253 |
-
└── requirements.txt
|
| 254 |
-
```
|
| 255 |
-
|
| 256 |
-
## Acknowledgements
|
| 257 |
-
|
| 258 |
-
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) — Flow-matching & Vocoder
|
| 259 |
-
- [Qwen3](https://github.com/QwenLM/Qwen3) — LLM backbone
|
| 260 |
-
- [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) — Flow matching framework
|
| 261 |
-
- [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct) — wav2vec2bert feature statistics
|
| 262 |
-
|
| 263 |
## Citation
|
| 264 |
|
| 265 |
```bibtex
|
| 266 |
-
@article{
|
| 267 |
title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},
|
| 268 |
-
author={},
|
| 269 |
-
year={
|
| 270 |
-
journal={arXiv preprint arXiv:
|
| 271 |
}
|
| 272 |
-
```
|
| 273 |
-
|
| 274 |
-
## License
|
| 275 |
-
|
| 276 |
-
Apache-2.0
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
pipeline_tag: text-to-speech
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
# PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
|
| 7 |
|
| 8 |
<div align="center">
|
| 9 |
<img src="assert/Introduction.png" width="600" />
|
| 10 |
</div>
|
| 11 |
|
| 12 |
+
PilotTTS is a lightweight autoregressive text-to-speech (TTS) system that achieves competitive performance through minimalist architecture and rigorous data engineering. It supports zero-shot voice cloning, emotion synthesis, paralinguistic synthesis, and various Chinese dialects.
|
| 13 |
|
| 14 |
+
- **Paper:** [PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis](https://arxiv.org/abs/2605.27258)
|
| 15 |
+
- **Code:** [GitHub Repository](https://github.com/AMAPVOICE/PilotTTS)
|
| 16 |
+
- **Demos:** [Project Page](https://amapvoice.github.io/PilotTTS/)
|
| 17 |
|
| 18 |
## Highlight 🔥
|
| 19 |
|
| 20 |
+
- **A fully open-source data processing pipeline:** Converts large-scale Internet audio into clean training data with rich annotation using publicly available tools.
|
| 21 |
+
- **Content Consistency and Speaker Similarity:** Achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%) on Seed-TTS benchmarks.
|
| 22 |
+
- **Controllable Synthesis:** Supports 11 emotion categories (e.g., Happy, Sad, Angry) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
|
| 23 |
+
- **Dialect Support:** Supports 14 Chinese dialects and enables cross-dialect synthesis.
|
| 24 |
|
| 25 |
+
## Installation
|
| 26 |
|
| 27 |
```bash
|
| 28 |
+
git clone https://github.com/AMAPVOICE/PilotTTS.git
|
| 29 |
+
cd PilotTTS
|
| 30 |
conda create -n pilot-tts python=3.10 -y
|
| 31 |
conda activate pilot-tts
|
| 32 |
pip install -r requirements.txt
|
| 33 |
```
|
| 34 |
|
| 35 |
+
## Sample Usage
|
| 36 |
|
| 37 |
+
The following Python snippet demonstrates zero-shot voice cloning with the base model and emotion-controlled synthesis with the instruct model:
|
| 38 |
|
| 39 |
```python
|
| 40 |
from demo import load_engine, synthesize
|
| 41 |
|
| 42 |
+
# 1. Zero-shot voice cloning (base model)
|
| 43 |
engine = load_engine(
|
| 44 |
config_path="configs/infer_pilot_tts.yaml",
|
| 45 |
checkpoint="pretrained_models/pilot_tts.pt",
|
|
|
|
| 48 |
prompt_wav="assert/prompt.wav",
|
| 49 |
output_path="output/clone.wav")
|
| 50 |
|
| 51 |
+
# 2. Emotion synthesis (instruct model)
|
| 52 |
engine_instruct = load_engine(
|
| 53 |
config_path="configs/infer_pilot_tts_instruct.yaml",
|
| 54 |
checkpoint="pretrained_models/pilot_tts_instruct.pt",
|
| 55 |
)
|
| 56 |
|
|
|
|
| 57 |
synthesize(engine_instruct, text="今天天气真好啊!",
|
| 58 |
prompt_wav="assert/prompt.wav",
|
| 59 |
emotion="happy", output_path="output/happy.wav")
|
| 60 |
```
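
To render the same sentence in several emotions, the call pattern above can be wrapped in a loop. This is a sketch, not part of the repository: `synthesize_emotions` is a hypothetical helper, and it takes the synthesis function as an argument so that it assumes only the keyword arguments visible in the snippet above (`text`, `prompt_wav`, `emotion`, `output_path`).

```python
from pathlib import Path


def synthesize_emotions(synthesize_fn, engine, text, prompt_wav, emotions, out_dir):
    """Call `synthesize_fn` once per emotion tag and return the output paths.

    `synthesize_fn` is expected to accept the same keyword arguments as the
    `synthesize` call shown above; this is an assumption based on that snippet.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for emotion in emotions:
        output_path = str(out_dir / f"{emotion}.wav")
        synthesize_fn(engine, text=text, prompt_wav=prompt_wav,
                      emotion=emotion, output_path=output_path)
        paths.append(output_path)
    return paths
```

With the repository installed, usage might look like `synthesize_emotions(synthesize, engine_instruct, "今天天气真好啊!", "assert/prompt.wav", ["happy", "sad"], "output")`.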
|
| 61 |
|
| 62 |
## Citation
|
| 63 |
|
| 64 |
```bibtex
|
| 65 |
+
@article{pilottts2026,
|
| 66 |
title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},
|
| 67 |
+
author={Bowen Li and Shaotong Guo and Zhen Wang and Yang Xiang and Mingli Jin and Yihang Lin and Jiahui Zhao and Weibo Xiong and Dongrui Li and Keming Chen and Yunze Gao and Yuze Zhou and Zeyang Lin and Yue Liu},
|
| 68 |
+
year={2026},
|
| 69 |
+
journal={arXiv preprint arXiv:2605.27258}
|
| 70 |
}
|
| 71 |
+
```
|