Improve model card metadata and content

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +25 -230
README.md CHANGED
@@ -1,116 +1,45 @@
  # PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
 
  <div align="center">
  <img src="assert/Introduction.png" width="600" />
  </div>
 
- <p align="center">
- English &nbsp;|&nbsp; <a href="README_zh.md">中文</a>
- </p>
-
- <p align="center">
- 📑 <a href="#">Paper</a> &nbsp;|&nbsp; 🤗 <a href="https://huggingface.co/AmapVoice/PilotTTS">HuggingFace</a> &nbsp;|&nbsp; 🤖 <a href="https://www.modelscope.cn/models/AmapVoice/PilotTTS">ModelScope</a> &nbsp;|&nbsp; 🎧 <a href="https://amapvoice.github.io/PilotTTS/">Demos</a>
- </p>
-
 
- ## News 📝
-
- - **[2025.05]** Release Pilot-TTS base and instruct model weights
 
  ## Highlight 🔥
 
- **PilotTTS** is an LLM-based text-to-speech (TTS) system that builds an intentionally simplified architecture with fully open-source components and achieves competitive performance through rigorous data engineering.
-
- ### Key Features
- - **A fully open-source data processing pipeline:** We design a multi-stage pipeline that incorporates quality assessment and enhancement, annotation, and quality filtering, where all operators are implemented using publicly available tools. This pipeline converts large-scale Internet audio into clean training data with rich annotation, achieving high-quality data generation while substantially reducing costs.
- - **Content Consistency and Speaker Similarity Control:** On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%).
- - **Emotion and Paralinguistic Control:** Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
- - **Dialect Control:** Supports 14 Chinese dialects and enables cross-dialect synthesis, with particular strength in synthesizing from Mandarin Chinese to the target dialect.
-
- ## Installation ⚙️
-
- ### Clone and install
-
- ```bash
- git clone https://github.com/xxx/pilot-tts.git
- cd pilot-tts
- ```
 
- ### Environment setup
 
  ```bash
  conda create -n pilot-tts python=3.10 -y
  conda activate pilot-tts
  pip install -r requirements.txt
  ```
 
- ### Model download
-
- #### 1. Pilot-TTS models (our weights)
-
- ```python
- # ModelScope
- from modelscope import snapshot_download
- snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')
-
- # HuggingFace
- from huggingface_hub import snapshot_download
- snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')
- ```
-
- This includes: `pilot_tts.pt`, `pilot_tts_instruct.pt`, and `tokenizer/`.
-
- #### 2. Third-party open-source models
-
- Download the following dependencies from their respective open-source projects:
 
- ```python
- from modelscope import snapshot_download
-
- # Qwen3-0.6B (LLM backbone)
- snapshot_download('Qwen/Qwen3-0.6B', local_dir='pretrained_models/Qwen3-0.6B')
-
- # CosyVoice3 (flow-matching vocoder, includes campplus.onnx)
- snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/CosyVoice3-0.5B')
- ```
-
- ```python
- from huggingface_hub import snapshot_download
-
- # w2v-bert-2.0 (audio feature extractor)
- snapshot_download('facebook/w2v-bert-2.0', local_dir='pretrained_models/w2v-bert-2.0')
- ```
-
- > Note: `wav2vec2bert_stats.pt` (from [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)) is included in the Pilot-TTS model package.
-
- #### Final directory structure
-
- ```
- pretrained_models/
- ├── pilot_tts.pt # Base model (zero-shot voice cloning)
- ├── pilot_tts_instruct.pt # Instruct model (emotion, paralanguage, dialect)
- ├── Qwen3-0.6B/ # LLM backbone (from Qwen)
- ├── w2v-bert-2.0/ # Audio feature extractor (from Meta)
- ├── wav2vec2bert_stats.pt # Feature normalization stats (from MaskGCT)
- └── CosyVoice3-0.5B/ # Flow-matching vocoder (from FunAudioLLM)
- ```
-
- ## Quick Start 📖
-
- Run all inference demos with a single command:
-
- ```bash
- python demo.py
- ```
-
- ## Inference
-
- ### Python API
 
  ```python
  from demo import load_engine, synthesize
 
- # Zero-shot voice cloning (base model)
  engine = load_engine(
  config_path="configs/infer_pilot_tts.yaml",
  checkpoint="pretrained_models/pilot_tts.pt",
@@ -119,158 +48,24 @@ synthesize(engine, text="你好,世界!",
  prompt_wav="assert/prompt.wav",
  output_path="output/clone.wav")
 
- # Load instruct model (emotion, paralanguage, dialect)
  engine_instruct = load_engine(
  config_path="configs/infer_pilot_tts_instruct.yaml",
  checkpoint="pretrained_models/pilot_tts_instruct.pt",
  )
 
- # Emotion synthesis
  synthesize(engine_instruct, text="今天天气真好啊!",
  prompt_wav="assert/prompt.wav",
  emotion="happy", output_path="output/happy.wav")
-
- # Paralanguage
- synthesize(engine_instruct, text="这太好笑了<|LAUGH|>停不下来",
- prompt_wav="assert/prompt.wav",
- output_path="output/laugh.wav")
-
- # Dialect (Henan)
- synthesize(engine_instruct, text="中不中啊,咱俩一块儿去吃胡辣汤吧",
- prompt_wav="assert/prompt.wav",
- language="zh-henan", output_path="output/henan.wav")
- ```
-
- ### Command Line
-
- ```bash
- # Zero-shot voice cloning (base model)
- python inference.py \
- --checkpoint pretrained_models/pilot_tts.pt \
- --prompt-wav assert/prompt.wav \
- --text "需要合成的目标文本" \
- --output output/zeroshot.wav
-
- # Emotion synthesis (instruct model)
- python inference.py \
- --config configs/infer_pilot_tts_instruct.yaml \
- --checkpoint pretrained_models/pilot_tts_instruct.pt \
- --prompt-wav assert/prompt.wav \
- --text "今天天气真好啊,我们去公园玩吧!" \
- --emotion happy \
- --output output/emotion.wav
-
- # Paralanguage (instruct model)
- python inference.py \
- --config configs/infer_pilot_tts_instruct.yaml \
- --checkpoint pretrained_models/pilot_tts_instruct.pt \
- --prompt-wav assert/prompt.wav \
- --text "这个笑话太好笑了<|LAUGH|>我真的忍不住" \
- --output output/paralang.wav
-
- # Dialect synthesis (instruct model)
- python inference.py \
- --config configs/infer_pilot_tts_instruct.yaml \
- --checkpoint pretrained_models/pilot_tts_instruct.pt \
- --prompt-wav assert/prompt.wav \
- --text "中不中啊,咱俩一块儿去吃胡辣汤吧" \
- --language zh-henan \
- --output output/dialect.wav
  ```
 
- ### Supported Controls
-
- | Feature | Usage | Model |
- |---------|-------|-------|
- | Voice Cloning | Provide prompt audio | Both |
- | Emotions | `--emotion <tag>` | Instruct |
- | Paralanguage | Insert tags in text | Instruct |
- | Dialects | `--language <dialect>` | Instruct |
-
- **Emotions:**
-
- | Tag | Emotion | Tag | Emotion |
- |-----|------|-----|------|
- | `happy` | Happy | `sad` | Sad |
- | `angry` | Angry | `surprise` | Surprised |
- | `fear` | Fear | `disgust` | Disgust |
- | `serious` | Serious | `concern` | Concern |
- | `blue` | Melancholy | `disdain` | Contempt |
- | `neutral` | Neutral / calm | `psychology` | Inner thoughts |
- | `unknown` | No emotion specified | | |
-
- **Paralanguage tags:**
-
- | Tag | Description |
- |-----|-------------|
- | `<\|LAUGH\|>` | Laughter |
- | `<\|BREATH\|>` | Breathing |
- | `<\|COUGH\|>` | Cough |
- | `<\|CRY\|>` | Crying |
- | `<\|LAUGH_SPAN\|>...<\|/LAUGH_SPAN\|>` | Wraps text spoken with laughter |
-
- **Dialects:**
-
- | Tag | Dialect | Tag | Dialect |
- |-----|------|-----|------|
- | `zh-dongbei` | Northeastern (Dongbei) | `zh-shandong` | Shandong |
- | `zh-henan` | Henan | `zh-shan1xi` | Shanxi |
- | `zh-minnan` | Minnan (Southern Min) | `zh-gansu` | Gansu |
- | `zh-ningxia` | Ningxia | `zh-shanghai` | Shanghainese |
- | `zh-chongqing` | Chongqing | `zh-hubei` | Hubei |
- | `zh-hunan` | Hunan | `zh-jiangxi` | Jiangxi |
- | `zh-guizhou` | Guizhou | `zh-yunnan` | Yunnan |
-
- ## WebUI
-
- Launch a Gradio-based interactive interface:
-
- ```bash
- python webui.py --port 9000
- ```
-
- ## Project Structure
-
- ```
- pilot-tts/
- ├── configs/ # Inference configurations (per checkpoint)
- ├── demo.py # Complete demo (all inference modes)
- ├── inference.py # CLI inference entry
- ├── webui.py # Gradio WebUI
- ├── asset/ # Example prompt audio
- ├── pilot_voice/ # Core model code
- │ ├── engine.py # InferenceEngine pipeline
- │ ├── model.py # AR model (Qwen3 backbone + audio tokens)
- │ ├── sampling.py # RAS sampling (from VALL-E 2)
- │ ├── utils.py # Utilities
- │ ├── modules/ # Conformer + Perceiver modules
- │ └── tools/ # Audio & text processing
- ├── third_party/
- │ ├── cosyvoice/ # Flow-matching vocoder
- │ └── Matcha-TTS/ # Flow matching dependency
- ├── tokenizer/ # Custom tokenizer with special tokens
- ├── pretrained_models/ # Model weights (not in git)
- └── requirements.txt
- ```
-
- ## Acknowledgements
-
- - [CosyVoice](https://github.com/FunAudioLLM/CosyVoice): Flow-matching & Vocoder
- - [Qwen3](https://github.com/QwenLM/Qwen3): LLM backbone
- - [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS): Flow matching framework
- - [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct): wav2vec2bert feature statistics
-
  ## Citation
 
  ```bibtex
- @article{pilottts2025,
  title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},
- author={},
- year={2025},
- journal={arXiv preprint arXiv:xxxx.xxxxx}
  }
- ```
-
- ## License
-
- Apache-2.0
+ ---
+ license: apache-2.0
+ pipeline_tag: text-to-speech
+ ---
+
  # PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
 
  <div align="center">
  <img src="assert/Introduction.png" width="600" />
  </div>
 
+ PilotTTS is a lightweight autoregressive text-to-speech (TTS) system that achieves competitive performance through minimalist architecture and rigorous data engineering. It supports zero-shot voice cloning, emotion synthesis, paralinguistic synthesis, and various Chinese dialects.
 
+ - **Paper:** [PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis](https://arxiv.org/abs/2605.27258)
+ - **Code:** [GitHub Repository](https://github.com/AMAPVOICE/PilotTTS)
+ - **Demos:** [Project Page](https://amapvoice.github.io/PilotTTS/)
 
  ## Highlight 🔥
 
+ - **A fully open-source data processing pipeline:** Converts large-scale Internet audio into clean training data with rich annotation using publicly available tools.
+ - **Content Consistency and Speaker Similarity:** Achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%) on Seed-TTS benchmarks.
+ - **Controllable Synthesis:** Supports 11 emotion categories (e.g., Happy, Sad, Angry) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
+ - **Dialect Support:** Supports 14 Chinese dialects and enables cross-dialect synthesis.
 
+ ## Installation
 
  ```bash
+ git clone https://github.com/AMAPVOICE/PilotTTS.git
+ cd PilotTTS
  conda create -n pilot-tts python=3.10 -y
  conda activate pilot-tts
  pip install -r requirements.txt
  ```
 
+ ## Sample Usage
 
+ To use PilotTTS, you can use the following Python snippet for zero-shot voice cloning and emotion-controlled synthesis:
 
  ```python
  from demo import load_engine, synthesize
 
+ # 1. Zero-shot voice cloning (base model)
  engine = load_engine(
  config_path="configs/infer_pilot_tts.yaml",
  checkpoint="pretrained_models/pilot_tts.pt",
@@ -119,158 +48,24 @@ synthesize(engine, text="你好,世界!",
  prompt_wav="assert/prompt.wav",
  output_path="output/clone.wav")
 
+ # 2. Emotion synthesis (instruct model)
  engine_instruct = load_engine(
  config_path="configs/infer_pilot_tts_instruct.yaml",
  checkpoint="pretrained_models/pilot_tts_instruct.pt",
  )
 
  synthesize(engine_instruct, text="今天天气真好啊!",
  prompt_wav="assert/prompt.wav",
  emotion="happy", output_path="output/happy.wav")
  ```
 
  ## Citation
 
  ```bibtex
+ @article{pilottts2026,
  title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},
+ author={Bowen Li and Shaotong Guo and Zhen Wang and Yang Xiang and Mingli Jin and Yihang Lin and Jiahui Zhao and Weibo Xiong and Dongrui Li and Keming Chen and Yunze Gao and Yuze Zhou and Zeyang Lin and Yue Liu},
+ year={2026},
+ journal={arXiv preprint arXiv:2605.27258}
  }
+ ```
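
The sample usage kept by this change loads checkpoints from `pretrained_models/`, while the "Final directory structure" listing that described that folder is among the removed lines. A small pre-flight check along the following lines may help catch a missing download before `load_engine` is called. This is only a sketch: the entry names are copied from the removed listing, whereas `EXPECTED_ENTRIES` and `missing_pretrained_entries` are hypothetical names introduced here and are not part of the PilotTTS repository.

```python
from pathlib import Path

# Entries copied from the "Final directory structure" listing in the previous
# README. A trailing "/" marks a directory; everything else is a regular file.
# Assumption: this list is illustrative and may not match the current release.
EXPECTED_ENTRIES = (
    "pilot_tts.pt",
    "pilot_tts_instruct.pt",
    "Qwen3-0.6B/",
    "w2v-bert-2.0/",
    "wav2vec2bert_stats.pt",
    "CosyVoice3-0.5B/",
)


def missing_pretrained_entries(root="pretrained_models"):
    """Return the expected entries that are absent under `root`, in listing order."""
    base = Path(root)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = base / entry.rstrip("/")
        # Directories and files are checked separately so that a file with a
        # directory's name (or the reverse) still counts as missing.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Calling `missing_pretrained_entries()` from the repository root returns an empty list when every listed entry exists, and otherwise the names that still need to be downloaded.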