	---
	frameworks:
	- ""
	---
	# PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

	<div align="center">
	<img src="assert/Introduction.png" width="600" />
	</div>

	<p align="center">
	English  |  <a href="README_zh.md">中文</a>
	</p>

	<p align="center">
	📑 <a href="#">Paper</a>  |  🤗 <a href="https://huggingface.co/AmapVoice/PilotTTS">HuggingFace</a>  |  🤖 <a href="https://www.modelscope.cn/models/AmapVoice/PilotTTS">ModelScope</a>  |  🎧 <a href="https://amapvoice.github.io/PilotTTS/">Demos</a>
	</p>


	## News 📝

	- [2025.05] Released the Pilot-TTS base and instruct model weights.

	## Highlight 🔥

	PilotTTS is an LLM-based text-to-speech (TTS) system. It pairs an intentionally simple architecture, built entirely from open-source components, with rigorous data engineering to reach competitive performance.

	### Key Features
	- A fully open-source data processing pipeline: We design a multi-stage pipeline covering quality assessment and enhancement, annotation, and quality filtering, with every operator implemented using publicly available tools. The pipeline converts large-scale Internet audio into clean, richly annotated training data while substantially reducing costs.
	- Content Consistency and Speaker Similarity Control: On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%).
	- Emotion and Paralinguistic Control: Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
	- Dialect Control: Supports 14 Chinese dialects and enables cross-dialect synthesis, with particular strength in synthesizing from Mandarin Chinese to the target dialect.

	## Installation ⚙️

	### Clone and install

	```bash
	git clone https://github.com/xxx/pilot-tts.git
	cd pilot-tts
	```

	### Environment setup

	```bash
	conda create -n pilot-tts python=3.10 -y
	conda activate pilot-tts
	pip install -r requirements.txt
	```

	### Model download

	#### 1. Pilot-TTS models (our weights)

	```python
	# Option 1: ModelScope
	from modelscope import snapshot_download
	snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')

	# Option 2: HuggingFace (use instead of Option 1, not in addition to it)
	from huggingface_hub import snapshot_download
	snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')
	```

	This includes: `pilot_tts.pt`, `pilot_tts_instruct.pt`, and `tokenizer/`.

	#### 2. Third-party open-source models

	Download the following dependencies from their respective open-source projects:

	```python
	from modelscope import snapshot_download

	# Qwen3-0.6B (LLM backbone)
	snapshot_download('Qwen/Qwen3-0.6B', local_dir='pretrained_models/Qwen3-0.6B')

	# CosyVoice3 (flow-matching vocoder, includes campplus.onnx)
	snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/CosyVoice3-0.5B')
	```

	```python
	from huggingface_hub import snapshot_download

	# w2v-bert-2.0 (audio feature extractor)
	snapshot_download('facebook/w2v-bert-2.0', local_dir='pretrained_models/w2v-bert-2.0')
	```

	> Note: `wav2vec2bert_stats.pt` (from [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)) is included in the Pilot-TTS model package.

	#### Final directory structure

	```
	pretrained_models/
	├── pilot_tts.pt # Base model (zero-shot voice cloning)
	├── pilot_tts_instruct.pt # Instruct model (emotion, paralanguage, dialect)
	├── Qwen3-0.6B/ # LLM backbone (from Qwen)
	├── w2v-bert-2.0/ # Audio feature extractor (from Meta)
	├── wav2vec2bert_stats.pt # Feature normalization stats (from MaskGCT)
	└── CosyVoice3-0.5B/ # Flow-matching vocoder (from FunAudioLLM)
	```
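
Before running inference, it can help to confirm that the downloads landed where the tree above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and the list of entries is copied from the tree above rather than from any official manifest.

```python
from pathlib import Path

# Entries copied from the directory tree above; a trailing "/" marks a directory.
EXPECTED_ENTRIES = [
    "pilot_tts.pt",
    "pilot_tts_instruct.pt",
    "Qwen3-0.6B/",
    "w2v-bert-2.0/",
    "wav2vec2bert_stats.pt",
    "CosyVoice3-0.5B/",
]


def missing_entries(root="pretrained_models"):
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry.rstrip("/")
        # Directories and files are checked differently so a stray file
        # named like a directory is still reported as missing.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing from pretrained_models/:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```

The check only looks at names, so it will not catch a partially downloaded checkpoint.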

	## Quick Start 📖

	Run all inference demos with a single command:

	```bash
	python demo.py
	```

	## Inference

	### Python API

	```python
	from demo import load_engine, synthesize

	# Zero-shot voice cloning (base model)
	engine = load_engine(
	    config_path="configs/infer_pilot_tts.yaml",
	    checkpoint="pretrained_models/pilot_tts.pt",
	)
	synthesize(
	    engine,
	    text="你好，世界！",
	    prompt_wav="assert/prompt.wav",
	    output_path="output/clone.wav",
	)

	# Load instruct model (emotion, paralanguage, dialect)
	engine_instruct = load_engine(
	    config_path="configs/infer_pilot_tts_instruct.yaml",
	    checkpoint="pretrained_models/pilot_tts_instruct.pt",
	)

	# Emotion synthesis
	synthesize(
	    engine_instruct,
	    text="今天天气真好啊！",
	    prompt_wav="assert/prompt.wav",
	    emotion="happy",
	    output_path="output/happy.wav",
	)

	# Paralanguage (the tag is written directly in the text)
	synthesize(
	    engine_instruct,
	    text="这太好笑了<|LAUGH|>停不下来",
	    prompt_wav="assert/prompt.wav",
	    output_path="output/laugh.wav",
	)

	# Dialect (Henan)
	synthesize(
	    engine_instruct,
	    text="中不中啊，咱俩一块儿去吃胡辣汤吧",
	    prompt_wav="assert/prompt.wav",
	    language="zh-henan",
	    output_path="output/henan.wav",
	)
	```

	### Command Line

	```bash
	# Zero-shot voice cloning (base model)
	python inference.py \
	    --checkpoint pretrained_models/pilot_tts.pt \
	    --prompt-wav assert/prompt.wav \
	    --text "需要合成的目标文本" \
	    --output output/zeroshot.wav

	# Emotion synthesis (instruct model)
	python inference.py \
	    --config configs/infer_pilot_tts_instruct.yaml \
	    --checkpoint pretrained_models/pilot_tts_instruct.pt \
	    --prompt-wav assert/prompt.wav \
	    --text "今天天气真好啊，我们去公园玩吧！" \
	    --emotion happy \
	    --output output/emotion.wav

	# Paralanguage (instruct model)
	python inference.py \
	    --config configs/infer_pilot_tts_instruct.yaml \
	    --checkpoint pretrained_models/pilot_tts_instruct.pt \
	    --prompt-wav assert/prompt.wav \
	    --text "这个笑话太好笑了<|LAUGH|>我真的忍不住" \
	    --output output/paralang.wav

	# Dialect synthesis (instruct model)
	python inference.py \
	    --config configs/infer_pilot_tts_instruct.yaml \
	    --checkpoint pretrained_models/pilot_tts_instruct.pt \
	    --prompt-wav assert/prompt.wav \
	    --text "中不中啊，咱俩一块儿去吃胡辣汤吧" \
	    --language zh-henan \
	    --output output/dialect.wav
	```
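
When scripting many syntheses, the commands above can be assembled programmatically. The sketch below is illustrative only: `build_inference_command` is a hypothetical helper, it uses just the flags that appear in the examples above, and whether `--emotion` and `--language` may be combined in one call is not stated here, so that is left to the caller.

```python
import shlex
import sys


def build_inference_command(checkpoint, prompt_wav, text, output,
                            config=None, emotion=None, language=None):
    """Assemble an argument list for inference.py (hypothetical helper).

    Only flags shown in the command-line examples are used. `config` is
    optional because the base-model example omits it.
    """
    cmd = [sys.executable, "inference.py"]
    if config is not None:
        cmd += ["--config", config]
    cmd += ["--checkpoint", checkpoint, "--prompt-wav", prompt_wav, "--text", text]
    if emotion is not None:
        cmd += ["--emotion", emotion]
    if language is not None:
        cmd += ["--language", language]
    cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command(
        checkpoint="pretrained_models/pilot_tts_instruct.pt",
        prompt_wav="assert/prompt.wav",
        text="这个笑话太好笑了<|LAUGH|>我真的忍不住",
        output="output/paralang.wav",
        config="configs/infer_pilot_tts_instruct.yaml",
    )
    # Print a shell-quoted form for inspection; nothing is executed here.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would run it. Using an argument list rather than a shell string means the `<`, `|`, and `>` characters in paralanguage tags need no extra quoting.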

	### Supported Controls

	| Feature | Usage | Model |
	|---------|-------|-------|
	| Voice Cloning | Provide prompt audio | Both |
	| Emotions | `--emotion <tag>` | Instruct |
	| Paralanguage | Insert tags in text | Instruct |
	| Dialects | `--language <dialect>` | Instruct |

	Emotions:

	| Tag | Emotion | Tag | Emotion |
	|-----|---------|-----|---------|
	| `happy` | Happy | `sad` | Sad |
	| `angry` | Angry | `surprise` | Surprise |
	| `fear` | Fear | `disgust` | Disgust |
	| `serious` | Serious | `concern` | Concern |
	| `blue` | Blue (melancholy) | `disdain` | Contempt |
	| `neutral` | Neutral / calm | `psychology` | Inner monologue |
	| `unknown` | No emotion specified | | |

	Paralanguage tags:

	| Tag | Description |
	|-----|-------------|
	| `<\|LAUGH\|>` | Laughter |
	| `<\|BREATH\|>` | Breathing |
	| `<\|COUGH\|>` | Cough |
	| `<\|CRY\|>` | Crying |
	| `<\|LAUGH_SPAN\|>...<\|/LAUGH_SPAN\|>` | Wraps text spoken with laughter |
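
A typo in a tag is easy to miss, so input text can be checked before synthesis. The sketch below is an assumption-laden illustration, not repository code: the helper names are hypothetical, the tag set is copied from the table above, and the rule that `LAUGH_SPAN` regions do not nest is an assumption rather than documented behaviour.

```python
import re

# Tag names copied from the table above.
KNOWN_TAGS = {"LAUGH", "BREATH", "COUGH", "CRY", "LAUGH_SPAN"}
# Matches any <|NAME|> or <|/NAME|> marker; group 1 is "/" for a closing marker.
TAG_PATTERN = re.compile(r"<\|(/?)([A-Z_]+)\|>")


def find_unknown_tags(text):
    """Return tag names found in `text` that are not in the documented set."""
    return [m.group(2) for m in TAG_PATTERN.finditer(text)
            if m.group(2) not in KNOWN_TAGS]


def laugh_spans_balanced(text):
    """Check that each <|LAUGH_SPAN|> is closed, assuming spans do not nest."""
    depth = 0
    for m in TAG_PATTERN.finditer(text):
        if m.group(2) != "LAUGH_SPAN":
            continue
        depth += -1 if m.group(1) else 1
        if depth not in (0, 1):  # closing before opening, or a nested open
            return False
    return depth == 0


def strip_tags(text):
    """Remove all tag markers, e.g. to count the characters actually spoken."""
    return TAG_PATTERN.sub("", text)
```

For example, `find_unknown_tags("这太好笑了<|LAUGH|>停不下来")` returns an empty list, while a misspelled tag such as `<|LAUHG|>` would be reported.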

	Dialects:

	| Tag | Dialect | Tag | Dialect |
	|-----|---------|-----|---------|
	| `zh-dongbei` | Northeastern (Dongbei) | `zh-shandong` | Shandong |
	| `zh-henan` | Henan | `zh-shan1xi` | Shanxi |
	| `zh-minnan` | Southern Min (Minnan) | `zh-gansu` | Gansu |
	| `zh-ningxia` | Ningxia | `zh-shanghai` | Shanghainese |
	| `zh-chongqing` | Chongqing | `zh-hubei` | Hubei |
	| `zh-hunan` | Hunan | `zh-jiangxi` | Jiangxi |
	| `zh-guizhou` | Guizhou | `zh-yunnan` | Yunnan |
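
The control tables can also drive a small pre-flight check that validates tags and picks the matching checkpoint. This is a sketch under stated assumptions: `select_model` is a hypothetical helper, the tag sets are copied from the tables above and may not be exhaustive (for instance, a value for plain Mandarin is not listed here), and the config and checkpoint paths are the ones used in the inference examples.

```python
# Tags copied from the Emotions and Dialects tables above.
EMOTION_TAGS = {
    "happy", "sad", "angry", "surprise", "fear", "disgust", "serious",
    "concern", "blue", "disdain", "neutral", "psychology", "unknown",
}
DIALECT_TAGS = {
    "zh-dongbei", "zh-shandong", "zh-henan", "zh-shan1xi", "zh-minnan",
    "zh-gansu", "zh-ningxia", "zh-shanghai", "zh-chongqing", "zh-hubei",
    "zh-hunan", "zh-jiangxi", "zh-guizhou", "zh-yunnan",
}

# (config, checkpoint) pairs as used in the inference examples.
BASE = ("configs/infer_pilot_tts.yaml", "pretrained_models/pilot_tts.pt")
INSTRUCT = ("configs/infer_pilot_tts_instruct.yaml",
            "pretrained_models/pilot_tts_instruct.pt")


def select_model(text, emotion=None, language=None):
    """Validate control tags and return the (config, checkpoint) pair to load.

    Assumption: any emotion, dialect, or paralanguage tag requires the
    instruct model, as the Supported Controls table indicates.
    """
    if emotion is not None and emotion not in EMOTION_TAGS:
        raise ValueError(f"Unknown emotion tag: {emotion!r}")
    if language is not None and language not in DIALECT_TAGS:
        raise ValueError(f"Unknown dialect tag: {language!r}")
    uses_paralanguage = "<|" in text
    if emotion is not None or language is not None or uses_paralanguage:
        return INSTRUCT
    return BASE
```

Plain voice cloning, as in `select_model("你好，世界！")`, resolves to the base pair, and anything using a control resolves to the instruct pair.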

	## WebUI

	Launch a Gradio-based interactive interface:

	```bash
	python webui.py --port 9000
	```

	## Project Structure

	```
	pilot-tts/
	├── configs/ # Inference configurations (per checkpoint)
	├── demo.py # Complete demo (all inference modes)
	├── inference.py # CLI inference entry
	├── webui.py # Gradio WebUI
	├── assert/ # Example prompt audio
	├── pilot_voice/ # Core model code
	│ ├── engine.py # InferenceEngine pipeline
	│ ├── model.py # AR model (Qwen3 backbone + audio tokens)
	│ ├── sampling.py # RAS sampling (from VALL-E 2)
	│ ├── utils.py # Utilities
	│ ├── modules/ # Conformer + Perceiver modules
	│ └── tools/ # Audio & text processing
	├── third_party/
	│ ├── cosyvoice/ # Flow-matching vocoder
	│ └── Matcha-TTS/ # Flow matching dependency
	├── tokenizer/ # Custom tokenizer with special tokens
	├── pretrained_models/ # Model weights (not in git)
	└── requirements.txt
	```

	## Acknowledgements

	- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) — Flow-matching & Vocoder
	- [Qwen3](https://github.com/QwenLM/Qwen3) — LLM backbone
	- [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) — Flow matching framework
	- [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct) — wav2vec2bert feature statistics

	## Citation

	```bibtex
	@article{pilottts2025,
	title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},
	author={},
	year={2025},
	journal={arXiv preprint arXiv:xxxx.xxxxx}
	}
	```

	## License

	Apache-2.0