Text-to-Speech
ONNX
Safetensors
aluminumbox committed on
Commit
ed5fc9f
·
verified ·
1 Parent(s): 36df8ee

Update README.md

  1. README.md +99 -18
README.md CHANGED
@@ -10,21 +10,21 @@ language:
10
  - it
11
  - ru
12
  ---
13
- [![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)](https://github.com/Akshay090/svg-banners)
14
 
15
- ## 👉🏻 [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) 👈🏻
16
 
17
- **Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/studios/FunAudioLLM/Fun-CosyVoice3-0.5B); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
18
 
19
- **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/spaces/FunAudioLLM/CosyVoice2-0.5B)
20
 
21
- **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice-300M)
22
 
23
  ## Highlight🔥
24
 
25
  **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
26
  ### Key Features
27
- - **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
28
  - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
29
  - **Pronunciation Inpainting**: Supports pronunciation inpainting with Chinese Pinyin and English CMU phonemes, giving finer control over pronunciation and making the model better suited to production use.
30
  - **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
@@ -36,7 +36,7 @@ language:
36
 
37
  - [x] 2025/12
38
 
39
- - [x] release Fun-CosyVoice3-0.5B-2512 base model and its training/inference script
40
  - [x] release Fun-CosyVoice3-0.5B modelscope gradio space
41
 
42
  - [x] 2025/08
@@ -45,7 +45,7 @@ language:
45
 
46
  - [x] 2025/07
47
 
48
- - [x] release CosyVoice 3.0 eval set
49
 
50
  - [x] 2025/05
51
 
@@ -72,7 +72,8 @@ language:
72
  - [x] Fastapi server and client
73
 
74
  ## Evaluation
75
- | Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑|
 
76
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
77
  | Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
78
  | Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
@@ -91,6 +92,7 @@ language:
91
  | Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
92
  | Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
93
 
 
94
  ## Install
95
 
96
  ### Clone and install
@@ -120,12 +122,10 @@ language:
120
 
121
  ### Model download
122
 
123
- We strongly recommend that you download our pretrained `Fun-CosyVoice3-0.5B` model and `CosyVoice-ttsfrd` resource.
124
-
125
  ``` python
126
- from huggingface_hub import snapshot_download
127
  snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
128
- snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
129
  ```
130
 
131
  Optionally, you can unzip `ttsfrd` resource and install `ttsfrd` package for better text normalization performance.
@@ -141,11 +141,92 @@ pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
141
 
142
  ### Basic Usage
143
 
144
- We strongly recommend using `Fun-CosyVoice3-0.5B` for better performance.
145
- Follow the code in `example.py` for detailed usage of each model.
146
- ```sh
147
- python example.py
148
  ```
149
 
150
  ## Disclaimer
151
- The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
 
10
  - it
11
  - ru
12
  ---
13
+ ![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)
14
 
15
+ ## 👉🏻 CosyVoice 👈🏻
16
 
17
+ **Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [Huggingface](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
18
 
19
+ **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)
20
 
21
+ **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M)
22
 
23
  ## Highlight🔥
24
 
25
  **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
26
  ### Key Features
27
+ - **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), and supports both multilingual and cross-lingual zero-shot voice cloning.
28
  - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
29
  - **Pronunciation Inpainting**: Supports pronunciation inpainting with Chinese Pinyin and English CMU phonemes, giving finer control over pronunciation and making the model better suited to production use.
30
  - **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
 
36
 
37
  - [x] 2025/12
38
 
39
+ - [x] release Fun-CosyVoice3-0.5B-2512 base model, RL model, and their training/inference scripts
40
  - [x] release Fun-CosyVoice3-0.5B modelscope gradio space
41
 
42
  - [x] 2025/08
 
45
 
46
  - [x] 2025/07
47
 
48
+ - [x] release Fun-CosyVoice 3.0 eval set
49
 
50
  - [x] 2025/05
51
 
 
72
  - [x] Fastapi server and client
73
 
74
  ## Evaluation
75
+
76
+ | Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑ |
77
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
78
  | Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
79
  | Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
 
92
  | Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
93
  | Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
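
The table reports absolute scores; the gap between the base and RL checkpoints can be easier to read as a relative change. The following is a minimal sketch that uses only the two Fun-CosyVoice3-0.5B-2512 rows shown above. The dictionary layout and the function names are illustrative and are not part of the repository.

```python
# Scores copied from the two Fun-CosyVoice3-0.5B-2512 rows of the table above.
# Each tuple is (error rate %, speaker similarity %) for one test set.
BASE = {"test-zh": (1.21, 78.0), "test-en": (2.24, 71.8), "test-hard": (6.71, 75.8)}
RL = {"test-zh": (0.81, 77.4), "test-en": (1.68, 69.5), "test-hard": (5.44, 75.0)}


def relative_error_reduction(base: float, new: float) -> float:
    """Return the relative reduction in error rate, in percent."""
    return round((base - new) / base * 100, 1)


def summarize() -> dict:
    """Map each test set to (relative error reduction %, similarity change in points)."""
    summary = {}
    for name, (base_err, base_sim) in BASE.items():
        rl_err, rl_sim = RL[name]
        summary[name] = (
            relative_error_reduction(base_err, rl_err),
            round(rl_sim - base_sim, 1),
        )
    return summary


if __name__ == "__main__":
    for name, (err_drop, sim_delta) in summarize().items():
        print(f"{name}: error rate -{err_drop}% relative, similarity {sim_delta:+} points")
```

On these rows the RL checkpoint lowers the error rate on all three test sets, while speaker similarity drops slightly on each.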
94
 
95
+
96
  ## Install
97
 
98
  ### Clone and install
 
122
 
123
  ### Model download
124
 
 
 
125
  ``` python
126
+ from modelscope import snapshot_download
127
  snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
128
+ snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
129
  ```
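
If the download step is rerun often, it may be convenient to skip targets that are already present. The sketch below is one possible wrapper, not something the repository provides: `download_missing` is a hypothetical helper, and the downloader is passed in as an argument so that either `snapshot_download` shown in this diff can be used. It only checks that the target directory exists and is non-empty, so a partial download would still be treated as complete.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Repo ids and target directories copied from the snippet above.
MODELS: Dict[str, str] = {
    "FunAudioLLM/Fun-CosyVoice3-0.5B-2512": "pretrained_models/Fun-CosyVoice3-0.5B",
    "iic/CosyVoice-ttsfrd": "pretrained_models/CosyVoice-ttsfrd",
}


def download_missing(download: Callable[..., object],
                     models: Dict[str, str] = MODELS) -> List[str]:
    """Call download(repo_id, local_dir=...) for each missing or empty target directory.

    Returns the list of repo ids that were actually requested.
    """
    fetched = []
    for repo_id, local_dir in models.items():
        path = Path(local_dir)
        if path.is_dir() and any(path.iterdir()):
            continue  # Assumption: a non-empty directory means the model is already there.
        download(repo_id, local_dir=local_dir)
        fetched.append(repo_id)
    return fetched
```

A typical call would be `download_missing(snapshot_download)` after `from modelscope import snapshot_download`.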
130
 
131
  Optionally, you can unzip `ttsfrd` resource and install `ttsfrd` package for better text normalization performance.
 
141
 
142
  ### Basic Usage
143
 
144
+ ``` python
145
+ import sys
146
+ sys.path.append('third_party/Matcha-TTS')
147
+ from cosyvoice.cli.cosyvoice import AutoModel
148
+ import torchaudio
149
+
150
+ """ CosyVoice3 Usage, check https://funaudiollm.github.io/cosyvoice3/ for more details
151
+ """
152
+ cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
153
+ # en zero_shot usage
154
+ for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
155
+ './asset/zero_shot_prompt.wav', stream=False)):
156
+     torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
157
+ # zh zero_shot usage
158
+ for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
159
+ './asset/zero_shot_prompt.wav', stream=False)):
160
+     torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
161
+
162
+ # fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L280
163
+ for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
164
+ './asset/zero_shot_prompt.wav', stream=False)):
165
+     torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
166
+
167
+ # instruct usage, for supported control, check cosyvoice/utils/common.py#L28
168
+ for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
169
+ './asset/zero_shot_prompt.wav', stream=False)):
170
+     torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
171
+ for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
172
+ './asset/zero_shot_prompt.wav', stream=False)):
173
+     torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
174
+
175
+ # hotfix usage
176
+ for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
177
+ './asset/zero_shot_prompt.wav', stream=False)):
178
+     torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
179
+ ```
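
The usage example repeats the same prompt-string layout by hand. The sketch below factors that layout into two small helpers, inferred only from the literal strings in the example: zero-shot prompts place what appears to be the transcript of the prompt audio after the `<|endofprompt|>` marker, while instruct prompts place the instruction before it. The helper names are hypothetical and are not part of the `cosyvoice` package.

```python
END_OF_PROMPT = "<|endofprompt|>"
DEFAULT_SYSTEM = "You are a helpful assistant."


def zero_shot_prompt(prompt_transcript: str, system: str = DEFAULT_SYSTEM) -> str:
    """System text, then the marker, then the prompt transcript."""
    return f"{system}{END_OF_PROMPT}{prompt_transcript}"


def instruct_prompt(instruction: str, system: str = DEFAULT_SYSTEM) -> str:
    """System text, a space, the instruction, then the marker."""
    return f"{system} {instruction}{END_OF_PROMPT}"


if __name__ == "__main__":
    # Reproduces two prompt strings from the usage example above.
    print(zero_shot_prompt("希望你以后能够做的比我还好呦。"))
    print(instruct_prompt("请用广东话表达。"))
```

The resulting strings would be passed as the second argument of `inference_zero_shot` and `inference_instruct2`, in the same positions as the literals in the example.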
180
+
181
+ ## Discussion & Communication
182
+
183
+ You can discuss directly on [GitHub Issues](https://github.com/FunAudioLLM/CosyVoice/issues).
184
+
185
+ You can also scan the QR code to join our official Dingding chat group.
186
+
187
+ <img src="./asset/dingding.png" width="250px">
188
+
189
+ ## Acknowledgements
190
+
191
+ 1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
192
+ 2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
193
+ 3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
194
+ 4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
195
+ 5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).
196
+
197
+ ## Citations
198
+
199
+ ``` bibtex
200
+ @article{du2024cosyvoice,
201
+ title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
202
+ author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
203
+ journal={arXiv preprint arXiv:2407.05407},
204
+ year={2024}
205
+ }
206
+
207
+ @article{du2024cosyvoice2,
208
+ title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
209
+ author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
210
+ journal={arXiv preprint arXiv:2412.10117},
211
+ year={2024}
212
+ }
213
+
214
+ @article{du2025cosyvoice,
215
+ title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
216
+ author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
217
+ journal={arXiv preprint arXiv:2505.17589},
218
+ year={2025}
219
+ }
220
+
221
+ @inproceedings{lyu2025build,
222
+ title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
223
+ author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
224
+ booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
225
+ pages={1--2},
226
+ year={2025},
227
+ organization={IEEE}
228
+ }
229
  ```
230
 
231
  ## Disclaimer
232
+ The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.