FunAudioLLM/Fun-CosyVoice3-0.5B-2512

#5, opened by markan5500
README.md CHANGED
@@ -9,25 +9,22 @@ language:
9
  - ko
10
  - it
11
  - ru
12
- - de
13
- pipeline_tag: text-to-speech
14
  ---
 
15
 
16
- ![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)
17
 
18
- ## 👉🏻 CosyVoice 👈🏻
19
 
20
- **Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [Huggingface](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
21
 
22
- **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)
23
-
24
- **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M)
25
 
26
  ## Highlight🔥
27
 
28
  **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
29
  ### Key Features
30
- - **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.) and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
31
  - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
32
  - **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and thus suitable for production use.
33
  - **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
@@ -39,7 +36,7 @@ pipeline_tag: text-to-speech
39
 
40
  - [x] 2025/12
41
 
42
- - [x] release Fun-CosyVoice3-0.5B-2512 base model, rl model and its training/inference script
43
  - [x] release Fun-CosyVoice3-0.5B modelscope gradio space
44
 
45
  - [x] 2025/08
@@ -48,7 +45,7 @@ pipeline_tag: text-to-speech
48
 
49
  - [x] 2025/07
50
 
51
- - [x] release Fun-CosyVoice 3.0 eval set
52
 
53
  - [x] 2025/05
54
 
@@ -75,8 +72,7 @@ pipeline_tag: text-to-speech
75
  - [x] Fastapi server and client
76
 
77
  ## Evaluation
78
-
79
- | Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑ |
80
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
81
  | Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
82
  | Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
@@ -95,7 +91,6 @@ pipeline_tag: text-to-speech
95
  | Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
96
  | Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
97
 
98
-
99
  ## Install
100
 
101
  ### Clone and install
@@ -125,6 +120,8 @@ pipeline_tag: text-to-speech
125
 
126
  ### Model download
127
 
 
 
128
  ``` python
129
  from huggingface_hub import snapshot_download
130
  snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
@@ -144,91 +141,10 @@ pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
144
 
145
  ### Basic Usage
146
 
147
- ``` python
148
- import sys
149
- sys.path.append('third_party/Matcha-TTS')
150
- from cosyvoice.cli.cosyvoice import AutoModel
151
- import torchaudio
152
-
153
- """ CosyVoice3 Usage, check https://funaudiollm.github.io/cosyvoice3/ for more details
154
- """
155
- cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
156
- # en zero_shot usage
157
- for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
158
- './asset/zero_shot_prompt.wav', stream=False)):
159
-     torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
160
- # zh zero_shot usage
161
- for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
162
- './asset/zero_shot_prompt.wav', stream=False)):
163
-     torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
164
-
165
- # fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L280
166
- for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
167
- './asset/zero_shot_prompt.wav', stream=False)):
168
-     torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
169
-
170
- # instruct usage, for supported control, check cosyvoice/utils/common.py#L28
171
- for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
172
- './asset/zero_shot_prompt.wav', stream=False)):
173
-     torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
174
- for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
175
- './asset/zero_shot_prompt.wav', stream=False)):
176
-     torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
177
-
178
- # hotfix usage
179
- for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
180
- './asset/zero_shot_prompt.wav', stream=False)):
181
-     torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
182
- ```
183
-
184
- ## Discussion & Communication
185
-
186
- You can directly discuss on [Github Issues](https://github.com/FunAudioLLM/CosyVoice/issues).
187
-
188
- You can also scan the QR code to join our official Dingding chat group.
189
-
190
- <img src="./asset/dingding.png" width="250px">
191
-
192
- ## Acknowledge
193
-
194
- 1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
195
- 2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
196
- 3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
197
- 4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
198
- 5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).
199
-
200
- ## Citations
201
-
202
- ``` bibtex
203
- @article{du2024cosyvoice,
204
- title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
205
- author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
206
- journal={arXiv preprint arXiv:2407.05407},
207
- year={2024}
208
- }
209
-
210
- @article{du2024cosyvoice,
211
- title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
212
- author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
213
- journal={arXiv preprint arXiv:2412.10117},
214
- year={2024}
215
- }
216
-
217
- @article{du2025cosyvoice,
218
- title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
219
- author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
220
- journal={arXiv preprint arXiv:2505.17589},
221
- year={2025}
222
- }
223
-
224
- @inproceedings{lyu2025build,
225
- title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
226
- author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
227
- booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
228
- pages={1--2},
229
- year={2025},
230
- organization={IEEE}
231
- }
232
  ```
233
 
234
  ## Disclaimer
 
9
  - ko
10
  - it
11
  - ru
 
 
12
  ---
13
+ [![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)](https://github.com/Akshay090/svg-banners)
14
 
15
+ ## 👉🏻 [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) 👈🏻
16
 
17
+ **Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/studios/FunAudioLLM/Fun-CosyVoice3-0.5B); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
18
 
19
+ **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/spaces/FunAudioLLM/CosyVoice2-0.5B)
20
 
21
+ **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice-300M)
 
 
22
 
23
  ## Highlight🔥
24
 
25
  **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
26
  ### Key Features
27
+ - **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
28
  - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
29
  - **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and thus suitable for production use.
30
  - **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
 
36
 
37
  - [x] 2025/12
38
 
39
+ - [x] release Fun-CosyVoice3-0.5B-2512 base model and its training/inference script
40
  - [x] release Fun-CosyVoice3-0.5B modelscope gradio space
41
 
42
  - [x] 2025/08
 
45
 
46
  - [x] 2025/07
47
 
48
+ - [x] release CosyVoice 3.0 eval set
49
 
50
  - [x] 2025/05
51
 
 
72
  - [x] Fastapi server and client
73
 
74
  ## Evaluation
75
+ | Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑|
 
76
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
77
  | Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
78
  | Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
 
91
  | Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
92
  | Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
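
To put the last two rows in perspective, the relative error reduction of the RL model over the base model can be worked out directly from the table. The sketch below is illustrative only: the helper name `relative_reduction` is made up for this note, and the figures are copied from the two Fun-CosyVoice3-0.5B-2512 rows above.

```python
# Error rates (%) copied from the evaluation table; lower is better.
BASE = {"test-zh CER": 1.21, "test-en WER": 2.24, "test-hard CER": 6.71}
RL = {"test-zh CER": 0.81, "test-en WER": 1.68, "test-hard CER": 5.44}


def relative_reduction(base: float, new: float) -> float:
    """Return the relative reduction from base to new, in percent."""
    return (base - new) / base * 100.0


# Hypothetical helper applied to each test set in turn.
reductions = {name: relative_reduction(BASE[name], RL[name]) for name in BASE}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative reduction")
```

By this measure the RL model lowers all three error rates by roughly a fifth to a third, while the same rows show it scoring slightly lower on speaker similarity in each test set.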
93
 
 
94
  ## Install
95
 
96
  ### Clone and install
 
120
 
121
  ### Model download
122
 
123
+ We strongly recommend that you download our pretrained `Fun-CosyVoice3-0.5B` model and `CosyVoice-ttsfrd` resource.
124
+
125
  ``` python
126
  from huggingface_hub import snapshot_download
127
  snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
 
141
 
142
  ### Basic Usage
143
 
144
+ We strongly recommend using `Fun-CosyVoice3-0.5B` for better performance.
145
+ Follow the code in `example.py` for detailed usage of each model.
146
+ ```sh
147
+ python example.py
148
  ```
149
 
150
  ## Disclaimer
config.json DELETED
@@ -1 +0,0 @@
1
- {}
 
 
flow.decoder.estimator.fp32.onnx DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:9b51b9533a55937762b262bf2cf9c6220ce40760f76d6532cb16a6a6d84059a8
3
- size 1326216933
 
 
 
 
speech_tokenizer_v3.batch.onnx DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b156b8a7bbff436585e153f4637b9a368009005ac66efa108a6c8bfb34e5ee43
3
- size 969451579
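
The two deleted `.onnx` entries are Git LFS pointer files rather than the binaries themselves: each holds a spec `version`, a `sha256` object id, and a `size` in bytes. As a rough illustration, the sketch below parses such a pointer and reports the size. The function name `parse_lfs_pointer` is hypothetical, and the pointer text is copied from the `speech_tokenizer_v3.batch.onnx` entry above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as "<hash-method>:<hex-digest>".
    method, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_method": method,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b156b8a7bbff436585e153f4637b9a368009005ac66efa108a6c8bfb34e5ee43
size 969451579
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_method"], round(pointer["size_bytes"] / 2**20, 1), "MiB")
```

Read this way, the PR removes roughly 2.3 GB of ONNX weights from the repository (1326216933 + 969451579 bytes), which is worth confirming before merging because the page header still lists ONNX among the model's formats.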