Translsis committed on
Commit
1c4ed9c
·
verified ·
1 Parent(s): 734e0c0

Upload files

.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ resource.zip filter=lfs diff=lfs merge=lfs -text
+ ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl filter=lfs diff=lfs merge=lfs -text
+ ttsfrd-0.4.2-cp38-cp38-linux_x86_64.whl filter=lfs diff=lfs merge=lfs -text
+ ttsfrd_dependency-0.1-py3-none-any.whl filter=lfs diff=lfs merge=lfs -text
+ asset/dingding.png filter=lfs diff=lfs merge=lfs -text
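The `.gitattributes` hunk above routes five specific paths through Git LFS. As a rough illustration of which paths those rules cover, the sketch below matches candidate paths against the added patterns with Python's `fnmatch`. This is only an approximation: Git applies its own pattern rules (for example for `**` and for patterns without a slash), and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# Patterns added by this commit, copied from the .gitattributes hunk.
LFS_PATTERNS = [
    "resource.zip",
    "ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl",
    "ttsfrd-0.4.2-cp38-cp38-linux_x86_64.whl",
    "ttsfrd_dependency-0.1-py3-none-any.whl",
    "asset/dingding.png",
]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate check (hypothetical helper): does `path` match a pattern?"""
    return any(fnmatch(path, pattern) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ["resource.zip", "asset/dingding.png", "README.md"]:
        print(candidate, is_lfs_tracked(candidate))
```

Note that `resource.zip` also falls under the `*.zip` rule already present in the context lines, so its explicit entry is redundant but harmless.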
README.md ADDED
@@ -0,0 +1,194 @@
+ ---
+ license: apache-2.0
+ language:
+ - zh
+ - en
+ - fr
+ - es
+ - ja
+ - ko
+ - it
+ - ru
+ - de
+ ---
+
+ ![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)
+
+ ## 👉🏻 CosyVoice 👈🏻
+
+ **Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [HuggingFace](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
+
+ **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)
+
+ **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M)
+
+ ## Highlight🔥
+
+ **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLMs). It surpasses its predecessor, CosyVoice 2.0, in content consistency, speaker similarity, and prosody naturalness, and is designed for zero-shot multilingual speech synthesis in the wild.
+ ### Key Features
+ - **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), and supports both multilingual and cross-lingual zero-shot voice cloning.
+ - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
+ - **Pronunciation Inpainting**: Supports pronunciation inpainting with Chinese Pinyin and English CMU phonemes, which gives finer control and makes the model better suited to production use.
+ - **Text Normalization**: Reads numbers, special symbols, and various text formats without a traditional frontend module.
+ - **Bi-Streaming**: Supports both text-in streaming and audio-out streaming, with latency as low as 150 ms while maintaining high-quality audio output.
+ - **Instruct Support**: Supports instructions covering language, dialect, emotion, speed, volume, etc.
+
+
+ ## Roadmap
+
+ - [x] 2025/12
+
+     - [x] Release the Fun-CosyVoice3-0.5B-2512 base model, RL model, and their training/inference scripts
+     - [x] Release the Fun-CosyVoice3-0.5B ModelScope Gradio space
+
+ - [x] 2025/08
+
+     - [x] Thanks to the contribution from NVIDIA's Yuekai Zhang, add Triton TensorRT-LLM runtime support and CosyVoice2 GRPO training support
+
+ - [x] 2025/07
+
+     - [x] Release the Fun-CosyVoice 3.0 eval set
+
+ - [x] 2025/05
+
+     - [x] Add CosyVoice2-0.5B vLLM support
+
+ - [x] 2024/12
+
+     - [x] 25 Hz CosyVoice2-0.5B released
+
+ - [x] 2024/09
+
+     - [x] 25 Hz CosyVoice-300M base model
+     - [x] 25 Hz CosyVoice-300M voice conversion function
+
+ - [x] 2024/08
+
+     - [x] Repetition Aware Sampling (RAS) inference for LLM stability
+     - [x] Streaming inference mode support, including KV cache and SDPA for RTF optimization
+
+ - [x] 2024/07
+
+     - [x] Flow matching training support
+     - [x] WeTextProcessing support when ttsfrd is not available
+     - [x] FastAPI server and client
+
+ ## Evaluation
+
+ | Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑ |
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
+ | Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
+ | MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
+ | F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
+ | Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
+ | CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
+ | FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
+ | Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
+ | VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
+ | VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
+ | HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
+ | VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
+ | GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
+ | GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
+ | Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
+ | Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
+
+
+ ## Install
+
+ ### Clone and install
+
+ - Clone the repo
+ ``` sh
+ git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
+ # If cloning the submodules fails because of network errors, rerun the following commands until they succeed
+ cd CosyVoice
+ git submodule update --init --recursive
+ ```
+
+ - Install Conda: see https://docs.conda.io/en/latest/miniconda.html
+ - Create a Conda env:
+
+ ``` sh
+ conda create -n cosyvoice -y python=3.10
+ conda activate cosyvoice
+ pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
+
+ # If you encounter sox compatibility issues
+ # ubuntu
+ sudo apt-get install sox libsox-dev
+ # centos
+ sudo yum install sox sox-devel
+ ```
+
+ ### Model download
+
+ ``` python
+ # Download the ttsfrd resources from the Hugging Face Hub into a local directory
+ from huggingface_hub import snapshot_download
+ snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
+ ```
+
+ Optionally, unzip the `ttsfrd` resource and install the `ttsfrd` package for better text normalization.
+
+ This step is not required. If the `ttsfrd` package is not installed, wetext is used by default.
+
+ ``` sh
+ cd pretrained_models/CosyVoice-ttsfrd/
+ unzip resource.zip -d .
+ pip install ttsfrd_dependency-0.1-py3-none-any.whl
+ # The cp310 wheel matches the Python 3.10 environment created above
+ pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
+ ```
+
+ ## Discussion & Communication
+
+ You can discuss directly on [GitHub Issues](https://github.com/FunAudioLLM/CosyVoice/issues).
+
+ You can also scan the QR code to join our official Dingding chat group.
+
+ <img src="./asset/dingding.png" width="250px">
+
+ ## Acknowledgements
+
+ 1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
+ 2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
+ 3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
+ 4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
+ 5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).
+
+ ## Citations
+
+ ``` bibtex
+ @article{du2024cosyvoice,
+   title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
+   author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
+   journal={arXiv preprint arXiv:2407.05407},
+   year={2024}
+ }
+
+ @article{du2024cosyvoice2,
+   title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
+   author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
+   journal={arXiv preprint arXiv:2412.10117},
+   year={2024}
+ }
+
+ @article{du2025cosyvoice,
+   title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
+   author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
+   journal={arXiv preprint arXiv:2505.17589},
+   year={2025}
+ }
+
+ @inproceedings{lyu2025build,
+   title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
+   author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
+   booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+   pages={1--2},
+   year={2025},
+   organization={IEEE}
+ }
+ ```
+
+ ## Disclaimer
+ The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
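The Evaluation table in the README reports CER and WER percentages but does not say how they are computed. The sketch below shows the standard definition only: edit distance between reference and hypothesis, divided by the reference length. It is an assumption that the reported numbers follow this definition; a real evaluation would also transcribe the synthesized audio with an ASR model and normalize the text first, which is not shown here. All function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance over reference length, as a percentage."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are individual characters."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    print(wer("the cat sat down", "the cat sit down"))
    print(cer("abcd", "abxd"))
```

CER is the natural choice for Chinese, where words are not separated by spaces, which is consistent with the table using CER for test-zh and WER for test-en.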
asset/dingding.png ADDED

Git LFS Details

  • SHA256: 7f04815e2e676d31b089af6fa270135f3214f2193d5e0ad98b491d007d48f1c6
  • Pointer size: 131 Bytes
  • Size of remote file: 123 kB
configuration.json ADDED
@@ -0,0 +1 @@
+ {"framework":"Pytorch","task":"text-to-speech"}
resource.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dcb3970fd4f52d036f245493360d97d0da1014f917deb4b9d83a3ded97483113
+ size 338914555
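Each LFS-tracked file in this commit is stored in the repository as a three-line pointer like the one above: a spec version, a SHA-256 object ID, and the size in bytes. The sketch below parses such a pointer and checks downloaded bytes against it. The function names are hypothetical, and the parser assumes the simple `key value` layout seen in this diff rather than covering the full pointer specification.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check raw file bytes against the pointer's size and SHA-256 digest."""
    if pointer["hash_algo"] != "sha256":
        raise ValueError("unsupported hash algorithm")
    return (len(data) == pointer["size"]
            and hashlib.sha256(data).hexdigest() == pointer["digest"])


# Pointer content for resource.zip, copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:dcb3970fd4f52d036f245493360d97d0da1014f917deb4b9d83a3ded97483113
size 338914555
"""

if __name__ == "__main__":
    print(parse_lfs_pointer(POINTER))
```

For a file as large as `resource.zip` (about 339 MB), hashing in chunks from disk would be kinder to memory than loading the whole file as `data`.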
ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:149514a05c2a7a01076170d2923efd231fc579f8e48403ab523814c8256efe14
+ size 4044429
ttsfrd-0.4.2-cp38-cp38-linux_x86_64.whl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:52dfa13538984a35664185b36809c64800ead605d0371b288816c9782cfdd1ec
+ size 4044404
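The commit ships two `ttsfrd` wheels, one tagged `cp310` and one tagged `cp38`, both for `linux_x86_64`, while the README's install command hard-codes the `cp310` file. A minimal sketch of picking the wheel that matches the running interpreter could look like the following. The helper `pick_ttsfrd_wheel` is hypothetical, and the mapping is inferred from the wheel filename tags only.

```python
import platform
import sys

# Wheel files added in this commit, keyed by (major, minor) CPython version.
TTSFRD_WHEELS = {
    (3, 8): "ttsfrd-0.4.2-cp38-cp38-linux_x86_64.whl",
    (3, 10): "ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl",
}


def pick_ttsfrd_wheel(version=None, system=None, machine=None):
    """Return the matching wheel filename, or None if no wheel fits.

    Arguments default to the running interpreter and host; they are
    parameters only so the choice can be checked for other setups.
    """
    if version is None:
        version = sys.version_info[:2]
    system = system or platform.system()
    machine = machine or platform.machine()
    # Both wheels carry the linux_x86_64 platform tag.
    if system != "Linux" or machine != "x86_64":
        return None
    return TTSFRD_WHEELS.get(tuple(version))


if __name__ == "__main__":
    wheel = pick_ttsfrd_wheel()
    print(wheel or "no ttsfrd wheel for this setup; wetext will be used")
```

When no wheel fits, nothing else is needed: per the README, wetext is used by default if `ttsfrd` is not installed.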
ttsfrd_dependency-0.1-py3-none-any.whl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:060a53f0650d12839983afdcfb052b049d7cf5c62344a00fee3a7344582aaf6f
+ size 1144992