---
language:
- zh
- en
- ar
- my
- da
- nl
- fi
- fr
- de
- el
- he
- hi
- id
- it
- ja
- km
- ko
- lo
- ms
- no
- pl
- pt
- ru
- es
- sw
- sv
- tl
- th
- tr
- vi
license: apache-2.0
library_name: voxcpm2
tags:
- text-to-speech
- tts
- multilingual
- voice-cloning
- voice-design
- diffusion
- audio
pipeline_tag: text-to-speech
---

# VoxCPM2

**VoxCPM2** is a tokenizer-free, diffusion autoregressive text-to-speech model: **2B parameters**, **30 languages**, **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.

[![GitHub](https://img.shields.io/badge/GitHub-VoxCPM-blue?logo=github)](https://github.com/OpenBMB/VoxCPM)
[![Docs](https://img.shields.io/badge/Docs-ReadTheDocs-8CA1AF)](https://voxcpm.readthedocs.io/en/latest/)
[![Demo](https://img.shields.io/badge/Live%20Playground-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
[![Audio Samples](https://img.shields.io/badge/Audio%20Samples-Demo%20Page-green)](https://openbmb.github.io/voxcpm2-demopage)
[![Discord](https://img.shields.io/badge/Discord-VoxCPM-5865F2?logo=discord&logoColor=white)](https://discord.gg/KZUx7tVNwz)

## Highlights

- 🌍 **30-Language Multilingual**: No language tag needed; input text in any supported language directly
- 🎨 **Voice Design**: Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace, ...); no reference audio required
- 🎛️ **Controllable Cloning**: Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
- 🎙️ **Ultimate Cloning**: Provide reference audio plus its transcript for audio-continuation cloning; every vocal nuance is faithfully reproduced
- 🔊 **48kHz Studio-Quality Output**: Accepts a 16kHz reference; outputs 48kHz via AudioVAE V2's built-in super-resolution, with no external upsampler needed
- 🧠 **Context-Aware Synthesis**: Automatically infers appropriate prosody and expressiveness from the text content
- ⚡ **Real-Time Streaming**: RTF ~0.13 on an RTX 4090 with [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
- 📜 **Fully Open-Source and Commercial-Ready**: Apache-2.0 license, free for commercial use

<details>
<summary><b>Supported Languages (30)</b></summary>

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan, Shaanxi, Shandong, Tianjin, Hokkien
</details>

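The frontmatter and the list above name the same 30 languages. For programmatic input validation it can help to keep them as a lookup table; the mapping below is assembled from those lists, and `is_supported` is an illustrative helper, not part of the `voxcpm` API:

```python
# ISO 639-1 code -> language name, assembled from the lists in this card.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "my": "Burmese", "zh": "Chinese", "da": "Danish",
    "nl": "Dutch", "en": "English", "fi": "Finnish", "fr": "French",
    "de": "German", "el": "Greek", "he": "Hebrew", "hi": "Hindi",
    "id": "Indonesian", "it": "Italian", "ja": "Japanese", "km": "Khmer",
    "ko": "Korean", "lo": "Lao", "ms": "Malay", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "sw": "Swahili", "sv": "Swedish", "tl": "Tagalog", "th": "Thai",
    "tr": "Turkish", "vi": "Vietnamese",
}

def is_supported(code: str) -> bool:
    """Return True if the ISO 639-1 code is one of the 30 supported languages."""
    return code.lower() in SUPPORTED_LANGUAGES
```

Note that no language tag is ever passed to the model itself; a table like this is only useful for validating input on the application side.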
## Quick Start

### Installation

```bash
pip install voxcpm
```

**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)

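Before installing, it can be worth confirming the environment meets the minimums above. The snippet below is a hedged sketch, not part of `voxcpm`; it only parses dotted version strings so they can be compared as tuples:

```python
import sys

# Minimum versions stated in the requirements line above.
MIN_PYTHON = (3, 10)
MIN_TORCH = (2, 5, 0)

def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '2.5.1+cu121' into a comparable tuple."""
    core = version.split("+")[0]  # drop local build tags like '+cu121'
    return tuple(int(part) for part in core.split(".") if part.isdigit())

def python_ok() -> bool:
    """Check the running interpreter against the Python >= 3.10 requirement."""
    return sys.version_info[:2] >= MIN_PYTHON

# Usage sketch (assumes torch is installed):
# import torch
# assert parse_version(torch.__version__) >= MIN_TORCH
```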
### Text-to-Speech

```python
from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
```

### Voice Design

Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:

```python
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```

### Controllable Voice Cloning

```python
# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```

### Ultimate Cloning

Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for the highest similarity:

```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```

### Streaming

```python
import numpy as np
import soundfile as sf

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```

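The RTF numbers quoted in this card follow the usual definition: wall-clock synthesis time divided by the duration of the generated audio, so RTF < 1 means faster than real time. A small helper for measuring it around any of the snippets above (illustrative, not part of the `voxcpm` API):

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# Usage sketch around a generate() call:
# import time
# t0 = time.perf_counter()
# wav = model.generate(text="...")
# rtf = real_time_factor(time.perf_counter() - t0, len(wav), model.tts_model.sample_rate)
```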
## Model Details

| Property | Value |
|---|---|
| Architecture | Tokenizer-free diffusion autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4; 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours of multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |

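From the table, the token rate and maximum sequence length bound how much audio a single sequence can represent: at 6.25 Hz each LM token covers 160 ms of audio, so 8192 tokens correspond to roughly 21.8 minutes (a theoretical upper bound on one sequence). The arithmetic:

```python
TOKEN_RATE_HZ = 6.25   # LM token rate from the table (tokens per second of audio)
MAX_TOKENS = 8192      # maximum sequence length from the table

seconds_per_token = 1.0 / TOKEN_RATE_HZ          # 0.16 s, i.e. 160 ms per token
max_audio_seconds = MAX_TOKENS / TOKEN_RATE_HZ   # 1310.72 s
max_audio_minutes = max_audio_seconds / 60.0     # about 21.8 minutes
```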
## Performance

VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.

See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).

## Fine-tuning

VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:

```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```

See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.

## Limitations

- Voice Design and Style Control results may vary between runs; generating 1–3 candidates and picking the best is recommended.
- Performance varies across languages depending on training-data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- Use of this model for impersonation, fraud, or disinformation is **strictly forbidden**; AI-generated content should be clearly labeled.

## Citation

```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```

## License

Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.