Commit 793648b (verified) by rezanaltjetlink
1 Parent(s): 73584f6

Delete README.md

Files changed (1)
  1. README.md +0 -226
README.md DELETED
@@ -1,226 +0,0 @@
- ---
- language:
- - zh
- - en
- - ar
- - my
- - da
- - nl
- - fi
- - fr
- - de
- - el
- - he
- - hi
- - id
- - it
- - ja
- - km
- - ko
- - lo
- - ms
- - no
- - pl
- - pt
- - ru
- - es
- - sw
- - sv
- - tl
- - th
- - tr
- - vi
- license: apache-2.0
- library_name: voxcpm
- tags:
- - text-to-speech
- - tts
- - multilingual
- - voice-cloning
- - voice-design
- - diffusion
- - audio
- pipeline_tag: text-to-speech
- ---
-
- # VoxCPM2
-
- **VoxCPM2** is a tokenizer-free, diffusion autoregressive text-to-speech (TTS) model with **2B parameters**, **30 supported languages**, and **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.
-
- [![GitHub](https://img.shields.io/badge/GitHub-VoxCPM-blue?logo=github)](https://github.com/OpenBMB/VoxCPM)
- [![Docs](https://img.shields.io/badge/Docs-ReadTheDocs-8CA1AF)](https://voxcpm.readthedocs.io/en/latest/)
- [![Demo](https://img.shields.io/badge/Live%20Playground-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
- [![Audio Samples](https://img.shields.io/badge/Audio%20Samples-Demo%20Page-green)](https://openbmb.github.io/voxcpm2-demopage)
- [![Discord](https://img.shields.io/badge/Discord-VoxCPM-5865F2?logo=discord&logoColor=white)](https://discord.gg/KZUx7tVNwz)
- [![Lark](https://img.shields.io/badge/飞书群-VoxCPM-00D6B9?logo=lark&logoColor=white)](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=acds0b9d-23d8-4d7e-b696-d200f3e22a7f)
-
- ## Highlights
-
- - 🌍 **30-Language Multilingual**: No language tag is needed; pass text in any supported language directly.
- - 🎨 **Voice Design**: Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…), with no reference audio required.
- - 🎛️ **Controllable Cloning**: Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre.
- - 🎙️ **Ultimate Cloning**: Provide reference audio plus its transcript for audio-continuation cloning, which reproduces every vocal nuance faithfully.
- - 🔊 **48kHz Studio-Quality Output**: Accepts 16kHz reference audio and outputs 48kHz audio via AudioVAE V2's built-in super-resolution; no external upsampler is needed.
- - 🧠 **Context-Aware Synthesis**: Automatically infers appropriate prosody and expressiveness from the text content.
- - ⚡ **Real-Time Streaming**: Real-time factor (RTF) as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm).
- - 📜 **Fully Open-Source & Commercial-Ready**: Apache-2.0 license, free for commercial use.
-
- <details>
- <summary><b>Supported Languages (30)</b></summary>
-
- Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
-
- Chinese Dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Southern Min (Hokkien)
- </details>
-
- ## Quick Start
-
- ### Installation
-
- ```bash
- pip install voxcpm
- ```
-
- **Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)
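The version requirements above can be checked before installing anything heavy. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical and not part of `voxcpm`. It assumes PyTorch is installed under the distribution name `torch`, and it does not check the CUDA version.

```python
import re
import sys
from importlib import metadata


def parse_version(version):
    """Extract (major, minor, patch) from strings such as '2.5.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def meets_minimum(version, minimum):
    """Return True if `version` is at least the `minimum` (major, minor, patch) tuple."""
    return parse_version(version) >= minimum


def check_environment():
    """Report whether the Python and PyTorch minimums from the README are met."""
    report = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        # Assumption: PyTorch is installed as the distribution named "torch".
        report["torch"] = meets_minimum(metadata.version("torch"), (2, 5, 0))
    except metadata.PackageNotFoundError:
        report["torch"] = False
    return report


if __name__ == "__main__":
    print(check_environment())
```

A `False` entry only says the stated minimum was not detected; the linked Quick Start remains the reference for supported setups.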
-
- ### Text-to-Speech
-
- ```python
- from voxcpm import VoxCPM
- import soundfile as sf
-
- # Load the pretrained checkpoint without the optional denoiser.
- model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
-
- wav = model.generate(
-     text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
-     cfg_value=2.0,
-     inference_timesteps=10,
- )
- # Save the waveform at the model's own output sample rate.
- sf.write("output.wav", wav, model.tts_model.sample_rate)
- ```
-
- ### Voice Design
-
- Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:
-
- ```python
- wav = model.generate(
-     text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
-     cfg_value=2.0,
-     inference_timesteps=10,
- )
- sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
- ```
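The parenthesized-prefix convention is easy to get wrong when descriptions come from user input. Here is a small sketch of a helper that assembles the `text` argument; `build_design_prompt` is a hypothetical name, not a `voxcpm` function. Rejecting parentheses inside the description is a conservative assumption, since the README does not say how nested parentheses are handled.

```python
def build_design_prompt(description, content):
    """Return '(description)content', the format shown for Voice Design.

    An empty description yields plain text-to-speech input.
    """
    description = description.strip()
    if not description:
        return content
    # Assumption: nested parentheses could confuse the description boundary,
    # so they are rejected rather than escaped.
    if "(" in description or ")" in description:
        raise ValueError("description must not contain parentheses")
    return f"({description}){content}"


prompt = build_design_prompt(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(prompt)
```

The resulting string can be passed as `text=prompt` to `model.generate` exactly as in the example above.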
-
- ### Controllable Voice Cloning
-
- ```python
- # Basic cloning
- wav = model.generate(
-     text="This is a cloned voice generated by VoxCPM2.",
-     reference_wav_path="speaker.wav",
- )
- sf.write("clone.wav", wav, model.tts_model.sample_rate)
-
- # Cloning with style control
- wav = model.generate(
-     text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
-     reference_wav_path="speaker.wav",
-     cfg_value=2.0,
-     inference_timesteps=10,
- )
- sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
- ```
-
- ### Ultimate Cloning
-
- Provide both the reference audio and its exact transcript for maximum fidelity. For the highest similarity, pass the same clip to both `reference_wav_path` and `prompt_wav_path`:
-
- ```python
- wav = model.generate(
-     text="This is an ultimate cloning demonstration using VoxCPM2.",
-     prompt_wav_path="speaker_reference.wav",
-     prompt_text="The transcript of the reference audio.",
-     reference_wav_path="speaker_reference.wav",
- )
- sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
- ```
-
- ### Streaming
-
- ```python
- import numpy as np
-
- # Collect audio chunks as they are generated, then join them into one waveform.
- chunks = []
- for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
-     chunks.append(chunk)
- wav = np.concatenate(chunks)
- sf.write("streaming.wav", wav, model.tts_model.sample_rate)
- ```
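The streaming example above buffers every chunk before writing. As an alternative, here is a sketch that writes each chunk to disk as it arrives, using only the standard-library `wave` module. `write_stream_to_wav` is a hypothetical helper; it assumes each chunk is a one-dimensional mono sequence of floats in the range [-1, 1], which the README does not state explicitly.

```python
import struct
import wave


def write_stream_to_wav(chunks, path, sample_rate):
    """Write mono float chunks to a 16-bit PCM WAV file incrementally.

    Returns the total number of frames written.
    """
    frames_written = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumption)
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1] and scale to the signed 16-bit range.
            pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            frames_written += len(pcm)
    return frames_written


# Hypothetical usage with the model loaded earlier:
# write_stream_to_wav(
#     model.generate_streaming(text="Streaming is easy with VoxCPM!"),
#     "streaming.wav",
#     model.tts_model.sample_rate,
# )
```

The per-sample Python loop is slow for long clips at 48kHz; it trades speed for having no dependencies, and a vectorized NumPy conversion would be the natural replacement.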
-
- ## Model Details
-
- | Property | Value |
- |---|---|
- | Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
- | Backbone | Based on MiniCPM-4, 2B parameters in total |
- | Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
- | Training Data | 2M+ hours multilingual speech |
- | LM Token Rate | 6.25 Hz |
- | Max Sequence Length | 8192 tokens |
- | dtype | bfloat16 |
- | VRAM | ~8 GB |
- | RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
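To make the table's numbers concrete, the sketch below turns them into back-of-the-envelope estimates. It assumes the usual definition of real-time factor (synthesis time divided by audio duration) and that the 6.25 Hz token rate means 6.25 LM steps per second of audio. The audio-length bound further assumes the whole 8192-token budget is spent on audio, so it is an upper limit and not a documented figure. All function names are hypothetical.

```python
LM_TOKEN_RATE_HZ = 6.25      # "LM Token Rate" from the table
MAX_SEQUENCE_TOKENS = 8192   # "Max Sequence Length" from the table


def synthesis_seconds(audio_seconds, rtf):
    """Estimated wall-clock time to synthesize `audio_seconds` of audio."""
    return audio_seconds * rtf


def lm_tokens_for_audio(audio_seconds, token_rate_hz=LM_TOKEN_RATE_HZ):
    """Estimated number of LM steps needed for `audio_seconds` of audio."""
    return audio_seconds * token_rate_hz


def max_audio_seconds(max_tokens=MAX_SEQUENCE_TOKENS, token_rate_hz=LM_TOKEN_RATE_HZ):
    """Upper bound on audio length if every token in the budget were audio."""
    return max_tokens / token_rate_hz


if __name__ == "__main__":
    print(synthesis_seconds(10, 0.30))   # standard inference, 10 s clip
    print(synthesis_seconds(10, 0.13))   # Nano-vLLM, 10 s clip
    print(lm_tokens_for_audio(60))       # LM steps for one minute of audio
    print(max_audio_seconds())           # seconds, ignoring text and prompt tokens
```

Under these assumptions a 10-second clip takes roughly 3 seconds at RTF 0.30 and roughly 1.3 seconds at RTF 0.13, and the sequence budget corresponds to a little under 22 minutes of audio at most.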
-
- ## Performance
-
- VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
-
- See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).
-
- ## Fine-tuning
-
- VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:
-
- ```bash
- # LoRA fine-tuning (recommended)
- python scripts/train_voxcpm_finetune.py \
-     --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml
-
- # Full fine-tuning
- python scripts/train_voxcpm_finetune.py \
-     --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
- ```
-
- See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.
-
- ## Limitations
-
- - Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
- - Performance varies across languages depending on training data availability.
- - Occasional instability may occur with very long or highly expressive inputs.
- - Use for impersonation, fraud, or disinformation is **strictly forbidden**. AI-generated content should be clearly labeled.
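Because Voice Design and Style Control outputs vary between runs, one practical pattern is to generate a few candidates and pick the best by ear. The sketch below wraps any generation callable; `generate_candidates` is a hypothetical helper, and it assumes repeated calls with identical arguments produce different audio, as the first limitation implies.

```python
def generate_candidates(generate_fn, text, attempts=3, **kwargs):
    """Call `generate_fn` several times with the same arguments.

    Returns the outputs in call order so they can be saved and compared.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return [generate_fn(text=text, **kwargs) for _ in range(attempts)]


# Hypothetical usage with the model and soundfile import from Quick Start:
# candidates = generate_candidates(
#     model.generate,
#     "(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
#     cfg_value=2.0,
#     inference_timesteps=10,
# )
# for i, wav in enumerate(candidates):
#     sf.write(f"candidate_{i}.wav", wav, model.tts_model.sample_rate)
```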
-
- ## Citation
-
- ```bibtex
- @article{voxcpm2_2026,
-   title = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
-   author = {VoxCPM Team},
-   journal = {GitHub},
-   year = {2026},
- }
-
- @article{voxcpm2025,
-   title = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
-   author = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
-             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
-             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
-   journal = {arXiv preprint arXiv:2509.24650},
-   year = {2025},
- }
- ```
-
- ## License
-
- Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.