---
language:
- zh
- en
- ar
- my
- da
- nl
- fi
- fr
- de
- el
- he
- hi
- id
- it
- ja
- km
- ko
- lo
- ms
- no
- pl
- pt
- ru
- es
- sw
- sv
- tl
- th
- tr
- vi
license: apache-2.0
library_name: voxcpm
tags:
- text-to-speech
- tts
- multilingual
- voice-cloning
- voice-design
- diffusion
- audio
pipeline_tag: text-to-speech
---
# VoxCPM2
**VoxCPM2** is a tokenizer-free, diffusion-autoregressive text-to-speech model: **2B parameters**, **30 languages**, **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.
[GitHub](https://github.com/OpenBMB/VoxCPM) · [Documentation](https://voxcpm.readthedocs.io/en/latest/) · [Hugging Face Demo](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) · [Demo Page](https://openbmb.github.io/voxcpm2-demopage) · [Discord](https://discord.gg/KZUx7tVNwz)
## Highlights
- **30-Language Multilingual** – No language tag needed; input text in any supported language directly
- **Voice Design** – Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace, and more); no reference audio required
- **Controllable Cloning** – Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
- **Ultimate Cloning** – Provide reference audio plus its transcript for audio-continuation cloning; every vocal nuance is faithfully reproduced
- **48kHz Studio-Quality Output** – Accepts 16kHz reference audio; outputs 48kHz via AudioVAE V2's built-in super-resolution, no external upsampler needed
- **Context-Aware Synthesis** – Automatically infers appropriate prosody and expressiveness from the text content
- **Real-Time Streaming** – RTF as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
- **Fully Open-Source & Commercial-Ready** – Apache-2.0 license, free for commercial use
<details>
<summary><b>Supported Languages (30)</b></summary>
Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan, Shaanxi, Shandong, Tianjin, and Southern Min
</details>
## Quick Start
### Installation
```bash
pip install voxcpm
```
**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)
### Text-to-Speech
```python
from voxcpm import VoxCPM
import soundfile as sf
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
wav = model.generate(
text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
cfg_value=2.0,
inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
```
### Voice Design
Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:
```python
wav = model.generate(
text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
cfg_value=2.0,
inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```
### Controllable Voice Cloning
```python
# Basic cloning
wav = model.generate(
text="This is a cloned voice generated by VoxCPM2.",
reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)
# Cloning with style control
wav = model.generate(
text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
reference_wav_path="speaker.wav",
cfg_value=2.0,
inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```
### Ultimate Cloning
Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for highest similarity:
```python
wav = model.generate(
text="This is an ultimate cloning demonstration using VoxCPM2.",
prompt_wav_path="speaker_reference.wav",
prompt_text="The transcript of the reference audio.",
reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```
### Streaming
```python
import numpy as np
chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```
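For real-time playback, an audio device usually expects fixed-size frames, while a streaming generator may yield chunks of varying length. A minimal rechunking sketch (pure NumPy, independent of the model API):

```python
import numpy as np

def rechunk(chunks, frame_size):
    """Regroup variable-length audio chunks into fixed-size frames.

    Yields frames of exactly frame_size samples; the final frame may be
    shorter if the total length is not a multiple of frame_size.
    """
    buf = np.empty(0, dtype=np.float32)
    for chunk in chunks:
        buf = np.concatenate([buf, np.asarray(chunk, dtype=np.float32)])
        while len(buf) >= frame_size:
            yield buf[:frame_size]
            buf = buf[frame_size:]
    if len(buf):
        yield buf  # trailing partial frame

# Example: three uneven chunks (3 + 6 + 2 samples) regrouped into 4-sample frames.
frames = list(rechunk([np.zeros(3), np.zeros(6), np.zeros(2)], 4))
print([len(f) for f in frames])  # [4, 4, 3]
```

In practice you would replace the dummy chunks with the output of `model.generate_streaming(...)` and hand each frame to your audio callback.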
## Model Details
| Property | Value |
|---|---|
| Architecture | Tokenizer-free diffusion autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4; 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
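From the table, the maximum synthesizable duration per sequence follows from the LM token rate and the maximum sequence length. A quick back-of-the-envelope check (assuming, as an upper bound, that the full 8192-token window holds audio tokens; in practice text tokens share it):

```python
TOKEN_RATE_HZ = 6.25   # LM tokens per second of audio (from the table)
MAX_TOKENS = 8192      # maximum sequence length
SAMPLE_RATE = 48_000   # output sample rate

max_seconds = MAX_TOKENS / TOKEN_RATE_HZ
samples_per_token = SAMPLE_RATE / TOKEN_RATE_HZ

print(f"{max_seconds:.0f} s (~{max_seconds / 60:.1f} min)")  # 1311 s (~21.8 min)
print(f"{samples_per_token:.0f} samples per LM token")       # 7680 samples per LM token
```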
## Performance
VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).
## Fine-tuning
VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:
```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
--config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml
# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
--config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```
See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.
## Limitations
- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
- Performance varies across languages depending on training data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- **Misuse is strictly forbidden**: do not use this model for impersonation, fraud, or disinformation. AI-generated content should be clearly labeled.
## Citation
```bibtex
@article{voxcpm2_2026,
title = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
author = {VoxCPM Team},
journal = {GitHub},
year = {2026},
}
@article{voxcpm2025,
title = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
author = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
journal = {arXiv preprint arXiv:2509.24650},
year = {2025},
}
```
## License
Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.