---
language:
- zh
- en
- ar
- my
- da
- nl
- fi
- fr
- de
- el
- he
- hi
- id
- it
- ja
- km
- ko
- lo
- ms
- no
- pl
- pt
- ru
- es
- sw
- sv
- tl
- th
- tr
- vi
license: apache-2.0
library_name: voxcpm2
tags:
- text-to-speech
- tts
- multilingual
- voice-cloning
- voice-design
- diffusion
- audio
pipeline_tag: text-to-speech
---

# VoxCPM2

**VoxCPM2** is a tokenizer-free, diffusion autoregressive text-to-speech model: **2B parameters**, **30 languages**, **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.

[![GitHub](https://img.shields.io/badge/GitHub-VoxCPM-blue?logo=github)](https://github.com/OpenBMB/VoxCPM)
[![Docs](https://img.shields.io/badge/Docs-ReadTheDocs-8CA1AF)](https://voxcpm.readthedocs.io/en/latest/)
[![Demo](https://img.shields.io/badge/Live%20Playground-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
[![Audio Samples](https://img.shields.io/badge/Audio%20Samples-Demo%20Page-green)](https://openbmb.github.io/voxcpm2-demopage)
[![Discord](https://img.shields.io/badge/Discord-VoxCPM-5865F2?logo=discord&logoColor=white)](https://discord.gg/KZUx7tVNwz)

## Highlights

- 🌍 **30-Language Multilingual**: No language tag needed; input text in any supported language directly
- 🎨 **Voice Design**: Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace, ...); no reference audio required
- 🎛️ **Controllable Cloning**: Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
- 🎙️ **Ultimate Cloning**: Provide reference audio plus its transcript for audio-continuation cloning; every vocal nuance is faithfully reproduced
- 🔊 **48kHz Studio-Quality Output**: Accepts a 16kHz reference; outputs 48kHz via AudioVAE V2's built-in super-resolution, with no external upsampler needed
- 🧠 **Context-Aware Synthesis**: Automatically infers appropriate prosody and expressiveness from the text content
- ⚡ **Real-Time Streaming**: RTF ~0.13 on an RTX 4090 with [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
- 📜 **Fully Open-Source and Commercial-Ready**: Apache-2.0 license, free for commercial use

<details>
<summary><b>Supported Languages (30)</b></summary>

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan, Shaanxi, Shandong, Tianjin, Hokkien
</details>

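The frontmatter and the list above name the same 30 languages. For programmatic input validation it can help to keep them as a lookup table; the mapping below is assembled from those lists, and `is_supported` is an illustrative helper, not part of the `voxcpm` API:

```python
# ISO 639-1 code -> language name, assembled from the lists in this card.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "my": "Burmese", "zh": "Chinese", "da": "Danish",
    "nl": "Dutch", "en": "English", "fi": "Finnish", "fr": "French",
    "de": "German", "el": "Greek", "he": "Hebrew", "hi": "Hindi",
    "id": "Indonesian", "it": "Italian", "ja": "Japanese", "km": "Khmer",
    "ko": "Korean", "lo": "Lao", "ms": "Malay", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "sw": "Swahili", "sv": "Swedish", "tl": "Tagalog", "th": "Thai",
    "tr": "Turkish", "vi": "Vietnamese",
}

def is_supported(code: str) -> bool:
    """Return True if the ISO 639-1 code is one of the 30 supported languages."""
    return code.lower() in SUPPORTED_LANGUAGES
```

Note that no language tag is ever passed to the model itself; a table like this is only useful for validating input on the application side.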
## Quick Start

### Installation

```bash
pip install voxcpm
```

**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)

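Before installing, it can be worth confirming the environment meets the minimums above. The snippet below is a hedged sketch, not part of `voxcpm`; it only parses dotted version strings so they can be compared as tuples:

```python
import sys

# Minimum versions stated in the requirements line above.
MIN_PYTHON = (3, 10)
MIN_TORCH = (2, 5, 0)

def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '2.5.1+cu121' into a comparable tuple."""
    core = version.split("+")[0]  # drop local build tags like '+cu121'
    return tuple(int(part) for part in core.split(".") if part.isdigit())

def python_ok() -> bool:
    """Check the running interpreter against the Python >= 3.10 requirement."""
    return sys.version_info[:2] >= MIN_PYTHON

# Usage sketch (assumes torch is installed):
# import torch
# assert parse_version(torch.__version__) >= MIN_TORCH
```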
### Text-to-Speech

```python
from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
```

### Voice Design

Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:

```python
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```

### Controllable Voice Cloning

```python
# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```

### Ultimate Cloning

Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for the highest similarity:

```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```

### Streaming

```python
import numpy as np
import soundfile as sf

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```

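The RTF numbers quoted in this card follow the usual definition: wall-clock synthesis time divided by the duration of the generated audio, so RTF < 1 means faster than real time. A small helper for measuring it around any of the snippets above (illustrative, not part of the `voxcpm` API):

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# Usage sketch around a generate() call:
# import time
# t0 = time.perf_counter()
# wav = model.generate(text="...")
# rtf = real_time_factor(time.perf_counter() - t0, len(wav), model.tts_model.sample_rate)
```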
## Model Details

| Property | Value |
|---|---|
| Architecture | Tokenizer-free diffusion autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4; 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours of multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |

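From the table, the token rate and maximum sequence length bound how much audio a single sequence can represent: at 6.25 Hz each LM token covers 160 ms of audio, so 8192 tokens correspond to roughly 21.8 minutes (a theoretical upper bound on one sequence). The arithmetic:

```python
TOKEN_RATE_HZ = 6.25   # LM token rate from the table (tokens per second of audio)
MAX_TOKENS = 8192      # maximum sequence length from the table

seconds_per_token = 1.0 / TOKEN_RATE_HZ          # 0.16 s, i.e. 160 ms per token
max_audio_seconds = MAX_TOKENS / TOKEN_RATE_HZ   # 1310.72 s
max_audio_minutes = max_audio_seconds / 60.0     # about 21.8 minutes
```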
## Performance

VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.

See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).

## Fine-tuning

VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:

```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```

See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.

## Limitations

- Voice Design and Style Control results may vary between runs; generating 1–3 candidates and picking the best is recommended.
- Performance varies across languages depending on training-data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- Use of this model for impersonation, fraud, or disinformation is **strictly forbidden**; AI-generated content should be clearly labeled.

## Citation

```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```

## License

Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.