---
language:
- zh
- en
- ar
- my
- da
- nl
- fi
- fr
- de
- el
- he
- hi
- id
- it
- ja
- km
- ko
- lo
- ms
- no
- pl
- pt
- ru
- es
- sw
- sv
- tl
- th
- tr
- vi
license: apache-2.0
library_name: voxcpm
tags:
- text-to-speech
- tts
- multilingual
- voice-cloning
- voice-design
- diffusion
- audio
pipeline_tag: text-to-speech
---

# VoxCPM2

**VoxCPM2** is a tokenizer-free, diffusion autoregressive text-to-speech model: **2B parameters**, **30 languages**, **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.

[GitHub](https://github.com/OpenBMB/VoxCPM) ·
[Documentation](https://voxcpm.readthedocs.io/en/latest/) ·
[Demo Space](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) ·
[Demo Page](https://openbmb.github.io/voxcpm2-demopage) ·
[Discord](https://discord.gg/KZUx7tVNwz)

## Highlights

- 🌍 **30-Language Multilingual** — No language tag needed; input text in any supported language directly
- 🎨 **Voice Design** — Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace, and more); no reference audio required
- 🎛️ **Controllable Cloning** — Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
- 🎙️ **Ultimate Cloning** — Provide reference audio plus its transcript for audio-continuation cloning that faithfully reproduces every vocal nuance
- 🔊 **48kHz Studio-Quality Output** — Accepts 16kHz reference audio and outputs 48kHz via AudioVAE V2's built-in super-resolution; no external upsampler needed
- 🧠 **Context-Aware Synthesis** — Automatically infers appropriate prosody and expressiveness from the text content
- ⚡ **Real-Time Streaming** — RTF as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated with [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
- 📜 **Fully Open-Source & Commercial-Ready** — Apache-2.0 license, free for commercial use

<details>
<summary><b>Supported Languages (30)</b></summary>

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan (河南话), Shaanxi (陕西话), Shandong (山东话), Tianjin (天津话), Hokkien (闽南话)
</details>
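The streaming RTF figures above can be read as follows: real-time factor is synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation. A minimal illustration (the helper name is ours, not part of the `voxcpm` API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.
    Values below 1.0 mean faster-than-real-time generation."""
    return synthesis_seconds / audio_seconds

# At RTF 0.3, 10 seconds of audio takes about 3 seconds to synthesize.
print(real_time_factor(3.0, 10.0))  # 0.3
```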

## Quick Start

### Installation

```bash
pip install voxcpm
```

**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)

### Text-to-Speech

```python
from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
```
### Voice Design

Put the voice description in parentheses at the start of `text`, immediately followed by the content to synthesize:

```python
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```
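Since the description is just a parenthesized prefix on `text`, a tiny helper (hypothetical, not part of the `voxcpm` API) keeps the convention in one place:

```python
def design_prompt(description: str, content: str) -> str:
    """Prepend a parenthesized voice description to the text to synthesize,
    following the "(description)content" convention shown above."""
    return f"({description}){content}"

text = design_prompt("A young woman, gentle and sweet voice", "Hello, welcome to VoxCPM2!")
print(text)  # (A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!
```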

### Controllable Voice Cloning

```python
# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```
### Ultimate Cloning

Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for the highest similarity:

```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```
### Streaming

```python
import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```
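Collecting every chunk before writing gives up part of the latency benefit of streaming. If low-latency output is the goal, chunks can instead be flushed to disk (or a playback buffer) as they arrive. A self-contained sketch using only the standard library, with a stand-in generator in place of `model.generate_streaming(...)` (assumed here to yield float waveform chunks in [-1, 1]):

```python
import math
import wave
import array

def write_wav_streaming(path, chunks, sample_rate=48000):
    """Write float chunks (values in [-1, 1]) to a 16-bit mono WAV
    incrementally, so audio reaches disk as it is generated."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)          # 16-bit PCM
        f.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1] and scale to int16 before appending frames.
            pcm = array.array("h", (int(max(-1.0, min(1.0, x)) * 32767) for x in chunk))
            f.writeframes(pcm.tobytes())

# Stand-in generator: a 440 Hz tone split into two chunks. In real use,
# pass model.generate_streaming(text=...) instead.
def fake_chunks(n_chunks=2, chunk_len=2400, rate=48000):
    for c in range(n_chunks):
        start = c * chunk_len
        yield [0.5 * math.sin(2 * math.pi * 440 * (start + i) / rate) for i in range(chunk_len)]

write_wav_streaming("streaming_incremental.wav", fake_chunks())
```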

## Model Details

| Property | Value |
|---|---|
| Architecture | Tokenizer-free diffusion autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4; 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours of multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |

## Performance

VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.

See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).
## Fine-tuning

VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:

```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```

See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.
## Limitations

- Voice Design and Style Control results may vary between runs; generating 1–3 candidates and picking the best is recommended.
- Performance varies across languages depending on training data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- Use for impersonation, fraud, or disinformation is **strictly forbidden**; AI-generated content should be clearly labeled.
## Citation

```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```
## License

Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.