---
language:
- zh
- en
- ar
- my
- da
- nl
- fi
- fr
- de
- el
- he
- hi
- id
- it
- ja
- km
- ko
- lo
- ms
- no
- pl
- pt
- ru
- es
- sw
- sv
- tl
- th
- tr
- vi
license: apache-2.0
library_name: voxcpm
tags:
- text-to-speech
- tts
- multilingual
- voice-cloning
- voice-design
- diffusion
- audio
pipeline_tag: text-to-speech
---

# VoxCPM2

**VoxCPM2** is a tokenizer-free, diffusion autoregressive Text-to-Speech model — **2B parameters**, **30 languages**, **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.

[GitHub](https://github.com/OpenBMB/VoxCPM) · [Documentation](https://voxcpm.readthedocs.io/en/latest/) · [Demo Space](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) · [Demo Page](https://openbmb.github.io/voxcpm2-demopage) · [Discord](https://discord.gg/KZUx7tVNwz) · [Feishu](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=acds0b9d-23d8-4d7e-b696-d200f3e22a7f)

## Highlights

- 🌍 **30-Language Multilingual** — No language tag needed; input text in any supported language directly
- 🎨 **Voice Design** — Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…); no reference audio required
- 🎛️ **Controllable Cloning** — Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
- 🎙️ **Ultimate Cloning** — Provide reference audio + its transcript for audio-continuation cloning; every vocal nuance faithfully reproduced
- 🔊 **48kHz Studio-Quality Output** — Accepts 16kHz reference; outputs 48kHz via AudioVAE V2's built-in super-resolution, no external upsampler needed
- 🧠 **Context-Aware Synthesis** — Automatically infers appropriate prosody and expressiveness from text content
- ⚡ **Real-Time Streaming** — RTF as low as ~0.3 on NVIDIA RTX 4090, and ~0.13 accelerated by [Nano-VLLM](https://github.com/a710128/nanovllm-voxcpm)
- 📜 **Fully Open-Source & Commercial-Ready** — Apache-2.0 license, free for commercial use

<details>
<summary><b>Supported Languages (30)</b></summary>

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan (河南话), Shaanxi (陕西话), Shandong (山东话), Tianjin (天津话), Hokkien (闽南话)

</details>

## Quick Start

### Installation

```bash
pip install voxcpm
```

**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)
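
To confirm that your environment meets these requirements before downloading the weights, a quick check along these lines (plain PyTorch, nothing VoxCPM-specific) is enough:

```python
import torch

print(torch.__version__)          # expect 2.5.0 or newer
print(torch.version.cuda)         # expect a CUDA 12.x build
print(torch.cuda.is_available())  # True if a usable GPU is visible
```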

### Text-to-Speech

```python
from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
```
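
For finer control, `generate()` also accepts optional arguments for voice-cloning prompts and for the built-in text-normalization, denoising, and bad-case-retry tools, shown here with illustrative values:

```python
wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
    prompt_wav_path=None,               # optional: path to a prompt speech clip for voice cloning
    prompt_text=None,                   # optional: transcript of the prompt speech
    cfg_value=2.0,                      # LM guidance on LocDiT; higher follows the prompt more closely, but may sound worse
    inference_timesteps=10,             # LocDiT inference timesteps; higher for better quality, lower for speed
    normalize=True,                     # enable the external text-normalization tool
    denoise=True,                       # enable the external denoising tool
    retry_badcase=True,                 # retry when a bad case (e.g. unstoppable generation) is detected
    retry_badcase_max_times=3,          # maximum number of retries
    retry_badcase_ratio_threshold=6.0,  # length-ratio threshold for bad-case detection; raise it for slow-paced speech
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
```

Note that the Quick Start above loads the model with `load_denoiser=False`; if you set `denoise=True`, you will likely need to load the model with the denoiser enabled.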

### Voice Design

Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:

```python
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```

### Controllable Voice Cloning

```python
# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```

### Ultimate Cloning

Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for the highest similarity:

```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```

### Streaming

```python
import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)

wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```
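
If you would rather not buffer the whole waveform in memory, streamed chunks can also be written to disk as they arrive. A minimal sketch, assuming each chunk is a 1-D float array at the model's sample rate:

```python
import soundfile as sf

# Open the output file once and append each streamed chunk as it is produced
with sf.SoundFile("streaming.wav", mode="w",
                  samplerate=model.tts_model.sample_rate, channels=1) as f:
    for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
        f.write(chunk)
```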

## Model Details

| Property | Value |
|---|---|
| Architecture | Tokenizer-free diffusion autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4, ~2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours of multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
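
As a rough back-of-the-envelope from the token rate and context length above, an 8192-token window corresponds to at most about 22 minutes of audio; in practice the usable length is shorter, since text and prompt tokens share the same window:

```python
# Upper bound on audio per context window (assumes every token is an audio token)
max_tokens = 8192
token_rate_hz = 6.25
print(max_tokens / token_rate_hz)       # ≈ 1310 seconds
print(max_tokens / token_rate_hz / 60)  # ≈ 21.8 minutes
```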

## Performance

VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).

## Fine-tuning

VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:

```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```

See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.

## Limitations

- Voice Design and style-control results may vary between runs; generating 1–3 times is recommended to obtain the desired output (see the sketch after this list).
- Performance varies across languages depending on training-data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- Use for impersonation, fraud, or disinformation is strictly forbidden; AI-generated content should be clearly labeled.
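
Since results vary between runs, one simple pattern is to synthesize a few candidates with the same `generate()` call shown in the Quick Start and keep the best one by ear; a minimal sketch:

```python
# Generate several voice-design candidates and save each for comparison
for i in range(3):
    wav = model.generate(
        text="(A calm, middle-aged male narrator)Welcome to VoxCPM2!",
        cfg_value=2.0,
        inference_timesteps=10,
    )
    sf.write(f"design_candidate_{i}.wav", wav, model.tts_model.sample_rate)
```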

## Citation

```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```

## License

Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.