Duplicate from openbmb/VoxCPM2

Co-authored-by: Yixuan Zhou <zhouyx1998@users.noreply.huggingface.co>

Files changed (8)

.gitattributes +35 -0
README.md +225 -0
audiovae.pth +3 -0
config.json +67 -0
model.safetensors +3 -0
special_tokens_map.json +81 -0
tokenizer.json +0 -0
tokenizer_config.json +212 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
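
Reviewer note: the rules above route matching files through Git LFS. The sketch below checks which files in this commit fall under them. It is an approximation only: it uses Python's `fnmatch` on basenames, which is not git's own pattern matcher, so it copies just a few of the simple globs from the hunk and skips the directory rule `saved_model/**/*`. The helper name `is_lfs_tracked` is made up for illustration.

```python
from fnmatch import fnmatch

# A subset of the simple basename globs from the .gitattributes hunk.
# Assumption: fnmatch is close enough to git's matching for "*.ext" rules.
LFS_PATTERNS = ["*.bin", "*.pt", "*.pth", "*.safetensors", "*.zip", "*tfevents*"]

# File names listed under "Files changed (8)" in this commit.
COMMIT_FILES = [
    ".gitattributes",
    "README.md",
    "audiovae.pth",
    "config.json",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def is_lfs_tracked(filename: str) -> bool:
    """Return True if the basename matches any of the simple LFS globs."""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


for name in COMMIT_FILES:
    print(f"{name}: {'LFS' if is_lfs_tracked(name) else 'regular git'}")
```

Under this approximation only `audiovae.pth` and `model.safetensors` match, which agrees with the two LFS pointer hunks further down in the diff.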

README.md ADDED

	@@ -0,0 +1,225 @@

+---
+language:
+- zh
+- en
+- ar
+- my
+- da
+- nl
+- fi
+- fr
+- de
+- el
+- he
+- hi
+- id
+- it
+- ja
+- km
+- ko
+- lo
+- ms
+- no
+- pl
+- pt
+- ru
+- es
+- sw
+- sv
+- tl
+- th
+- tr
+- vi
+license: apache-2.0
+library_name: voxcpm
+tags:
+- text-to-speech
+- tts
+- multilingual
+- voice-cloning
+- voice-design
+- diffusion
+- audio
+pipeline_tag: text-to-speech
+---
+# VoxCPM2
+**VoxCPM2** is a tokenizer-free, diffusion autoregressive Text-to-Speech model — **2B parameters**, **30 languages**, **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.
+[![GitHub](https://img.shields.io/badge/GitHub-VoxCPM-blue?logo=github)](https://github.com/OpenBMB/VoxCPM)
+[![Docs](https://img.shields.io/badge/Docs-ReadTheDocs-8CA1AF)](https://voxcpm.readthedocs.io/en/latest/)
+[![Demo](https://img.shields.io/badge/Live%20Playground-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
+[![Audio Samples](https://img.shields.io/badge/Audio%20Samples-Demo%20Page-green)](https://openbmb.github.io/voxcpm2-demopage)
+[![Discord](https://img.shields.io/badge/Discord-VoxCPM-5865F2?logo=discord&logoColor=white)](https://discord.gg/KZUx7tVNwz)
+## Highlights
+- 🌍 **30-Language Multilingual** — No language tag needed; input text in any supported language directly
+- 🎨 **Voice Design** — Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…); no reference audio required
+- 🎛️ **Controllable Cloning** — Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
+- 🎙️ **Ultimate Cloning** — Provide reference audio + its transcript for audio-continuation cloning; every vocal nuance faithfully reproduced
+- 🔊 **48kHz Studio-Quality Output** — Accepts 16kHz reference; outputs 48kHz via AudioVAE V2's built-in super-resolution, no external upsampler needed
+- 🧠 **Context-Aware Synthesis** — Automatically infers appropriate prosody and expressiveness from text content
+- ⚡ **Real-Time Streaming** — RTF as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
+- 📜 **Fully Open-Source & Commercial-Ready** — Apache-2.0 license, free for commercial use
+<details>
+<summary><b>Supported Languages (30)</b></summary>
+Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
+Chinese Dialects: 四川话, 粤语, 吴语, 东北话, 河南话, 陕西话, 山东话, 天津话, 闽南话
+</details>
+## Quick Start
+### Installation
+```bash
+pip install voxcpm
+```
+**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)
+### Text-to-Speech
+```python
+from voxcpm import VoxCPM
+import soundfile as sf
+model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
+wav = model.generate(
+    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
+    cfg_value=2.0,
+    inference_timesteps=10,
+)
+sf.write("output.wav", wav, model.tts_model.sample_rate)
+```
+### Voice Design
+Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:
+```python
+wav = model.generate(
+    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
+    cfg_value=2.0,
+    inference_timesteps=10,
+)
+sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
+```
+### Controllable Voice Cloning
+```python
+# Basic cloning
+wav = model.generate(
+    text="This is a cloned voice generated by VoxCPM2.",
+    reference_wav_path="speaker.wav",
+)
+sf.write("clone.wav", wav, model.tts_model.sample_rate)
+# Cloning with style control
+wav = model.generate(
+    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
+    reference_wav_path="speaker.wav",
+    cfg_value=2.0,
+    inference_timesteps=10,
+)
+sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
+```
+### Ultimate Cloning
+Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for highest similarity:
+```python
+wav = model.generate(
+    text="This is an ultimate cloning demonstration using VoxCPM2.",
+    prompt_wav_path="speaker_reference.wav",
+    prompt_text="The transcript of the reference audio.",
+    reference_wav_path="speaker_reference.wav",
+)
+sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
+```
+### Streaming
+```python
+import numpy as np
+chunks = []
+for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
+    chunks.append(chunk)
+wav = np.concatenate(chunks)
+sf.write("streaming.wav", wav, model.tts_model.sample_rate)
+```
+## Model Details
+| Property | Value |
+|---|---|
+| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
+| Backbone | Based on MiniCPM-4; 2B parameters in total |
+| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
+| Training Data | 2M+ hours multilingual speech |
+| LM Token Rate | 6.25 Hz |
+| Max Sequence Length | 8192 tokens |
+| dtype | bfloat16 |
+| VRAM | ~8 GB |
+| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
+## Performance
+VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
+See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).
+## Fine-tuning
+VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:
+```bash
+# LoRA fine-tuning (recommended)
+python scripts/train_voxcpm_finetune.py \
+    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml
+# Full fine-tuning
+python scripts/train_voxcpm_finetune.py \
+    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
+```
+See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.
+## Limitations
+- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
+- Performance varies across languages depending on training data availability.
+- Occasional instability may occur with very long or highly expressive inputs.
+- Use for impersonation, fraud, or disinformation is **strictly forbidden**. AI-generated content should be clearly labeled.
+## Citation
+```bibtex
+@article{voxcpm2_2026,
+  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
+  author  = {VoxCPM Team},
+  journal = {GitHub},
+  year    = {2026},
+}
+@article{voxcpm2025,
+  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
+  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
+             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
+             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
+  journal = {arXiv preprint arXiv:2509.24650},
+  year    = {2025},
+}
+```
+## License
+Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.
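
Reviewer note: the README quotes RTF figures (~0.30 and ~0.13) without defining the term. Under the usual definition, real-time factor is synthesis wall-clock time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows that arithmetic. The function names and the `synthesize` callable are hypothetical stand-ins, not part of the `voxcpm` API, and the README does not say how its own figures were measured.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def timed_rtf(synthesize: Callable[[], Sequence[float]], sample_rate: int) -> float:
    """Time a zero-argument callable that returns a waveform and report its RTF.

    `synthesize` is a hypothetical wrapper, e.g. a lambda around a TTS call.
    """
    start = time.perf_counter()
    wav = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(wav), sample_rate)


# Illustrative numbers only: 3 s of compute for 10 s of 48 kHz audio.
print(real_time_factor(3.0, 480_000, 48_000))
```

With the snippets above, `timed_rtf` could wrap a call such as `lambda: model.generate(text=...)` and take `model.tts_model.sample_rate` as the rate, assuming the returned waveform supports `len()`.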

audiovae.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:94b5d51e107e0507d4acc976cfdadb64edd6fd06d1f751dadbf2fd1594274bf1
+size 376951122

config.json ADDED

	@@ -0,0 +1,67 @@

+{
+    "architecture": "voxcpm2",
+    "lm_config": {
+        "bos_token_id": 1,
+        "eos_token_id": 2,
+        "hidden_size": 2048,
+        "intermediate_size": 6144,
+        "max_position_embeddings": 32768,
+        "num_attention_heads": 16,
+        "num_hidden_layers": 28,
+        "num_key_value_heads": 2,
+        "rms_norm_eps": 1e-05,
+        "rope_theta": 10000,
+        "kv_channels": 128,
+        "rope_scaling": {
+            "type": "longrope",
+            "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.615569542115128, 5.2684819496549835, 6.014438591970396, 6.858830049237097, 7.804668263503327, 8.851768731513417, 9.99600492938444, 11.228766118181639, 12.536757560834843, 13.902257701387796, 15.303885189125953, 16.717837610115794, 18.119465097853947, 19.484965238406907, 20.792956681060105, 22.02571786985731, 23.16995406772833, 24.217054535738416, 25.16289275000465, 26.007284207271347, 26.753240849586767, 27.40615325712662, 27.973003419175363, 28.461674954469114, 28.880393889607006, 29.237306864684626, 29.540186419591297, 29.79624387177199, 30.01202719065413, 30.193382037992453, 30.34545697551969, 30.47273746338473, 30.579096895249787, 30.66785612408345, 30.741845563814174, 30.80346599254902, 30.85474569563567, 30.897392663720595, 30.932841297560394, 30.962293553185553, 30.986754758742034, 31.007064503249293, 31.02392307921529],
+            "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.615569542115128, 5.2684819496549835, 6.014438591970396, 6.858830049237097, 7.804668263503327, 8.851768731513417, 9.99600492938444, 11.228766118181639, 12.536757560834843, 13.902257701387796, 15.303885189125953, 16.717837610115794, 18.119465097853947, 19.484965238406907, 20.792956681060105, 22.02571786985731, 23.16995406772833, 24.217054535738416, 25.16289275000465, 26.007284207271347, 26.753240849586767, 27.40615325712662, 27.973003419175363, 28.461674954469114, 28.880393889607006, 29.237306864684626, 29.540186419591297, 29.79624387177199, 30.01202719065413, 30.193382037992453, 30.34545697551969, 30.47273746338473, 30.579096895249787, 30.66785612408345, 30.741845563814174, 30.80346599254902, 30.85474569563567, 30.897392663720595, 30.932841297560394, 30.962293553185553, 30.986754758742034, 31.007064503249293, 31.02392307921529],
+            "original_max_position_embeddings": 32768
+        },
+        "vocab_size": 73448,
+        "use_mup": false,
+        "scale_emb": 12,
+        "dim_model_base": 256,
+        "scale_depth": 1.4
+    },
+    "patch_size": 4,
+    "feat_dim": 64,
+    "scalar_quantization_latent_dim": 512,
+    "scalar_quantization_scale": 9,
+    "residual_lm_num_layers": 8,
+    "residual_lm_no_rope": true,
+    "encoder_config": {
+        "hidden_dim": 1024,
+        "ffn_dim": 4096,
+        "num_heads": 16,
+        "num_layers": 12,
+        "kv_channels": 128
+    },
+    "dit_config": {
+        "hidden_dim": 1024,
+        "ffn_dim": 4096,
+        "num_heads": 16,
+        "num_layers": 12,
+        "kv_channels": 128,
+        "mean_mode": false,
+        "cfm_config": {
+            "sigma_min": 1e-06,
+            "solver": "euler",
+            "t_scheduler": "log-norm",
+            "inference_cfg_rate": 2.0
+        }
+    },
+    "audio_vae_config": {
+        "encoder_dim": 128,
+        "encoder_rates": [2, 5, 8, 8],
+        "latent_dim": 64,
+        "decoder_dim": 2048,
+        "decoder_rates": [8, 6, 5, 2, 2, 2],
+        "sr_bin_boundaries": [20000, 30000, 40000],
+        "sample_rate": 16000,
+        "out_sample_rate": 48000
+    },
+    "max_length": 8192,
+    "device": "cuda",
+    "dtype": "bfloat16"
+}
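
Reviewer note: the rate fields in `audio_vae_config` can be cross-checked against the README's Model Details table. The sketch below assumes that `encoder_rates` and `decoder_rates` are per-stage stride factors that multiply together, and that `patch_size` is the number of latent frames grouped into one LM step. Neither reading is stated in this diff, but under it the numbers reproduce the README's 6.25 Hz LM token rate and 48 kHz output.

```python
from math import prod

# Values copied from the config.json hunk above.
audio_vae_config = {
    "encoder_rates": [2, 5, 8, 8],
    "decoder_rates": [8, 6, 5, 2, 2, 2],
    "sample_rate": 16000,
    "out_sample_rate": 48000,
}
patch_size = 4

# Assumption: the encoder strides multiply into one overall downsampling factor.
encoder_hop = prod(audio_vae_config["encoder_rates"])  # input samples per latent frame
latent_rate_hz = audio_vae_config["sample_rate"] / encoder_hop

# Assumption: one LM step consumes `patch_size` consecutive latent frames.
lm_token_rate_hz = latent_rate_hz / patch_size

# Assumption: the decoder strides multiply into one overall upsampling factor.
decoder_upsample = prod(audio_vae_config["decoder_rates"])  # output samples per latent frame
decoded_rate_hz = latent_rate_hz * decoder_upsample

print(f"encoder hop:      {encoder_hop} samples")
print(f"latent rate:      {latent_rate_hz} Hz")
print(f"LM token rate:    {lm_token_rate_hz} Hz")
print(f"decoder upsample: {decoder_upsample}x")
print(f"decoded rate:     {decoded_rate_hz} Hz")
```

The decoder factor is three times the encoder factor (1920 versus 640), which is consistent with the README's description of an asymmetric VAE that takes 16 kHz in and produces 48 kHz out.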

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f7f964cfa9da23653baec6e6f7750719977ad944ed9f95fe52fe3a620506891d
+size 4580080592
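
Reviewer note: the three added lines are a Git LFS pointer, not the weights themselves. The sketch below parses the pointer and turns the byte size into a rough parameter count. The estimate assumes every tensor is stored as bfloat16 (2 bytes per value, matching `"dtype": "bfloat16"` in config.json) and ignores the safetensors header, so treat it as a sanity check rather than an exact figure. The helper name `parse_lfs_pointer` is illustrative.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the model.safetensors hunk above.
MODEL_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f7f964cfa9da23653baec6e6f7750719977ad944ed9f95fe52fe3a620506891d
size 4580080592
"""

info = parse_lfs_pointer(MODEL_POINTER)
size_gb = info["size"] / 1e9

# Assumption: all weights are bfloat16, i.e. 2 bytes per parameter.
approx_params = info["size"] / 2

print(f"{info['hash_algo']} digest, {size_gb:.2f} GB on disk")
print(f"roughly {approx_params / 1e9:.2f}B parameters if stored as bfloat16")
```

Under that assumption the file holds a little under 2.3 billion values, in line with the README's "2B parameters" claim.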

special_tokens_map.json ADDED

	@@ -0,0 +1,81 @@

+{
+  "additional_special_tokens": [
+    {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|tool_call|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|execute_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|execute_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,212 @@

+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "<|audio_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "<|audio_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "<|audio_prompt_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "104": {
+      "content": "<|audio_prompt_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "105": {
+      "content": "<|background|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "106": {
+      "content": "<|/background|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "107": {
+      "content": "<|characters|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "108": {
+      "content": "<|/characters|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "109": {
+      "content": "<|speaker_id|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "110": {
+      "content": "<|/speaker_id|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "111": {
+      "content": "<|span|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "112": {
+      "content": "<|/span|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73440": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73441": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73442": {
+      "content": "<|tool_call|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73443": {
+      "content": "<|execute_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73444": {
+      "content": "<|execute_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73445": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73446": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73447": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_end|>",
+    "<|im_start|>",
+    "<|tool_call|>",
+    "<|execute_start|>",
+    "<|execute_end|>",
+    "<|fim_prefix|>",
+    "<|fim_middle|>",
+    "<|fim_suffix|>"
+  ],
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "legacy": true,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": null,
+  "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false,
+  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
+}
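
Reviewer note: the `chat_template` value is a Jinja template. To make its output concrete, the sketch below is a hand-written plain-Python mirror of the template logic (it does not invoke Jinja or the tokenizer). The function name `render_chat` is illustrative, and nothing in this diff says whether the TTS pipeline actually uses the template.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Mirror of the chat_template: wrap each message in <|im_start|> ... <|im_end|>."""
    parts = []
    for message in messages:
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # The template opens an assistant turn and leaves it unterminated.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = render_chat(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
print(example)
```

Note that `tokenizer_config.json` sets `eos_token` to `<|im_end|>`, the same marker the template uses to close each turn, while `special_tokens_map.json` lists `</s>` as `eos_token`.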