+---
+license: apache-2.0
+base_model: openbmb/VoxCPM2
+tags:
+  - mlx
+  - tts
+  - text-to-speech
+  - voice-cloning
+  - voice-design
+  - multilingual
+library_name: mlx-audio
+pipeline_tag: text-to-speech
+language:
+  - en
+  - zh
+  - id
+  - ja
+  - ko
+  - multilingual
+---
+# VoxCPM2 - 4-bit quantized
+MLX port of [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2), a 2B-parameter multilingual TTS model with 48 kHz studio-quality output, voice cloning, and voice design.
+This variant is 4-bit quantized: only the LM layers are quantized, while the VAE and DiT stay at full precision. It is the smallest and fastest of the three variants listed under Performance, with minimal quality loss.
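To make "4-bit quantized" concrete, here is a minimal pure-Python sketch of group-wise affine quantization. It illustrates the general idea only: the helper names, the sample weights, and the exact rounding scheme are assumptions, and the conversion that produced these weights may differ in detail.

```python
def quantize_group(weights, bits=4):
    """Map a group of floats to integer codes in [0, 2**bits - 1] (illustrative helper)."""
    levels = (1 << bits) - 1               # 15 distinct steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from the stored codes, scale, and bias."""
    return [c * scale + bias for c in codes]


# Made-up weights for one small group.
weights = [-0.30, -0.12, 0.00, 0.07, 0.15, 0.30]
codes, scale, bias = quantize_group(weights)
restored = dequantize_group(codes, scale, bias)

# The reconstruction error per weight is bounded by about half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error = {max_err:.4f}")
```

Each group stores one small integer per weight plus a shared scale and bias, which is where the size reduction comes from.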
+## Features
+- **30 languages**: English, Chinese, Indonesian, Japanese, Korean, and more
+- **48 kHz output**: studio-quality audio
+- **Voice Design**: create voices from text descriptions (no reference audio needed)
+- **Voice Cloning**: clone any voice from a short audio reference
+- **Four generation modes**: zero-shot, continuation, reference cloning, combined
+## Usage
+```bash
+pip install mlx-audio
+# Zero-shot: text only, no voice description or reference audio
+python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit --text "Hello world" --verbose
+# Voice design: describe the target voice with --instruct
+python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
+  --text "Hello world" \
+  --instruct "A young woman, gentle and sweet voice"
+# Voice cloning: --ref_audio is the reference recording, --ref_text its transcript
+python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
+  --text "Hello world" \
+  --ref_audio speaker.wav --ref_text "reference text"
+```
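For batch jobs it can be handy to assemble these commands from Python. The following is a minimal sketch that uses only the flags shown above; the helper name `build_tts_command` is hypothetical and not part of mlx-audio, and the block only prints the command rather than running it.

```python
import shlex
import sys

MODEL = "mlx-community/VoxCPM2-4bit"


def build_tts_command(text, instruct=None, ref_audio=None, ref_text=None):
    """Return the argv list for one mlx_audio.tts.generate call (hypothetical helper)."""
    cmd = [sys.executable, "-m", "mlx_audio.tts.generate",
           "--model", MODEL, "--text", text]
    if instruct is not None:        # voice design
        cmd += ["--instruct", instruct]
    if ref_audio is not None:       # voice cloning
        cmd += ["--ref_audio", ref_audio]
        if ref_text is not None:
            cmd += ["--ref_text", ref_text]
    return cmd


cmd = build_tts_command("Hello world",
                        instruct="A young woman, gentle and sweet voice")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the text or voice description contains quotes or spaces.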
+### Python API
+```python
+from mlx_audio.tts import load_model
+
+# Load the 4-bit weights from the Hugging Face Hub
+model = load_model("mlx-community/VoxCPM2-4bit")
+
+# Iterate over the results yielded by generate()
+for result in model.generate(
+    text="Hello, this is VoxCPM2 on Apple Silicon.",
+    inference_timesteps=7,
+    cfg_value=2.0,
+):
+    print(f"Duration: {result.audio_duration}")
+```
+## Performance (Apple Silicon)
+| Variant | Size | RTF (7 timesteps) |
+|---------|------|--------------------|
+| bf16 | 4.96 GB | 0.48x |
+| **8-bit** | **3.23 GB** | **0.85x** |
+| **4-bit** | **2.30 GB** | **0.90x** |
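+*RTF = Real-Time Factor, reported here as a speed multiple: values above 1.0 mean faster than real time. All three variants score below 1.0 in this benchmark, so generation is slower than real time, with 4-bit the closest.*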
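As a rough guide to what these numbers mean in practice, the sketch below converts the table into generation time and size savings. It assumes RTF is seconds of audio produced per second of compute, which is an interpretation of the note above rather than something the benchmark states; the function names are illustrative.

```python
# Figures copied from the Performance table above.
VARIANTS = {
    "bf16": {"size_gb": 4.96, "rtf": 0.48},
    "8-bit": {"size_gb": 3.23, "rtf": 0.85},
    "4-bit": {"size_gb": 2.30, "rtf": 0.90},
}


def generation_seconds(audio_seconds, rtf):
    """Wall-clock time to synthesize the audio, assuming rtf = audio time / compute time."""
    return audio_seconds / rtf


def size_saving(variant, baseline="bf16"):
    """Fraction of disk size saved relative to the baseline variant."""
    return 1.0 - VARIANTS[variant]["size_gb"] / VARIANTS[baseline]["size_gb"]


for name, v in VARIANTS.items():
    t = generation_seconds(10.0, v["rtf"])
    print(f"{name}: 10 s of audio in ~{t:.1f} s, "
          f"{size_saving(name):.0%} smaller than bf16")
```

Under that assumption, the 4-bit variant needs about 11 seconds to produce 10 seconds of audio and is roughly half the size of the bf16 weights.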
+## Original Model
+- [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2)
+- Apache 2.0 License
+Converted with [mlx-audio](https://github.com/Blaizzy/mlx-audio).