---
license: apache-2.0
base_model: openbmb/VoxCPM2
tags:
  - mlx
  - tts
  - text-to-speech
  - voice-cloning
  - voice-design
  - multilingual
library_name: mlx-audio
pipeline_tag: text-to-speech
language:
  - en
  - zh
  - id
  - ja
  - ko
  - multilingual
---

# VoxCPM2 - 4-bit quantized

MLX port of [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2) — a 2B-parameter multilingual TTS model with 48kHz studio-quality output, voice cloning, and voice design.

Only the LM layers are quantized to 4 bits; the VAE and DiT remain at full precision. This is the smallest and fastest of the three variants in the performance table below, with minimal quality loss.

## Features
- **30 languages** — including English, Chinese, Indonesian, Japanese, Korean, and more
- **48kHz output** — studio-quality audio
- **Voice Design** — create voices from text descriptions (no reference audio needed)
- **Voice Cloning** — clone any voice from a short audio reference
- **4 generation modes** — zero-shot, continuation, reference cloning, combined

## Usage

```bash
pip install mlx-audio

# Zero-shot
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit --text "Hello world" --verbose

# Voice design
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
  --text "Hello world" \
  --instruct "A young woman, gentle and sweet voice"

# Voice cloning
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
  --text "Hello world" \
  --ref_audio speaker.wav --ref_text "reference text"
```
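If you would rather drive the CLI from a script, the commands above can be assembled programmatically. The sketch below only builds the argument list and uses only the flags shown in the examples above; `build_tts_command` is a hypothetical helper name, not part of mlx-audio.

```python
import sys
from typing import List, Optional


def build_tts_command(
    text: str,
    model: str = "mlx-community/VoxCPM2-4bit",
    instruct: Optional[str] = None,
    ref_audio: Optional[str] = None,
    ref_text: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Build an argv list for `python -m mlx_audio.tts.generate`.

    Hypothetical helper: it only mirrors the flags used in the examples
    above and does not validate them against the CLI.
    """
    cmd = [
        sys.executable, "-m", "mlx_audio.tts.generate",
        "--model", model,
        "--text", text,
    ]
    if instruct is not None:
        # Voice design: describe the voice in plain text.
        cmd += ["--instruct", instruct]
    if ref_audio is not None:
        # Voice cloning: path to a short reference recording.
        cmd += ["--ref_audio", ref_audio]
    if ref_text is not None:
        # Transcript of the reference recording.
        cmd += ["--ref_text", ref_text]
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    # Print the voice-design command instead of running it.
    print(build_tts_command(
        "Hello world",
        instruct="A young woman, gentle and sweet voice",
    ))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where mlx-audio is installed.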

### Python API

```python
from mlx_audio.tts import load_model

model = load_model("mlx-community/VoxCPM2-4bit")

# generate() is a generator: iterate over it to receive the results.
for result in model.generate(
    text="Hello, this is VoxCPM2 on Apple Silicon.",
    inference_timesteps=7,  # same setting as the benchmark table below
    cfg_value=2.0,
):
    print(f"Duration: {result.audio_duration}")
```

## Performance (Apple Silicon)

| Variant | Size | RTF (7 timesteps) |
|---------|------|--------------------|
| bf16 | 4.96 GB | 0.48x |
| **8-bit** | **3.23 GB** | **0.85x** |
| **4-bit** | **2.30 GB** | **0.90x** |

*RTF = real-time factor, given here as a speed multiple: values above 1.0x mean generation is faster than real time.*
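
To make the table easier to interpret, here is a small sketch that converts the figures into estimated generation time and relative size. It assumes the RTF column is audio seconds divided by generation seconds, which the footnote implies but does not state outright; the function names are illustrative only.

```python
# Figures copied from the table above: (size in GB, RTF at 7 timesteps).
VARIANTS = {
    "bf16": (4.96, 0.48),
    "8-bit": (3.23, 0.85),
    "4-bit": (2.30, 0.90),
}


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock time, assuming rtf = audio seconds / generation seconds."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


def size_ratio(variant: str, baseline: str = "bf16") -> float:
    """Return the variant's size as a fraction of the baseline's size."""
    return VARIANTS[variant][0] / VARIANTS[baseline][0]


if __name__ == "__main__":
    for name, (size_gb, rtf) in VARIANTS.items():
        estimate = generation_seconds(10.0, rtf)
        print(
            f"{name}: {size_gb:.2f} GB, "
            f"about {estimate:.1f} s to generate 10 s of audio, "
            f"{size_ratio(name):.0%} of the bf16 size"
        )
```

Under that assumption, the 4-bit variant is a little under half the size of bf16 and needs roughly 11 seconds to produce 10 seconds of audio.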

## Original Model
- [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2)
- Apache 2.0 License

Converted with [mlx-audio](https://github.com/Blaizzy/mlx-audio).