# VoxCPM2-ONNX

ONNX Runtime models for VoxCPM2, a 2B-parameter text-to-speech model by OpenBMB with voice-cloning capability. Run VoxCPM2 voice cloning on any CPU; no GPU needed.
## Models

| File | Size | Description |
|---|---|---|
| `audio_vae_encoder.onnx` + `.data` | ~185 MB | Audio waveform to latent features |
| `audio_vae_decoder.onnx` + `.data` | ~176 MB | Latent features to 48 kHz waveform |
| `voxcpm2_prefill.onnx` + `.data` | ~7.8 GB | Text + reference audio to KV cache + DiT hidden state |
| `voxcpm2_decode_step.onnx` + `.data` | ~8.1 GB | Single autoregressive decode step (10 CFM steps baked in) |

All models are exported with `external_data=True`; keep each `.onnx` file and its `.onnx.data` file together.
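Before loading, it can help to confirm each model/data pair is intact. A minimal sketch (the helper name and the naming convention `foo.onnx` → `foo.onnx.data` are assumptions based on the table above):

```python
from pathlib import Path

def check_external_data(model_path: str) -> bool:
    """Return True if both the .onnx file and its companion
    .onnx.data file are present in the same directory."""
    model = Path(model_path)
    data = model.with_name(model.name + ".data")  # foo.onnx -> foo.onnx.data
    return model.is_file() and data.is_file()

# Example: check_external_data("./onnx_models/voxcpm2_prefill.onnx")
```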
## Usage

```bash
# Install dependencies (quote the version specifiers so the shell
# does not interpret ">=" as a redirect)
pip install "voxcpm>=2.0.2" "torch>=2.4.0" "onnxruntime>=1.18.0" soundfile numpy tqdm huggingface_hub

# Download the ONNX models
python -c "
from huggingface_hub import snapshot_download
snapshot_download('ai4all8/VoxCPM2-ONNX', local_dir='./onnx_models', ignore_patterns=['*.md', '*.txt'])
"

# Download the VoxCPM2 PyTorch weights (needed for preprocessing)
python -c "from voxcpm import VoxCPM; VoxCPM.from_pretrained('openbmb/VoxCPM2')"

# Run inference (see the GitHub repo for the full CLI)
git clone https://github.com/ai4all8/VoxCPM2-ONNX.git
cd VoxCPM2-ONNX
python infer.py --text "Hello!" --ref_wav speaker.wav --ref_text "Reference transcript."
```
Full documentation and code: github.com/ai4all8/VoxCPM2-ONNX
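The prefill / decode-step split mirrors a standard autoregressive loop: prefill runs once over the text and reference audio, then the decode step is called repeatedly while the KV cache is threaded through. A sketch of that control flow (the function names and signatures here are illustrative stand-ins for the two ONNX sessions, not the repo's actual API):

```python
def generate_latents(prefill, decode_step, text_ids, ref_latents, max_steps=500):
    """Hypothetical driver loop: run prefill once, then iterate the
    single-step decoder until it signals completion or max_steps."""
    # Prefill produces the initial KV cache and DiT hidden state.
    kv_cache, hidden = prefill(text_ids, ref_latents)
    latents = []
    for _ in range(max_steps):
        # Each decode step emits one latent frame and updated state.
        latent, kv_cache, hidden, done = decode_step(kv_cache, hidden)
        latents.append(latent)
        if done:
            break
    return latents
```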
## Languages
- Cantonese (粵語)
- Mandarin (普通話)
- English
- Japanese (日本語)
## Performance
| Platform | RTF | Notes |
|---|---|---|
| AMD Ryzen 9 (Windows) | ~4.5x | 8 cores, ORT sequential |
| Intel Core (Linux) | ~9.5x | Single-threaded |
RTF = Real-Time Factor (lower is better; 1.0 = real-time).
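Under the convention used here, RTF is wall-clock synthesis time divided by the duration of the audio produced. A tiny helper (not from the repo) makes the arithmetic explicit:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of audio produced.
    RTF below 1.0 means faster than real-time; lower is better."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# e.g. 45 s of compute for 10 s of audio gives an RTF of 4.5
```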
## License
Apache License 2.0, same as the original VoxCPM2 model.
## Attribution
Original model: VoxCPM2 by OpenBMB (Apache 2.0, Copyright OpenBMB / Tsinghua University).