VoxCPM2-ONNX

ONNX Runtime models for VoxCPM2 — a 2B-parameter text-to-speech model by OpenBMB with voice cloning capability.

Run VoxCPM2 voice cloning on any CPU — no GPU needed.

Models

| File | Size | Description |
|---|---|---|
| `audio_vae_encoder.onnx` + `.data` | ~185 MB | Audio waveform → latent features |
| `audio_vae_decoder.onnx` + `.data` | ~176 MB | Latent features → 48 kHz waveform |
| `voxcpm2_prefill.onnx` + `.data` | ~7.8 GB | Text + reference audio → KV cache + DiT hidden state |
| `voxcpm2_decode_step.onnx` + `.data` | ~8.1 GB | Single autoregressive decode step (10 CFM steps baked in) |

All models were exported with external_data=True — keep each .onnx file together with its matching .onnx.data file.
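ONNX Runtime resolves the sidecar .onnx.data file relative to the .onnx file's own directory, so a quick existence check before creating a session catches a misplaced pair early. A minimal sketch (the helper name is illustrative, not part of this repo):

```python
from pathlib import Path

def external_data_present(onnx_path: str) -> bool:
    """Return True if both the .onnx graph and its .onnx.data weight file
    sit side by side, as required by external_data=True exports."""
    model = Path(onnx_path)
    data = model.with_name(model.name + ".data")  # e.g. audio_vae_encoder.onnx.data
    return model.is_file() and data.is_file()

# Once the pair is verified, ONNX Runtime locates the .data file automatically:
#   session = onnxruntime.InferenceSession(str(model))
```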

Usage

# Install
pip install "voxcpm>=2.0.2" "torch>=2.4.0" "onnxruntime>=1.18.0" soundfile numpy tqdm huggingface_hub

# Download models
python -c "
from huggingface_hub import snapshot_download
snapshot_download('ai4all8/VoxCPM2-ONNX', local_dir='./onnx_models', ignore_patterns=['*.md', '*.txt'])
"

# Download VoxCPM2 PyTorch weights (needed for preprocessing)
python -c "from voxcpm import VoxCPM; VoxCPM.from_pretrained('openbmb/VoxCPM2')"

# Run inference (see GitHub repo for full CLI)
git clone https://github.com/ai4all8/VoxCPM2-ONNX.git
cd VoxCPM2-ONNX
python infer.py --text "Hello!" --ref_wav speaker.wav --ref_text "Reference transcript."

Full documentation and code: github.com/ai4all8/VoxCPM2-ONNX

Languages

  • Cantonese (粵語)
  • Mandarin (普通話)
  • English
  • Japanese (日本語)

Performance

| Platform | RTF | Notes |
|---|---|---|
| AMD Ryzen 9 (Windows) | ~4.5× | 8 cores, ORT sequential |
| Intel Core (Linux) | ~9.5× | Single-threaded |

RTF = Real-Time Factor: wall-clock synthesis time divided by the duration of the generated audio (lower is better; 1.0 = real time).
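The RTF figures above follow that definition; a small sketch of the computation:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the audio produced.
    An RTF below 1.0 means synthesis runs faster than real time."""
    return synthesis_seconds / audio_seconds

# e.g. spending 45 s of CPU time to synthesize a 10 s clip gives an RTF of 4.5
print(real_time_factor(45.0, 10.0))  # → 4.5
```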

License

Apache License 2.0, same as the original VoxCPM2 model.

Attribution

Original model: VoxCPM2 by OpenBMB (Apache 2.0, Copyright OpenBMB / Tsinghua University).
