# VoxCPM2-ONNX

ONNX Runtime models for VoxCPM2, a 2B-parameter text-to-speech model by OpenBMB with voice-cloning capability. Run VoxCPM2 voice cloning on any CPU; no GPU needed.
## Models

| File | Size | Description |
|---|---|---|
| `audio_vae_encoder.onnx` + `.data` | ~185 MB | Audio waveform to latent features |
| `audio_vae_decoder.onnx` + `.data` | ~176 MB | Latent features to 48 kHz waveform |
| `voxcpm2_prefill.onnx` + `.data` | ~7.8 GB | Text + reference audio to KV cache + DiT hidden state |
| `voxcpm2_decode_step.onnx` + `.data` | ~8.1 GB | Single autoregressive decode step (10 CFM steps baked in) |

All models are exported with `external_data=True`; keep each `.onnx` file and its `.onnx.data` file together.
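Before loading, it can help to confirm each model/data pair is intact. A minimal sketch (the helper name and the naming convention `foo.onnx` → `foo.onnx.data` are assumptions based on the table above):

```python
from pathlib import Path

def check_external_data(model_path: str) -> bool:
    """Return True if both the .onnx file and its companion
    .onnx.data file are present in the same directory."""
    model = Path(model_path)
    data = model.with_name(model.name + ".data")  # foo.onnx -> foo.onnx.data
    return model.is_file() and data.is_file()

# Example: check_external_data("./onnx_models/voxcpm2_prefill.onnx")
```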
## Usage

```bash
# Install dependencies (quote the version specifiers so the shell
# does not interpret ">=" as a redirect)
pip install "voxcpm>=2.0.2" "torch>=2.4.0" "onnxruntime>=1.18.0" soundfile numpy tqdm huggingface_hub

# Download the ONNX models
python -c "
from huggingface_hub import snapshot_download
snapshot_download('ai4all8/VoxCPM2-ONNX', local_dir='./onnx_models', ignore_patterns=['*.md', '*.txt'])
"

# Download the VoxCPM2 PyTorch weights (needed for preprocessing)
python -c "from voxcpm import VoxCPM; VoxCPM.from_pretrained('openbmb/VoxCPM2')"

# Run inference (see the GitHub repo for the full CLI)
git clone https://github.com/ai4all8/VoxCPM2-ONNX.git
cd VoxCPM2-ONNX
python infer.py --text "Hello!" --ref_wav speaker.wav --ref_text "Reference transcript."
```
Full documentation and code: github.com/ai4all8/VoxCPM2-ONNX
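The prefill / decode-step split mirrors a standard autoregressive loop: prefill runs once over the text and reference audio, then the decode step is called repeatedly while the KV cache is threaded through. A sketch of that control flow (the function names and signatures here are illustrative stand-ins for the two ONNX sessions, not the repo's actual API):

```python
def generate_latents(prefill, decode_step, text_ids, ref_latents, max_steps=500):
    """Hypothetical driver loop: run prefill once, then iterate the
    single-step decoder until it signals completion or max_steps."""
    # Prefill produces the initial KV cache and DiT hidden state.
    kv_cache, hidden = prefill(text_ids, ref_latents)
    latents = []
    for _ in range(max_steps):
        # Each decode step emits one latent frame and updated state.
        latent, kv_cache, hidden, done = decode_step(kv_cache, hidden)
        latents.append(latent)
        if done:
            break
    return latents
```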
## Languages
- Cantonese (粵語)
- Mandarin (普通話)
- English
- Japanese (日本語)
## Performance
| Platform | RTF | Notes |
|---|---|---|
| AMD Ryzen 9 (Windows) | ~4.5x | 8 cores, ORT sequential |
| Intel Core (Linux) | ~9.5x | Single-threaded |
RTF = Real-Time Factor (lower is better; 1.0 = real-time).
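Under the convention used here, RTF is wall-clock synthesis time divided by the duration of the audio produced. A tiny helper (not from the repo) makes the arithmetic explicit:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of audio produced.
    RTF below 1.0 means faster than real-time; lower is better."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# e.g. 45 s of compute for 10 s of audio gives an RTF of 4.5
```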
## License
Apache License 2.0, same as the original VoxCPM2 model.
## Attribution
Original model: VoxCPM2 by OpenBMB (Apache 2.0, Copyright OpenBMB / Tsinghua University).