VoxCPM2 — MLX int8 (group-quantised)

8-bit MLX-compatible quantization for Apple Silicon.

MLX port of openbmb/VoxCPM2 — a 2B-parameter multilingual diffusion-autoregressive TTS model with 48 kHz studio-quality output, voice cloning, and instruction-driven voice design.

Part of soniqo.audio — an on-device speech toolkit for Apple Silicon. Consumed by the open-source speech-swift library (module VoxCPM2TTS).

Bundle size: 2.95 GB

Use cases

Speech generation — 48 kHz TTS with voice design and multilingual support.
Voice cloning — reference-audio cloning + ultimate cloning (audio + transcript).
CLI reference — speech speak --engine voxcpm2 ... flags.
Getting started — install speech-swift on macOS / iOS.

Variants

Variant	Size	Notes
bf16	~5.0 GB	Reference quality, no Linear quantization.
int8	~3.0 GB	8-bit group quantization (group size 64). Mean relative L2 error 0.53 % vs bf16.
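
The int8 size can be sanity-checked against the format described in the Quantization section: 8-bit codes, group size 64, and one float16 scale plus one float16 bias per group. The sketch below is back-of-the-envelope arithmetic in plain Swift. The function name effectiveBitsPerWeight is illustrative and is not part of speech-swift or MLX.

```swift
/// Effective storage cost per weight for group quantization:
/// the code bits plus the per-group metadata amortised over the group.
/// (Hypothetical helper, for illustration only.)
func effectiveBitsPerWeight(bits: Int, groupSize: Int, metadataBitsPerGroup: Int) -> Double {
    Double(bits) + Double(metadataBitsPerGroup) / Double(groupSize)
}

// 8-bit codes, groups of 64, one float16 scale + one float16 bias per group.
let int8Bits = effectiveBitsPerWeight(bits: 8, groupSize: 64, metadataBitsPerGroup: 2 * 16)

// Ratio if every tensor were quantized, relative to 16-bit bf16 storage.
let idealRatio = int8Bits / 16.0

// Ratio implied by the approximate sizes in the table above.
let observedRatio = 3.0 / 5.0

print("effective bits per weight:", int8Bits)
print("ideal int8/bf16 ratio:", idealRatio)
print("observed int8/bf16 ratio:", observedRatio)
```

The ideal ratio (about 0.53) is lower than the observed ratio of roughly 0.6, which is consistent with some tensors staying in bfloat16 rather than being quantized.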

Capabilities

30 languages including English, Chinese, Indonesian, Japanese, Korean
48 kHz output
Zero-shot synthesis — generate speech from text alone
Voice cloning — clone a target speaker from a single reference clip
Voice design — natural-language style control (e.g. "young female voice, warm and gentle")
Ultimate cloning — reference audio + transcript for prosody-preserving cloning
Streaming generation — patch-level decoding for low-latency synthesis

Quantization

Format: MLX QuantizedLinear, 8 bits per element, group size 64, per-group scales and biases stored as float16.
What is quantized: All Linear layers inside the LM backbones (base_lm, residual_lm), the DiT estimator decoder, feat_encoder.encoder, and the top-level projection heads (enc_to_lm_proj, lm_to_dit_proj, res_to_dit_proj, fusion_concat_proj, stop_proj, stop_head, fsq_layer.*, the time/delta-time MLPs).
What stays bfloat16: All audio_vae.* weights, RMSNorm / LayerNorm gain tensors, RoPE lookup tables, Snake alpha, embedding tables, and 1-D parameters.
Round-trip fidelity vs bf16: mean relative L2 error 0.53 %, worst-layer relative L2 0.78 % (stop_head).
Size: about 40 % smaller than the bf16 variant (~3.0 GB vs ~5.0 GB), with negligible quality impact in practice.
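
To make the format and the fidelity metric above concrete, here is a minimal plain-Swift sketch of 8-bit group-wise affine quantization with group size 64, followed by a relative L2 round-trip check. It is an illustration under stated assumptions, not the MLX kernel or the conversion script used for this bundle: scales and biases are kept as Float rather than float16, the min/max affine scheme is assumed, and the names QuantizedGroup, quantize, dequantize, and relativeL2 are hypothetical.

```swift
/// One quantized group: 8-bit codes plus a per-group scale and bias.
/// (Illustrative type; scale and bias are Float here, float16 in the bundle.)
struct QuantizedGroup {
    let codes: [UInt8]
    let scale: Float
    let bias: Float
}

/// Affine-quantize `weights` to 8 bits in groups of `groupSize`.
func quantize(_ weights: [Float], groupSize: Int = 64) -> [QuantizedGroup] {
    precondition(groupSize > 0 && weights.count % groupSize == 0,
                 "weight count must be a multiple of the group size")
    var groups: [QuantizedGroup] = []
    for start in stride(from: 0, to: weights.count, by: groupSize) {
        let group = weights[start..<(start + groupSize)]
        let lo = group.min()!
        let hi = group.max()!
        // 255 steps span [lo, hi]; fall back to scale 1 for a constant group.
        let scale: Float = hi > lo ? (hi - lo) / 255 : 1
        let codes = group.map { w -> UInt8 in
            let q = ((w - lo) / scale).rounded()
            return UInt8(min(max(q, 0), 255))
        }
        groups.append(QuantizedGroup(codes: codes, scale: scale, bias: lo))
    }
    return groups
}

/// Reconstruct approximate weights: code * scale + bias.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { g in g.codes.map { Float($0) * g.scale + g.bias } }
}

/// Relative L2 error: ||reference - approx|| / ||reference||.
func relativeL2(_ reference: [Float], _ approx: [Float]) -> Float {
    precondition(reference.count == approx.count)
    var num: Float = 0
    var den: Float = 0
    for (r, a) in zip(reference, approx) {
        num += (r - a) * (r - a)
        den += r * r
    }
    return den > 0 ? (num / den).squareRoot() : 0
}

// Synthetic weights (two groups of 64), not taken from the model.
let weights = (0..<128).map { Float($0) * 0.01 - 0.64 }
let restored = dequantize(quantize(weights))
print("relative L2 error:", relativeL2(weights, restored))
```

The synthetic ramp only demonstrates the mechanics. The 0.53 % and 0.78 % figures quoted above come from the actual model weights and are not reproduced by this toy input.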

Usage with speech-swift

This bundle is consumed by soniqo/speech-swift's VoxCPM2TTS Swift module.

import VoxCPM2TTS

let model = try await VoxCPM2TTSModel.fromPretrained(
    modelId: "aufklarer/VoxCPM2-MLX-int8"
)
let audio = try await model.generate(text: "Hello from VoxCPM2.", language: "english")

Or via the CLI:

speech speak "Hello from VoxCPM2." --engine voxcpm2 --voxcpm2-variant int8 -o hi.wav

Source

This bundle is converted from the upstream PyTorch weights at openbmb/VoxCPM2.

License

Apache 2.0 — inherited from the upstream openbmb/VoxCPM2 model.

Responsible use

Voice cloning capability is included. Users are responsible for obtaining consent for any voice that is cloned and for not using the model to impersonate individuals without their permission, generate disinformation, or commit fraud.
