VoxCPM2 — LiteRT (INT8)

2 B-parameter multilingual TTS with voice cloning and voice design. 48 kHz output.

Part of the soniqo.audio speech toolkit — an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

LiteRT export of openbmb/VoxCPM2 — a 2 B-parameter diffusion-autoregressive TTS with 48 kHz studio-quality output, reference-audio voice cloning, and natural-language voice design. Designed for server-side synthesis workers and on-device TTS through the speech-core TTSInterface.

Why split graphs

VoxCPM2 is not a single feed-forward model. The runtime loop is:

text + optional instruction ──► text-prefill
                                      │
                                      ▼
                              repeated token-step
                                      │
                                      ▼
                              audio-decoder ──► 48 kHz PCM

The host owns the loop and the KV cache; LiteRT owns the static tensor programs. The same split is used for Parakeet and Nemotron in this collection: LiteRT for the math, the host for the control flow.
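
The host-side control flow described above can be sketched roughly as follows. This is an illustration only: `synthesize`, the three callables, and the toy stand-ins are hypothetical names, and the real graphs take and return tensors whose exact signatures are not described in this card. Whether the output of the stopping step is kept is also an assumption.

```python
from typing import Any, Callable, List, Tuple


def synthesize(
    prefill: Callable[[List[int]], Any],
    token_step: Callable[[Any], Tuple[Any, bool, Any]],
    decode: Callable[[List[Any]], List[float]],
    text_ids: List[int],
    max_steps: int = 30,
) -> List[float]:
    """Prefill once, step until a stop signal or the step ceiling, then decode."""
    kv_cache = prefill(text_ids)        # single call into the prefill graph
    acoustic_tokens: List[Any] = []
    for _ in range(max_steps):          # the host, not the graph, owns this loop
        token, stop, kv_cache = token_step(kv_cache)
        if stop:                        # assumption: the stopping step emits no audio
            break
        acoustic_tokens.append(token)
    return decode(acoustic_tokens)      # single call into the audio decoder


# Toy stand-ins so the sketch runs without any model files.
def toy_prefill(text_ids: List[int]) -> dict:
    return {"steps": 0}


def toy_token_step(kv_cache: dict) -> Tuple[float, bool, dict]:
    steps = kv_cache["steps"] + 1
    return float(steps), steps >= 18, {"steps": steps}


def toy_decode(tokens: List[float]) -> List[float]:
    return [t / 100.0 for t in tokens]


if __name__ == "__main__":
    audio = synthesize(toy_prefill, toy_token_step, toy_decode, [1, 2, 3])
    print(len(audio))
```

In a real worker each callable would wrap one of the `.tflite` graphs, while the KV cache stays in host memory between calls.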

Files

File	Size	Description
`voxcpm2-text-prefill.tflite`	7.7 GB	FP32 text + instruction prefill (MiniCPM-4 KV-cache producer)
`voxcpm2-token-step.tflite`	2.0 GB	INT8 weight-only autoregressive step (MiniCPM-4 + residual LM)
`voxcpm2-audio-encoder.tflite`	184 MB	FP32 reference-audio encoder (16 kHz → conditioning)
`voxcpm2-audio-decoder.tflite`	175 MB	FP32 AudioVAE decoder (acoustic tokens → 48 kHz PCM)
`tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json`	—	HF tokenizer bundle
`generation_config.json` / `tokenization_voxcpm2.py`	—	Generation defaults + tokenizer module
`config.json`	—	Tensor shapes, sample rates, files manifest
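
Before loading anything, a host can confirm that a downloaded bundle is complete. The sketch below hard-codes the file names from the table above; the helper name `missing_files` is hypothetical, and reading the manifest from `config.json` instead would require knowing its schema, which this card does not describe.

```python
from pathlib import Path

# File names copied from the table above.
EXPECTED_FILES = [
    "voxcpm2-text-prefill.tflite",
    "voxcpm2-token-step.tflite",
    "voxcpm2-audio-encoder.tflite",
    "voxcpm2-audio-decoder.tflite",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "generation_config.json",
    "tokenization_voxcpm2.py",
    "config.json",
]


def missing_files(bundle_dir):
    """Return the expected file names that are not present in bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    print(missing_files("."))
```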

Quantization

token-step: INT8 weight-only (the only graph that runs in the inner generation loop — quantizing here is the biggest win).
text-prefill / audio-encoder / audio-decoder: kept in FP32. Quantizing prefill caused semantic drift in the roundtrip test, and running the AudioVAE decoder in INT8 risks audible artifacts.
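
For readers unfamiliar with the term, "weight-only" means the stored weights are INT8 while activations stay in floating point. The sketch below shows one common form, symmetric per-row quantization, in plain Python. It illustrates the general technique only; the actual scheme used by the export (granularity, zero points, rounding) is not stated in this card, and all function names are hypothetical.

```python
def quantize_row(weights):
    """Symmetric per-row INT8: return (list of ints in [-127, 127], float scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from INT8 values and their scale."""
    return [v * scale for v in q]


def matvec_weight_only(q_rows, scales, x):
    """Matrix-vector product with INT8 weights and float activations."""
    return [
        scale * sum(qv * xv for qv, xv in zip(q, x))
        for q, scale in zip(q_rows, scales)
    ]


if __name__ == "__main__":
    row = [0.5, -1.0, 0.25]
    q, scale = quantize_row(row)
    print(q, scale, dequantize_row(q, scale))
```

The per-weight rounding error is bounded by half the scale. This makes weight-only INT8 a good fit for the graph that runs on every generation step, where weight size dominates the cost.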

Smoke result

30-step English roundtrip ("hello world from soniqo dot audio", instruction "clear neutral delivery"):

Stop token fired naturally at step 18 (decoder halted before the 30-step ceiling)
138 240 samples ÷ 48 kHz (mono) = 2.88 s
RMS 0.033, peak 0.44: no clipping, and the signal is clearly not silence
Output written to voxcpm2-litert-hello-world.wav
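
The duration figure is the sample count divided by the sample rate, and the level figures are standard RMS and peak measurements. A minimal sketch for reproducing such checks, assuming the samples are floats normalised to [-1.0, 1.0] (the helper names are hypothetical):

```python
import math


def duration_seconds(num_samples, sample_rate):
    """Mono duration: number of samples divided by samples per second."""
    return num_samples / sample_rate


def rms_and_peak(samples):
    """Return (RMS, peak absolute value) for float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0, 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms, peak


if __name__ == "__main__":
    print(duration_seconds(138240, 48000))
    print(rms_and_peak([0.5, -0.5]))
```

A peak below 1.0 indicates no clipping, and an RMS well above zero indicates the output is not silence.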

Modes

Mirrors the speech-swift VoxCPM2TTS mode matrix:

Mode	Inputs
Zero-shot	text
Voice design	text + style instruction
Controllable cloning	text + reference audio
Ultimate cloning	text + reference audio + prompt audio + prompt text
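
The table can be read as a dispatch on which optional inputs are supplied. The sketch below encodes exactly the four rows above; `select_mode` and its parameter names are hypothetical and are not the speech-swift or speech-core API. Input combinations the table does not list are rejected rather than guessed.

```python
def select_mode(instruction=None, reference_audio=None,
                prompt_audio=None, prompt_text=None):
    """Map the optional inputs that are present to a mode name from the table."""
    present = (
        instruction is not None,
        reference_audio is not None,
        prompt_audio is not None,
        prompt_text is not None,
    )
    # Keys: (instruction, reference audio, prompt audio, prompt text)
    table = {
        (False, False, False, False): "zero-shot",
        (True, False, False, False): "voice design",
        (False, True, False, False): "controllable cloning",
        (False, True, True, True): "ultimate cloning",
    }
    if present not in table:
        raise ValueError("input combination is not listed in the mode table")
    return table[present]


if __name__ == "__main__":
    print(select_mode())
    print(select_mode(instruction="clear neutral delivery"))
```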

For Apple Silicon, prefer the MLX bundles (bf16 / int8 / int4) consumed by speech-swift.

Source

Exported from openbmb/VoxCPM2 via a graph-split LiteRT conversion, run in a pinned Docker environment because LiteRT / Torch / TorchAO versions are tightly coupled.

Responsible use

Voice cloning is included. Users are responsible for obtaining consent for any voice that is cloned and for not using the model to impersonate individuals without permission, generate disinformation, or commit fraud.

Ecosystem

soniqo.audio — use-case explorer (transcription, voice cloning, live ASR, voice agents).
speech-core — C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
speech-swift — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
speech-android — Android SDK consuming on-device LiteRT bundles.

License

This bundle inherits the upstream model license (apache-2.0). See the base model repository, openbmb/VoxCPM2, for the full terms.