VoxCPM2 LiteRT (INT8)

LiteRT / TensorFlow Lite port of openbmb/VoxCPM2 — a 2B-parameter multilingual diffusion-autoregressive TTS model with 48 kHz studio-quality output, voice cloning, and instruction-driven voice design.

This bundle ships the model as four separate LiteRT graphs plus a manifest. The on-device worker is expected to orchestrate the loop:

text_prefill ─► token_step (×N) ─► audio_decode

Reference audio is encoded once via audio_encoder. The K/V cache for the LM and residual decoder is owned by the host worker (not mutated inside the graph), which lets the runtime own retry / idempotency semantics.
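
The loop above can be sketched on the host side as follows. This is a minimal illustration, not the SDK's implementation: the three callables are hypothetical wrappers around the LiteRT graphs, the `state` object stands in for the hidden states, conditioning, K/V caches and position id, and the way a boolean stop decision is derived from the stop logits is left to the wrapper (an assumption, since the bundle does not specify it here).

```python
from typing import Any, Callable, List, Tuple

def synthesize(
    text_prefill: Callable[[List[int]], Any],
    token_step: Callable[[Any], Tuple[Any, bool, Any]],
    audio_decode: Callable[[List[Any]], List[float]],
    token_ids: List[int],
    max_generated_tokens: int = 2048,
) -> List[float]:
    """Run prefill once, step until stop or the token budget, then decode."""
    # Prefill produces the initial host-owned state (hidden states + K/V caches).
    state = text_prefill(token_ids)
    features: List[Any] = []
    for _ in range(max_generated_tokens):
        # Each step returns a NEW state; the previous one is left untouched.
        feature, stop, state = token_step(state)
        features.append(feature)
        if stop:
            break
    return audio_decode(features)

# Stub graphs, purely for demonstration.
def fake_prefill(ids: List[int]) -> dict:
    return {"pos": len(ids)}

def fake_step(state: dict) -> Tuple[List[float], bool, dict]:
    new_state = {"pos": state["pos"] + 1}
    return [0.0], new_state["pos"] >= 5, new_state

def fake_decode(features: List[Any]) -> List[float]:
    return [0.0] * len(features)

pcm = synthesize(fake_prefill, fake_step, fake_decode, [1, 2, 3])
```

Keeping the graphs behind plain callables means the same loop can be exercised with stubs in tests and with real interpreter wrappers on device.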

Part of soniqo.audio — an on-device speech toolkit. Consumed by the Android SDK at speech-android.

Status: experimental. The token_step graph depends on litert_torch static K/V-cache lowering; integrators should validate numerical parity end-to-end before relying on this bundle in production.

Capabilities

30 languages including English, Chinese, Indonesian, Japanese, Korean
48 kHz output
Zero-shot synthesis — generate speech from text alone
Voice cloning — clone a target speaker from a single reference clip
Voice design — natural-language style control (e.g. "young female voice, warm and gentle")
Ultimate cloning — reference audio + transcript for prosody-preserving cloning

Files

| File | Variant | Role |
| --- | --- | --- |
| `voxcpm2-text-prefill.tflite` | INT8 weights / FP32 activations | Encode text + (optional) reference-audio prefix into LM hidden states, residual hidden, prefix feature conditioning, and the initial K/V caches. |
| `voxcpm2-token-step.tflite` | INT8 weights / FP32 activations | One AR step. Takes current LM / residual hidden, conditioning, K/V cache and position id. Emits the next predicted feature, stop logits, updated hidden states, and updated K/V cache. |
| `voxcpm2-audio-encoder.tflite` | FP32 | Encode a reference clip (16 kHz PCM) into the patch features that condition the prefill. |
| `voxcpm2-audio-decoder.tflite` | FP32 | Decode latent → 48 kHz PCM via the upstream AudioVAE. |
| `config.json` | n/a | Manifest: tensor signatures, sample rates, default CFG / step counts, file mapping. |
| `tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json` / `generation_config.json` | n/a | HF tokenizer + generation defaults. |
| `tokenization_voxcpm2.py` | n/a | Upstream tokenizer source (kept for parity with the HF model). |

The conv-heavy encoder and decoder are intentionally kept in FP32: the dynamic_wi8_afp32 recipe does not lower the conv kernels that VoxCPM2's AudioVAE relies on, and quantising the vocoder has historically risked audible artifacts.

Default decoding parameters

| Parameter | Default |
| --- | --- |
| `max_text_tokens` (context) | 512 |
| `max_generated_tokens` | 2048 |
| `inference_timesteps` (CFM) | 10 |
| `cfg_value` | 2.0 |
| Sample rate (output) | 48 000 Hz |
| Sample rate (audio conditioning) | 16 000 Hz |

These mirror the host-side defaults exposed in config.json; runtimes are free to override them.
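
One way a runtime might hold these defaults and apply per-request overrides is sketched below. The first four field names come from the table; the two sample-rate field names are illustrative labels, and the actual key names inside config.json may differ.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DecodingParams:
    # Defaults copied from the table above.
    max_text_tokens: int = 512
    max_generated_tokens: int = 2048
    inference_timesteps: int = 10
    cfg_value: float = 2.0
    # Hypothetical field names for the two sample rates.
    output_sample_rate_hz: int = 48_000
    conditioning_sample_rate_hz: int = 16_000

    def validate(self) -> None:
        """Reject obviously invalid values before they reach the graphs."""
        if self.max_text_tokens <= 0 or self.max_generated_tokens <= 0:
            raise ValueError("token limits must be positive")
        if self.inference_timesteps <= 0:
            raise ValueError("inference_timesteps must be positive")

DEFAULTS = DecodingParams()

# Per-request override: fewer CFM timesteps and lighter guidance.
fast = replace(DEFAULTS, inference_timesteps=4, cfg_value=1.5)
fast.validate()
```

A frozen dataclass keeps the shared defaults immutable, so an override for one request cannot leak into another.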

Token-step cache contract

The token-step graph takes and returns the LM and residual K/V cache as explicit inputs/outputs. Cache layout:

[2, layers, batch, kv_heads, max_cache_length, head_dim]

Axis 0: [K, V]
Axis 4 is sized to max_text_tokens + max_generated_tokens and pre-allocated by the worker.
The graph does not mutate the cache buffers in place — it produces updated tensors which the worker copies / swaps.
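
As a rough sizing aid, the layout above can be turned into a shape and a byte count. With the default limits, axis 4 is 512 + 2048 = 2560. The `layers`, `kv_heads` and `head_dim` values below are placeholders, not the model's real dimensions (those should be read from the tensor signatures in config.json), and the 4-byte element size assumes the cache is stored as FP32.

```python
from math import prod
from typing import Tuple

def kv_cache_shape(
    layers: int,
    kv_heads: int,
    head_dim: int,
    batch: int = 1,
    max_text_tokens: int = 512,
    max_generated_tokens: int = 2048,
) -> Tuple[int, int, int, int, int, int]:
    """Return [2, layers, batch, kv_heads, max_cache_length, head_dim]."""
    max_cache_length = max_text_tokens + max_generated_tokens
    return (2, layers, batch, kv_heads, max_cache_length, head_dim)

def kv_cache_bytes(shape: Tuple[int, ...], bytes_per_element: int = 4) -> int:
    """Size of one pre-allocated cache buffer (FP32 assumed by default)."""
    return prod(shape) * bytes_per_element

# Placeholder dimensions, for illustration only.
shape = kv_cache_shape(layers=4, kv_heads=2, head_dim=64)
size = kv_cache_bytes(shape)
```

Because the graph returns updated tensors rather than writing in place, a worker that swaps buffers would typically budget for two such allocations per cache (current and next).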

This contract is what makes parallel decoding, mid-generation cancellation, and deterministic replay possible from the C++ side.
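
The retry and cancellation side of this contract can be illustrated with a small host-side session object. It is a language-neutral sketch written in Python rather than the worker's C++, and `DecodeSession`, its methods, and the toy step function are all hypothetical. The point it shows is that a failed step leaves the committed state untouched, so retrying with the same input is safe.

```python
from typing import Any, Callable, List, Tuple

class DecodeSession:
    """Holds the committed state; a step's result is adopted only on success."""

    def __init__(self, step: Callable[[Any], Tuple[Any, Any]], initial_state: Any):
        self._step = step
        self.state = initial_state
        self.features: List[Any] = []
        self.cancelled = False

    def cancel(self) -> None:
        self.cancelled = True

    def advance(self, max_attempts: int = 2) -> bool:
        """Run one step with retries. Returns False if the session was cancelled."""
        if self.cancelled:
            return False
        last_error: Exception = RuntimeError("no attempts made")
        for _ in range(max_attempts):
            try:
                feature, new_state = self._step(self.state)
            except RuntimeError as err:
                last_error = err
                continue  # self.state was never modified, so a retry is safe
            self.state = new_state  # commit by swapping in the returned state
            self.features.append(feature)
            return True
        raise last_error

# Toy step that fails once, then succeeds with the same input state.
calls = {"n": 0}

def flaky_step(state: int) -> Tuple[int, int]:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated delegate failure")
    return state * 2, state + 1

session = DecodeSession(flaky_step, initial_state=3)
session.advance()
```

Deterministic replay follows the same idea: keeping a reference to an earlier state and stepping from it again reproduces the same outputs, provided the step itself is deterministic.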

Source

Converted from the upstream PyTorch weights at openbmb/VoxCPM2 using litert_torch + ai_edge_quantizer's dynamic_wi8_afp32 recipe.

License

Apache 2.0 (inherited from upstream openbmb/VoxCPM2).

Responsible use

Voice cloning capability is included. Users are responsible for obtaining consent for any voice that is cloned and for not using the model to impersonate individuals without their permission, generate disinformation, or commit fraud.
