VoxCPM2 LiteRT (INT8)

LiteRT / TensorFlow Lite port of openbmb/VoxCPM2, a 2B-parameter multilingual diffusion-autoregressive TTS model with 48 kHz studio-quality output, voice cloning, and instruction-driven voice design.

This bundle ships the model as four separate LiteRT graphs plus a manifest. The on-device worker is expected to orchestrate the loop:

text_prefill ─► token_step (×N) ─► audio_decode

Reference audio is encoded once via audio_encoder. The K/V cache for the LM and residual decoder is owned by the host worker (not mutated inside the graph), which lets the runtime own retry / idempotency semantics.
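
The loop above can be sketched on the host side as follows. This is a minimal illustration, not the SDK's implementation: the three graphs are passed in as plain callables, and the state keys (`position`, `feature`, `stop_logits`) and the `is_stop` predicate are hypothetical names chosen for the example. The real tensor signatures are in config.json.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def run_generation(
    prefill: Callable[[List[int]], State],
    token_step: Callable[[State, int], State],
    audio_decode: Callable[[List[Any]], Any],
    is_stop: Callable[[Any], bool],
    text_ids: List[int],
    max_generated_tokens: int = 2048,
) -> Any:
    """Prefill once, run up to N autoregressive steps, then decode once."""
    # Assumption: prefill returns hidden states, conditioning, the initial
    # K/V caches, and the first position id to generate at.
    state = prefill(text_ids)
    position = state["position"]
    features: List[Any] = []

    for _ in range(max_generated_tokens):
        # The step returns a new state; the previous one is not modified,
        # so the host decides when to adopt the result.
        new_state = token_step(state, position)
        features.append(new_state["feature"])
        state = new_state
        position += 1
        # Assumption: how stop logits map to a stop decision is host policy.
        if is_stop(new_state["stop_logits"]):
            break

    return audio_decode(features)
```

Passing the graphs in as callables keeps the loop independent of the interpreter API, so the same control flow can be exercised with stubs before wiring in the real runtime.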

Part of soniqo.audio, an on-device speech toolkit. Consumed by the Android SDK at speech-android.

Status: experimental. The token_step graph depends on litert_torch static K/V-cache lowering; integrators should validate numerical parity end-to-end before relying on this bundle in production.

Capabilities

  • 30 languages including English, Chinese, Indonesian, Japanese, Korean
  • 48 kHz output
  • Zero-shot synthesis: generate speech from text alone
  • Voice cloning: clone a target speaker from a single reference clip
  • Voice design: natural-language style control (e.g. "young female voice, warm and gentle")
  • Ultimate cloning: reference audio + transcript for prosody-preserving cloning

Files

| File | Variant | Role |
|---|---|---|
| voxcpm2-text-prefill.tflite | INT8 weights / FP32 activations | Encode text + (optional) reference-audio prefix into LM hidden states, residual hidden, prefix feature conditioning, and the initial K/V caches. |
| voxcpm2-token-step.tflite | INT8 weights / FP32 activations | One AR step. Takes current LM / residual hidden, conditioning, K/V cache and position id. Emits the next predicted feature, stop logits, updated hidden states, and updated K/V cache. |
| voxcpm2-audio-encoder.tflite | FP32 | Encode a reference clip (16 kHz PCM) into the patch features that condition the prefill. |
| voxcpm2-audio-decoder.tflite | FP32 | Decode latents to 48 kHz PCM via the upstream AudioVAE. |
| config.json | n/a | Manifest: tensor signatures, sample rates, default CFG / step counts, file mapping. |
| tokenizer.json / tokenizer_config.json / special_tokens_map.json / generation_config.json | n/a | HF tokenizer + generation defaults. |
| tokenization_voxcpm2.py | n/a | Upstream tokenizer source (kept for parity with the HF model). |

The conv-heavy encoder and decoder are intentionally kept FP32: the dynamic_wi8_afp32 recipe does not lower the conv kernels that VoxCPM2's AudioVAE relies on, and quantising the vocoder risks audible artifacts.
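
Before loading any graph, a worker may want to confirm the bundle is complete. The sketch below checks a directory for the files listed above; which files count as required is an assumption made for the example (the four graphs, the manifest, and tokenizer.json), and `missing_bundle_files` is a hypothetical helper, not part of any SDK.

```python
from pathlib import Path
from typing import List

# Assumption: the four graphs, the manifest, and the tokenizer are the
# minimum needed to run; adjust to match the file mapping in config.json.
REQUIRED_FILES = (
    "voxcpm2-text-prefill.tflite",
    "voxcpm2-token-step.tflite",
    "voxcpm2-audio-encoder.tflite",
    "voxcpm2-audio-decoder.tflite",
    "config.json",
    "tokenizer.json",
)


def missing_bundle_files(bundle_dir: str) -> List[str]:
    """Return the required file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means every expected file is present; anything else can be reported to the caller before an interpreter is created.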

Default decoding parameters

| Parameter | Default |
|---|---|
| max_text_tokens (context) | 512 |
| max_generated_tokens | 2048 |
| inference_timesteps (CFM) | 10 |
| cfg_value | 2.0 |
| Sample rate (output) | 48 000 Hz |
| Sample rate (audio conditioning) | 16 000 Hz |

These mirror the host-side defaults exposed in config.json; runtimes are free to override them.
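
One way a runtime might hold these defaults and override a subset is sketched below. The values come from the table above; the class name `DecodingParams` and the two sample-rate field names are hypothetical, and the actual key names in config.json may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecodingParams:
    """Host-side decoding defaults, copied from the table above."""
    max_text_tokens: int = 512
    max_generated_tokens: int = 2048
    inference_timesteps: int = 10   # CFM steps per generated feature
    cfg_value: float = 2.0
    output_sample_rate_hz: int = 48_000        # hypothetical field name
    conditioning_sample_rate_hz: int = 16_000  # hypothetical field name


defaults = DecodingParams()
# Override only what differs; everything else keeps its default.
custom = replace(defaults, inference_timesteps=6, cfg_value=1.5)
```

Keeping the parameters in a frozen dataclass means an override produces a new value instead of modifying shared defaults.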

Token-step cache contract

The token-step graph takes and returns the LM and residual K/V cache as explicit inputs/outputs. Cache layout:

[2, layers, batch, kv_heads, max_cache_length, head_dim]
  • Axis 0: [K, V]
  • Axis 4 is sized to max_text_tokens + max_generated_tokens and pre-allocated by the worker.
  • The graph does not mutate the cache buffers in place; it produces updated tensors which the worker copies / swaps.
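
To size the pre-allocated buffers, the layout above can be turned into a concrete shape and byte count. The sketch below uses the default token limits from this card; the `layers`, `kv_heads`, and `head_dim` values in the usage line are placeholders, not the model's real dimensions (read those from config.json), and the 4-byte element size assumes the cache is stored as FP32.

```python
from math import prod
from typing import Tuple


def kv_cache_shape(
    layers: int,
    kv_heads: int,
    head_dim: int,
    max_text_tokens: int = 512,
    max_generated_tokens: int = 2048,
    batch: int = 1,
) -> Tuple[int, int, int, int, int, int]:
    """Shape in the order [2, layers, batch, kv_heads, max_cache_length, head_dim]."""
    max_cache_length = max_text_tokens + max_generated_tokens  # axis 4
    return (2, layers, batch, kv_heads, max_cache_length, head_dim)


def kv_cache_bytes(shape: Tuple[int, ...], bytes_per_element: int = 4) -> int:
    """Buffer size in bytes; 4 bytes per element assumes an FP32 cache."""
    return prod(shape) * bytes_per_element


# Placeholder dimensions for illustration only.
shape = kv_cache_shape(layers=24, kv_heads=2, head_dim=128)
size_mib = kv_cache_bytes(shape) / (1024 * 1024)
```

Because the graph returns updated tensors instead of writing in place, a worker that swaps buffers needs room for two such caches per decoder (input and output) while a step is in flight.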

This contract is what makes parallel decoding, mid-generation cancellation, and deterministic replay possible from the C++ side.
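
A small sketch of why the contract helps with retry and cancellation: since a step never modifies its input state, the worker can call it again with the same inputs after a failure, or drop the result if generation was cancelled. The helper name `step_with_retry`, the use of `RuntimeError` for a transient failure, and the `threading.Event` cancel flag are assumptions made for the example.

```python
import threading
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]


def step_with_retry(
    token_step: Callable[[State, int], State],
    state: State,
    position: int,
    cancel: threading.Event,
    max_attempts: int = 3,
) -> Optional[State]:
    """Run one step, retrying on failure; return None if cancelled."""
    for attempt in range(max_attempts):
        if cancel.is_set():
            return None  # caller keeps the last committed state
        try:
            # Same inputs on every attempt: `state` is never modified here.
            return token_step(state, position)
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
    return None
```

The caller commits the returned state only when it is not None, so a cancelled or failed step leaves the previous cache intact and replayable.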

Source

Converted from the upstream PyTorch weights at openbmb/VoxCPM2 using litert_torch + ai_edge_quantizer's dynamic_wi8_afp32 recipe.

License

Apache 2.0 (inherited from upstream openbmb/VoxCPM2).

Responsible use

Voice cloning capability is included. Users are responsible for obtaining consent for any voice that is cloned and for not using the model to impersonate individuals without their permission, generate disinformation, or commit fraud.
