VoxCPM2 LiteRT (INT8)
LiteRT / TensorFlow Lite port of openbmb/VoxCPM2, a 2B-parameter multilingual diffusion-autoregressive TTS model with 48 kHz studio-quality output, voice cloning, and instruction-driven voice design.
This bundle ships the model as four separate LiteRT graphs plus a manifest. The on-device worker is expected to orchestrate the loop:
text_prefill ─► token_step (×N) ─► audio_decode
Reference audio is encoded once via audio_encoder. The K/V cache for the LM and residual decoder is owned by the host worker (not mutated inside the graph), which lets the runtime own retry / idempotency semantics.
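The loop described above can be sketched host-side as follows. This is a minimal illustration, not the SDK's worker: the four callables stand in for whatever wrappers an integrator writes around the LiteRT graphs, and `StepState`, `synthesize`, the callable signatures, and the choice to keep the feature emitted on the stop step are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class StepState:
    """What the worker carries between token_step calls (hypothetical grouping)."""
    hidden: Any      # LM / residual hidden states plus conditioning
    kv_cache: Any    # K/V cache, owned by the worker and never mutated by the graph
    position: int    # position id for the next step


def synthesize(
    text_prefill: Callable[[List[int], Optional[Any]], StepState],
    token_step: Callable[[StepState], Tuple[Any, bool, StepState]],
    audio_decode: Callable[[List[Any]], Any],
    token_ids: List[int],
    audio_encoder: Optional[Callable[[Any], Any]] = None,
    reference_pcm: Optional[Any] = None,
    max_generated_tokens: int = 2048,
) -> Any:
    # Reference audio, when present, is encoded exactly once.
    prefix = None
    if audio_encoder is not None and reference_pcm is not None:
        prefix = audio_encoder(reference_pcm)

    state = text_prefill(token_ids, prefix)

    features: List[Any] = []
    for _ in range(max_generated_tokens):
        feature, stop, next_state = token_step(state)
        features.append(feature)  # assumption: the stop step's feature is kept
        # The step returned a new state; the worker chooses when to adopt it,
        # so the previous state stays valid for a retry until this line runs.
        state = next_state
        if stop:
            break

    return audio_decode(features)
```

Because `token_step` returns a new state instead of changing the one it was given, a failed or cancelled step can be dropped without any cleanup.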
Part of soniqo.audio, an on-device speech toolkit. Consumed by the Android SDK at speech-android.
Status: experimental. The token_step graph depends on litert_torch static K/V-cache lowering; integrators should validate numerical parity end-to-end before relying on this bundle in production.
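One way to approach the parity validation mentioned above is to run the same input through the upstream PyTorch model and the LiteRT graphs, then compare the flattened outputs. The helpers below are a generic sketch in plain Python; the function names are hypothetical, and acceptable thresholds are left to the integrator because this card does not specify any.

```python
import math
from typing import Sequence


def max_abs_error(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flattened outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def snr_db(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Signal-to-noise ratio of candidate against reference, in decibels."""
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length")
    signal = sum(r * r for r in reference)
    noise = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    if noise == 0.0:
        return math.inf   # bit-exact match
    if signal == 0.0:
        return -math.inf  # silent reference, non-silent candidate
    return 10.0 * math.log10(signal / noise)
```

Checking per-step features as well as the final waveform helps locate where drift starts, since small errors in an autoregressive loop can compound over many steps.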
Capabilities
- 30 languages including English, Chinese, Indonesian, Japanese, Korean
- 48 kHz output
- Zero-shot synthesis: generate speech from text alone
- Voice cloning: clone a target speaker from a single reference clip
- Voice design: natural-language style control (e.g. "young female voice, warm and gentle")
- Ultimate cloning: reference audio + transcript for prosody-preserving cloning
Files
| File | Variant | Role |
|---|---|---|
| voxcpm2-text-prefill.tflite | INT8 weights / FP32 activations | Encode text + (optional) reference-audio prefix into LM hidden states, residual hidden, prefix feature conditioning, and the initial K/V caches. |
| voxcpm2-token-step.tflite | INT8 weights / FP32 activations | One AR step. Takes current LM / residual hidden, conditioning, K/V cache and position id. Emits the next predicted feature, stop logits, updated hidden states, and updated K/V cache. |
| voxcpm2-audio-encoder.tflite | FP32 | Encode a reference clip (16 kHz PCM) into the patch features that condition the prefill. |
| voxcpm2-audio-decoder.tflite | FP32 | Decode latent → 48 kHz PCM via the upstream AudioVAE. |
| config.json | - | Manifest: tensor signatures, sample rates, default CFG / step counts, file mapping. |
| tokenizer.json / tokenizer_config.json / special_tokens_map.json / generation_config.json | - | HF tokenizer + generation defaults. |
| tokenization_voxcpm2.py | - | Upstream tokenizer source (kept for parity with the HF model). |
The conv-heavy encoder and decoder are intentionally kept FP32: the dynamic_wi8_afp32 recipe does not lower the conv kernels that VoxCPM2's AudioVAE relies on, and quantising the vocoder has historically risked audible artifacts.
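Before loading any graph, a worker may want to confirm that a downloaded bundle is complete. The sketch below checks a local directory against the file names listed in the table above; `missing_bundle_files` and `EXPECTED_FILES` are hypothetical names for this example, and the check relies only on the file names, not on the contents of config.json.

```python
from pathlib import Path
from typing import List

# File names as listed in this card's Files table.
EXPECTED_FILES = (
    "voxcpm2-text-prefill.tflite",
    "voxcpm2-token-step.tflite",
    "voxcpm2-audio-encoder.tflite",
    "voxcpm2-audio-decoder.tflite",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "generation_config.json",
    "tokenization_voxcpm2.py",
)


def missing_bundle_files(bundle_dir: str) -> List[str]:
    """Return the expected files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result means every listed file is present; it says nothing about whether the files are intact.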
Default decoding parameters
| Parameter | Default |
|---|---|
| max_text_tokens (context) | 512 |
| max_generated_tokens | 2048 |
| inference_timesteps (CFM) | 10 |
| cfg_value | 2.0 |
| Sample rate (output) | 48 000 Hz |
| Sample rate (audio conditioning) | 16 000 Hz |
These mirror the host-side defaults exposed in config.json; runtimes are free to override them.
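A runtime that overrides these defaults might hold them in a small immutable record. The sketch below hard-codes the values from the table above instead of reading config.json, because the manifest's key names are not documented here; `DecodingParams`, `duration_seconds`, and the two sample-rate field names are assumptions for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecodingParams:
    """Defaults copied from the table above; the two *_hz field names are hypothetical."""
    max_text_tokens: int = 512
    max_generated_tokens: int = 2048
    inference_timesteps: int = 10
    cfg_value: float = 2.0
    output_sample_rate_hz: int = 48_000
    conditioning_sample_rate_hz: int = 16_000


def duration_seconds(num_samples: int, params: DecodingParams = DecodingParams()) -> float:
    """Length of decoded mono PCM at the output sample rate."""
    return num_samples / params.output_sample_rate_hz


# Override a single value without touching the shared defaults.
fast = replace(DecodingParams(), inference_timesteps=6)
```

Keeping the record frozen means an override always produces a new object, so a set of parameters handed to one in-flight request cannot be changed by another.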
Token-step cache contract
The token-step graph takes and returns the LM and residual K/V cache as explicit inputs/outputs. Cache layout:
[2, layers, batch, kv_heads, max_cache_length, head_dim]
- Axis 0: [K, V]
- Axis 4 is sized to max_text_tokens + max_generated_tokens and pre-allocated by the worker.
- The graph does not mutate the cache buffers in place; it produces updated tensors which the worker copies / swaps.
This contract is what makes parallel decoding, mid-generation cancellation, and deterministic replay possible from the C++ side.
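To size the pre-allocated buffers, the layout above can be turned into a shape and a byte count. With the default limits, axis 4 comes to 512 + 2048 = 2560 positions. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are placeholders that should be read from config.json, and the 4-byte element size assumes an FP32 cache, which this card does not state explicitly.

```python
from math import prod
from typing import Tuple


def kv_cache_shape(
    layers: int,
    kv_heads: int,
    head_dim: int,
    batch: int = 1,
    max_text_tokens: int = 512,
    max_generated_tokens: int = 2048,
) -> Tuple[int, ...]:
    """Shape in the layout [2, layers, batch, kv_heads, max_cache_length, head_dim]."""
    max_cache_length = max_text_tokens + max_generated_tokens
    return (2, layers, batch, kv_heads, max_cache_length, head_dim)


def kv_cache_bytes(shape: Tuple[int, ...], bytes_per_element: int = 4) -> int:
    """Buffer size in bytes; 4 bytes per element assumes an FP32 cache."""
    return prod(shape) * bytes_per_element


# Placeholder dimensions, NOT taken from the model; read the real ones from config.json.
shape = kv_cache_shape(layers=24, kv_heads=2, head_dim=128)
size_bytes = kv_cache_bytes(shape)
```

Since the graph returns updated tensors instead of writing in place, a worker that swaps buffers needs two allocations of this size per cache (one as input, one as output), and both the LM and the residual decoder carry their own cache.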
Source
Converted from the upstream PyTorch weights at openbmb/VoxCPM2 using litert_torch + ai_edge_quantizer's dynamic_wi8_afp32 recipe.
Links
- speech-android: Android SDK
- soniqo.audio: website
- blog: blog
License
Apache 2.0 (inherited from upstream openbmb/VoxCPM2).
Responsible use
Voice cloning capability is included. Users are responsible for obtaining consent for any voice that is cloned and for not using the model to impersonate individuals without their permission, generate disinformation, or commit fraud.