# OmniVoice-LiteRT: On-Device TTS Export for Android
Exported and optimized weights from k2-fsa/OmniVoice for on-device inference on Android via LiteRT (formerly TensorFlow Lite).
OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting 600+ languages with voice cloning and voice design capabilities.
## Model Architecture
OmniVoice uses a non-autoregressive (NAR) Diffusion Language Model architecture, which means:
- No KV cache (constant memory per step)
- Parallel token prediction across all positions
- Only 64MB activation overhead during inference
The model is split into two independently loadable modules:
| Component | Params | FP16 Size | Purpose |
|---|---|---|---|
| Backbone (Qwen3-based LLM) | 612.6M | 1.2GB | Text + masked tokens → audio logits (runs N times) |
| Decoder (HiggsAudioV2) | 200.3M | 0.4GB | Audio codes → 24kHz waveform (runs once) |
| Total | 812.9M | 1.6GB | |
### Key Architecture Details
- LLM backbone: Qwen3 (28 layers, 1024 hidden, 16 heads) with bidirectional attention (NOT causal)
- Audio codebooks: 8 codebooks × 1025 vocab (1024 tokens + 1 mask token)
- Frame rate: 25 fps (40ms per frame)
- Output sample rate: 24kHz
- Attention: Must use full bidirectional attention mask (all True), NOT causal mask
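The frame-rate and codebook figures above fix the token budget for any target duration, so the number of [MASK] positions is simple arithmetic. A minimal sketch using the constants from this card (the helper names are illustrative, not part of the export):

```python
# Sketch: relating audio duration to mask-token count and waveform length.
# Constants are taken from the architecture details above.
FRAME_RATE = 25          # frames per second (40 ms per frame)
NUM_CODEBOOKS = 8        # parallel codebooks, each with 1025-way vocab
SAMPLE_RATE = 24_000     # output waveform sample rate

def mask_token_count(duration_s: float) -> int:
    """Number of [MASK] positions to append for `duration_s` of target audio."""
    frames = int(round(duration_s * FRAME_RATE))
    return frames * NUM_CODEBOOKS

def samples_out(duration_s: float) -> int:
    """Expected waveform length in samples at 24 kHz."""
    return int(round(duration_s * SAMPLE_RATE))
```

For example, 5 s of target audio corresponds to 125 frames, i.e. 1000 mask positions across the 8 codebooks, and a 120,000-sample waveform.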
## Files

| File | Size | Description |
|---|---|---|
| `omnivoice_backbone_exported.pt2` | 2.3GB | torch.export'd backbone (fp32) |
| `omnivoice_decoder_exported.pt2` | 770MB | torch.export'd decoder (fp32) |
| `omnivoice_backbone_state.pt` | 2.3GB | Backbone state dict (fp32) |
| `omnivoice_decoder_state.pt` | 769MB | Decoder state dict (fp32) |
| `export_backbone.py` | 5.3KB | Backbone module definition + export script |
| `export_decoder.py` | 2.9KB | Decoder module definition + export script |
| `test_e2e_v2.py` | 9.7KB | Full E2E inference script (simulates Android) |
## Performance (Jetson AGX Orin, FP16)
| Metric | Value |
|---|---|
| Load time | ~13s (cached) |
| Diffusion (32 steps, 5s audio) | ~4s |
| Decode | ~0.1-0.3s |
| RTF | 0.65-0.94 |
| GPU memory (backbone) | 2,031 MB |
| Peak inference memory | 2,095 MB |
| Activation overhead | 64 MB only |
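The RTF row follows directly from the timing rows above, since RTF is wall-clock generation time divided by audio duration (lower is faster). A quick sanity check against a 5 s utterance:

```python
# Sanity check: real-time factor from the step timings above.
# RTF = wall-clock generation time / audio duration (lower is faster).
def rtf(diffusion_s: float, decode_s: float, audio_s: float) -> float:
    return (diffusion_s + decode_s) / audio_s

# 5 s of audio with ~4 s diffusion and 0.1-0.3 s decode lands inside
# the 0.65-0.94 range reported in the table:
fast = rtf(4.0, 0.1, 5.0)   # about 0.82
slow = rtf(4.0, 0.3, 5.0)   # about 0.86
```

The wider 0.65-0.94 range in the table presumably reflects run-to-run variation in the diffusion time itself.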
### Android Estimate
| Config | Model Size | RAM Needed | RTF (est.) |
|---|---|---|---|
| FP16 + GPU delegate | ~1.6GB | ~2.5GB | 1-3x (16 steps) |
| FP16 + 16 steps | ~1.6GB | ~2.5GB | 2x faster |
| INT8 | ~800MB | ~1.3GB | ⚠️ Quality degradation |
Recommendation: Use FP16 with GPU delegate on Android. INT8 causes audible quality loss due to error accumulation over 32 diffusion steps.
## Inference Pipeline

The diffusion loop runs in app code (Kotlin/Python), not inside the model. Each step is a single forward pass:

```python
from export_backbone import OmniVoiceBackbone
from export_decoder import OmniVoiceDecoder

# Load models
backbone = OmniVoiceBackbone(model).to("cuda:0").half().eval()
decoder = OmniVoiceDecoder(model.audio_tokenizer).eval()

# Prepare input with special tokens:
# <|denoise|><|lang_start|>None<|lang_end|><|instruct_start|>None<|instruct_end|>
# <|text_start|>{text}<|text_end|>
# [MASK tokens for target audio]

# Diffusion loop (16-32 steps)
for step in range(num_steps):
    logits = backbone(input_ids, audio_mask)  # [B, 8, S, 1025]
    # CFG: cond + guidance_scale * (cond - uncond)
    # Sample tokens, unmask top-k confident positions
    # Layer penalty: encourage lower codebooks first

# Decode to audio
waveform = decoder(generated_tokens)  # [1, 1, samples] @ 24kHz
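The per-step update that the loop comments describe can be sketched as below. This is a hypothetical illustration of CFG plus confidence-based unmasking, not the reference implementation: the exact confidence measure, unmasking schedule, and layer penalty in the real loop may differ.

```python
import torch

# Illustrative sketch of one diffusion-step update: combine logits with
# classifier-free guidance, then unmask the top-k most confident positions
# that are still masked. Assumptions: per-position confidence is the max
# softmax probability; k is supplied by the caller's schedule.
MASK_ID = 1024  # the extra "mask" entry in each 1025-way codebook vocab

def unmask_step(cond_logits, uncond_logits, tokens, guidance_scale, n_unmask):
    """tokens: [8, S] current codes (MASK_ID where still masked);
    *_logits: [8, S, 1025] from the conditional / unconditional passes."""
    logits = cond_logits + guidance_scale * (cond_logits - uncond_logits)
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)                 # [8, S] per-position
    masked = tokens == MASK_ID
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    k = min(n_unmask, int(masked.sum()))           # never unmask more than remain
    top = conf.flatten().topk(k).indices           # most confident masked slots
    out = tokens.flatten().clone()
    out[top] = pred.flatten()[top]
    return out.view_as(tokens)
```

Running this 16-32 times with a decreasing mask count recovers the full [8, S] code grid that the decoder then turns into a waveform in a single pass.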
## Voice Cloning

For voice cloning, encode reference audio to tokens and prepend:

```python
# Encode reference audio → [8, T_ref] tokens
ref_tokens = audio_tokenizer.encode(ref_audio)

# Input: [style_tokens | text_tokens | ref_audio_tokens | MASK_target]
# The model learns speaker characteristics from ref_tokens
```
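A minimal sketch of assembling that input layout. The segment order follows the comment above; the helper itself is a hypothetical illustration (the real pipeline also interleaves the special text tokens shown in the inference section):

```python
import torch

# Hypothetical helper: prepend reference-audio codes to a fully masked
# target grid, per the [... | ref_audio_tokens | MASK_target] layout above.
MASK_ID = 1024

def build_cloning_input(text_tokens, ref_tokens, target_frames):
    """text_tokens: [T_text] prompt ids; ref_tokens: [8, T_ref] codes from
    audio_tokenizer.encode(ref_audio); target_frames: frames to generate."""
    target = torch.full((8, target_frames), MASK_ID, dtype=torch.long)
    audio = torch.cat([ref_tokens, target], dim=1)  # [8, T_ref + target_frames]
    return text_tokens, audio
```

During denoising, only the masked target region is updated; the reference codes stay fixed and condition the speaker identity.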
## Critical: Bidirectional Attention

⚠️ The backbone LLM MUST use bidirectional (full) attention, NOT causal:

```python
attention_mask = torch.ones(B, 1, S, S, dtype=torch.bool)  # ALL True
```

Using causal attention produces gibberish output.
## Supported Modes
| Mode | Input | Description |
|---|---|---|
| Auto voice | text only | Model picks a random voice |
| Voice cloning | text + ref_audio + ref_text | Clone voice from reference |
| Voice design | text + instruct | Describe voice: "male, low pitch, british accent" |
## Quantization Notes
- INT8 weight-only: Saves ~384MB GPU but causes audible quality degradation due to error accumulation across 32 diffusion steps. Not recommended.
- FP16: Best quality-to-size ratio. Mobile GPUs natively support FP16.
- FP32: Used for export. Convert to FP16 at load time.
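The load-time fp32 → fp16 conversion is a one-line `.half()` call; a minimal sketch with a `Linear` layer standing in for the real modules. The exact 2× shrink is what takes the 2.3GB fp32 export down to the ~1.2GB fp16 backbone footprint listed above:

```python
import torch

# Sketch: convert fp32 weights to fp16 at load time and verify the 2x
# size reduction. A small Linear stands in for the exported modules.
def param_bytes(module: torch.nn.Module) -> int:
    return sum(p.numel() * p.element_size() for p in module.parameters())

m = torch.nn.Linear(1024, 1024)
fp32_bytes = param_bytes(m)   # 4 bytes per parameter
m = m.half()                  # in-place dtype conversion to fp16
fp16_bytes = param_bytes(m)   # 2 bytes per parameter
```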
## Conversion to .tflite
The .pt2 files can be converted to LiteRT format. See CONVERSION_GUIDE.md for full step-by-step instructions.
Requirements: x86 Linux machine (litert-torch needs ai-edge-tensorflow, no aarch64 wheels).
```python
# On an x86 machine
import litert_torch

edge_model = litert_torch.convert(backbone.eval(), sample_inputs)
edge_model.export("omnivoice_backbone.tflite")
```
Alternative: use a DigitalOcean Droplet or any x86 cloud VM. Conversion is lossless: the converted model produces output identical to PyTorch.
## Credits
- Original model: k2-fsa/OmniVoice (Apache 2.0)
- Paper: OmniVoice: Towards Omnilingual Zero-Shot TTS with Diffusion Language Models
- Authors: Han Zhu, Lingxuan Ye, Wei Kang, et al. (Xiaomi Corp.)
- Export & testing: Samsul Rahmadani (@acul3)