# OmniVoice-LiteRT: On-Device TTS Export for Android
Exported and optimized weights from k2-fsa/OmniVoice for on-device inference on Android via LiteRT (formerly TensorFlow Lite).
OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting 600+ languages with voice cloning and voice design capabilities.
## Model Architecture
OmniVoice uses a non-autoregressive (NAR) Diffusion Language Model architecture, which means:
- No KV cache (constant memory per step)
- Parallel token prediction across all positions
- Only 64MB activation overhead during inference
The model is split into two independently loadable modules:
| Component | Params | FP16 Size | Purpose |
|---|---|---|---|
| Backbone (Qwen3-based LLM) | 612.6M | 1.2GB | Text + masked tokens → audio logits (runs N times) |
| Decoder (HiggsAudioV2) | 200.3M | 0.4GB | Audio codes → 24kHz waveform (runs once) |
| Total | 812.9M | 1.6GB | |
### Key Architecture Details
- LLM backbone: Qwen3 (28 layers, 1024 hidden, 16 heads) with bidirectional attention (NOT causal)
- Audio codebooks: 8 codebooks × 1025 vocab (1024 tokens + 1 mask token)
- Frame rate: 25 fps (40ms per frame)
- Output sample rate: 24kHz
- Attention: Must use full bidirectional attention mask (all True), NOT causal mask
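The frame-rate and codebook figures above fix the token budget for any target duration, so the number of [MASK] positions is simple arithmetic. A minimal sketch using the constants from this card (the helper names are illustrative, not part of the export):

```python
# Sketch: relating audio duration to mask-token count and waveform length.
# Constants are taken from the architecture details above.
FRAME_RATE = 25          # frames per second (40 ms per frame)
NUM_CODEBOOKS = 8        # parallel codebooks, each with 1025-way vocab
SAMPLE_RATE = 24_000     # output waveform sample rate

def mask_token_count(duration_s: float) -> int:
    """Number of [MASK] positions to append for `duration_s` of target audio."""
    frames = int(round(duration_s * FRAME_RATE))
    return frames * NUM_CODEBOOKS

def samples_out(duration_s: float) -> int:
    """Expected waveform length in samples at 24 kHz."""
    return int(round(duration_s * SAMPLE_RATE))
```

For example, 5 s of target audio corresponds to 125 frames, i.e. 1000 mask positions across the 8 codebooks, and a 120,000-sample waveform.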
## Files

| File | Size | Description |
|---|---|---|
| `omnivoice_backbone_exported.pt2` | 2.3GB | torch.export'd backbone (fp32) |
| `omnivoice_decoder_exported.pt2` | 770MB | torch.export'd decoder (fp32) |
| `omnivoice_backbone_state.pt` | 2.3GB | Backbone state dict (fp32) |
| `omnivoice_decoder_state.pt` | 769MB | Decoder state dict (fp32) |
| `export_backbone.py` | 5.3KB | Backbone module definition + export script |
| `export_decoder.py` | 2.9KB | Decoder module definition + export script |
| `test_e2e_v2.py` | 9.7KB | Full E2E inference script (simulates Android) |
## Performance (Jetson AGX Orin, FP16)
| Metric | Value |
|---|---|
| Load time | ~13s (cached) |
| Diffusion (32 steps, 5s audio) | ~4s |
| Decode | ~0.1-0.3s |
| RTF | 0.65-0.94 |
| GPU memory (backbone) | 2,031 MB |
| Peak inference memory | 2,095 MB |
| Activation overhead | 64 MB only |
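The RTF row follows directly from the timing rows above, since RTF is wall-clock generation time divided by audio duration (lower is faster). A quick sanity check against a 5 s utterance:

```python
# Sanity check: real-time factor from the step timings above.
# RTF = wall-clock generation time / audio duration (lower is faster).
def rtf(diffusion_s: float, decode_s: float, audio_s: float) -> float:
    return (diffusion_s + decode_s) / audio_s

# 5 s of audio with ~4 s diffusion and 0.1-0.3 s decode lands inside
# the 0.65-0.94 range reported in the table:
fast = rtf(4.0, 0.1, 5.0)   # about 0.82
slow = rtf(4.0, 0.3, 5.0)   # about 0.86
```

The wider 0.65-0.94 range in the table presumably reflects run-to-run variation in the diffusion time itself.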
### Android Estimate
| Config | Model Size | RAM Needed | RTF (est.) |
|---|---|---|---|
| FP16 + GPU delegate | ~1.6GB | ~2.5GB | 1-3x (16 steps) |
| FP16 + 16 steps | ~1.6GB | ~2.5GB | 2x faster |
| INT8 | ~800MB | ~1.3GB | ⚠️ Quality degradation |
Recommendation: Use FP16 with GPU delegate on Android. INT8 causes audible quality loss due to error accumulation over 32 diffusion steps.
## Inference Pipeline

The diffusion loop runs in app code (Kotlin/Python), not inside the model. Each step is a single forward pass:

```python
from export_backbone import OmniVoiceBackbone
from export_decoder import OmniVoiceDecoder

# Load models
backbone = OmniVoiceBackbone(model).to("cuda:0").half().eval()
decoder = OmniVoiceDecoder(model.audio_tokenizer).eval()

# Prepare input with special tokens:
# <|denoise|><|lang_start|>None<|lang_end|><|instruct_start|>None<|instruct_end|>
# <|text_start|>{text}<|text_end|>
# [MASK tokens for target audio]

# Diffusion loop (16-32 steps)
for step in range(num_steps):
    logits = backbone(input_ids, audio_mask)  # [B, 8, S, 1025]
    # CFG: cond + guidance_scale * (cond - uncond)
    # Sample tokens, unmask top-k confident positions
    # Layer penalty: encourage lower codebooks first

# Decode to audio
waveform = decoder(generated_tokens)  # [1, 1, samples] @ 24kHz
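The per-step update that the loop comments describe can be sketched as below. This is a hypothetical illustration of CFG plus confidence-based unmasking, not the reference implementation: the exact confidence measure, unmasking schedule, and layer penalty in the real loop may differ.

```python
import torch

# Illustrative sketch of one diffusion-step update: combine logits with
# classifier-free guidance, then unmask the top-k most confident positions
# that are still masked. Assumptions: per-position confidence is the max
# softmax probability; k is supplied by the caller's schedule.
MASK_ID = 1024  # the extra "mask" entry in each 1025-way codebook vocab

def unmask_step(cond_logits, uncond_logits, tokens, guidance_scale, n_unmask):
    """tokens: [8, S] current codes (MASK_ID where still masked);
    *_logits: [8, S, 1025] from the conditional / unconditional passes."""
    logits = cond_logits + guidance_scale * (cond_logits - uncond_logits)
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)                 # [8, S] per-position
    masked = tokens == MASK_ID
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    k = min(n_unmask, int(masked.sum()))           # never unmask more than remain
    top = conf.flatten().topk(k).indices           # most confident masked slots
    out = tokens.flatten().clone()
    out[top] = pred.flatten()[top]
    return out.view_as(tokens)
```

Running this 16-32 times with a decreasing mask count recovers the full [8, S] code grid that the decoder then turns into a waveform in a single pass.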
## Voice Cloning

For voice cloning, encode reference audio to tokens and prepend:

```python
# Encode reference audio → [8, T_ref] tokens
ref_tokens = audio_tokenizer.encode(ref_audio)

# Input: [style_tokens | text_tokens | ref_audio_tokens | MASK_target]
# The model learns speaker characteristics from ref_tokens
```
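A minimal sketch of assembling that input layout. The segment order follows the comment above; the helper itself is a hypothetical illustration (the real pipeline also interleaves the special text tokens shown in the inference section):

```python
import torch

# Hypothetical helper: prepend reference-audio codes to a fully masked
# target grid, per the [... | ref_audio_tokens | MASK_target] layout above.
MASK_ID = 1024

def build_cloning_input(text_tokens, ref_tokens, target_frames):
    """text_tokens: [T_text] prompt ids; ref_tokens: [8, T_ref] codes from
    audio_tokenizer.encode(ref_audio); target_frames: frames to generate."""
    target = torch.full((8, target_frames), MASK_ID, dtype=torch.long)
    audio = torch.cat([ref_tokens, target], dim=1)  # [8, T_ref + target_frames]
    return text_tokens, audio
```

During denoising, only the masked target region is updated; the reference codes stay fixed and condition the speaker identity.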
## Critical: Bidirectional Attention

⚠️ The backbone LLM MUST use bidirectional (full) attention, NOT causal:

```python
attention_mask = torch.ones(B, 1, S, S, dtype=torch.bool)  # ALL True
```

Using causal attention produces gibberish output.
## Supported Modes
| Mode | Input | Description |
|---|---|---|
| Auto voice | text only | Model picks a random voice |
| Voice cloning | text + ref_audio + ref_text | Clone voice from reference |
| Voice design | text + instruct | Describe voice: "male, low pitch, british accent" |
## Quantization Notes
- INT8 weight-only: Saves ~384MB GPU but causes audible quality degradation due to error accumulation across 32 diffusion steps. Not recommended.
- FP16: Best quality-to-size ratio. Mobile GPUs natively support FP16.
- FP32: Used for export. Convert to FP16 at load time.
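The load-time fp32 → fp16 conversion is a one-line `.half()` call; a minimal sketch with a `Linear` layer standing in for the real modules. The exact 2× shrink is what takes the 2.3GB fp32 export down to the ~1.2GB fp16 backbone footprint listed above:

```python
import torch

# Sketch: convert fp32 weights to fp16 at load time and verify the 2x
# size reduction. A small Linear stands in for the exported modules.
def param_bytes(module: torch.nn.Module) -> int:
    return sum(p.numel() * p.element_size() for p in module.parameters())

m = torch.nn.Linear(1024, 1024)
fp32_bytes = param_bytes(m)   # 4 bytes per parameter
m = m.half()                  # in-place dtype conversion to fp16
fp16_bytes = param_bytes(m)   # 2 bytes per parameter
```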
## Conversion to .tflite
The .pt2 files can be converted to LiteRT format. See CONVERSION_GUIDE.md for full step-by-step instructions.
Requirements: x86 Linux machine (litert-torch needs ai-edge-tensorflow, no aarch64 wheels).
```python
# On an x86 machine
import litert_torch

edge_model = litert_torch.convert(backbone.eval(), sample_inputs)
edge_model.export("omnivoice_backbone.tflite")
```
Alternative: use a DigitalOcean Droplet or any x86 cloud VM. Conversion is lossless: the converted model produces output identical to PyTorch.
## Credits
- Original model: k2-fsa/OmniVoice (Apache 2.0)
- Paper: OmniVoice: Towards Omnilingual Zero-Shot TTS with Diffusion Language Models
- Authors: Han Zhu, Lingxuan Ye, Wei Kang, et al. (Xiaomi Corp.)
- Export & testing: Samsul Rahmadani (@acul3)