OmniVoice-LiteRT: On-Device TTS Export for Android

Exported and optimized weights from k2-fsa/OmniVoice for on-device inference on Android via LiteRT (formerly TFLite).

OmniVoice is a state-of-the-art zero-shot multilingual TTS model supporting 600+ languages with voice cloning and voice design capabilities.

Model Architecture

OmniVoice uses a non-autoregressive (NAR) Diffusion Language Model architecture, which means:

  • No KV cache (constant memory per step)
  • Parallel token prediction across all positions
  • Only 64MB activation overhead during inference
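To put the no-KV-cache point in numbers, here is a back-of-envelope sketch of the cache an autoregressive model of the same shape would accumulate (assuming standard multi-head attention with the 28-layer, 1024-hidden backbone described below; grouped-query sharing would shrink this, so treat it as an upper bound):

```python
# Hypothetical KV-cache cost an autoregressive model of the backbone's
# shape would pay, which the NAR design avoids entirely.
LAYERS, HIDDEN, BYTES_FP16 = 28, 1024, 2

# K and V each store HIDDEN values per layer, per token.
kv_bytes_per_token = 2 * LAYERS * HIDDEN * BYTES_FP16  # 114_688 bytes (~112 KB)

# A 30 s utterance at 25 fps is 750 frames:
frames = 30 * 25
kv_cache_mb = kv_bytes_per_token * frames / 2**20
print(f"{kv_cache_mb:.0f} MB")  # ~82 MB of growing cache, vs a flat 64 MB here
```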

The model is split into two independently loadable modules:

| Component | Params | FP16 Size | Purpose |
|---|---|---|---|
| Backbone (Qwen3-based LLM) | 612.6M | 1.2GB | Text + masked tokens → audio logits (runs N times) |
| Decoder (HiggsAudioV2) | 200.3M | 0.4GB | Audio codes → 24kHz waveform (runs once) |
| Total | 812.9M | 1.6GB | |

Key Architecture Details

  • LLM backbone: Qwen3 (28 layers, 1024 hidden, 16 heads) with bidirectional attention (NOT causal)
  • Audio codebooks: 8 codebooks × 1025 vocab (1024 tokens + 1 mask token)
  • Frame rate: 25 fps (40ms per frame)
  • Output sample rate: 24kHz
  • Attention: Must use full bidirectional attention mask (all True), NOT causal mask
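The codebook and frame-rate figures above pin down the token budget for any clip length; a quick sanity check (the helper name is illustrative):

```python
# Tokens needed for a target utterance at 25 fps with 8 codebooks.
FPS, CODEBOOKS = 25, 8

def audio_token_count(seconds: float) -> tuple[int, int]:
    """Return (frames, total_tokens) for `seconds` of audio."""
    frames = int(seconds * FPS)          # 40 ms per frame
    return frames, frames * CODEBOOKS    # one token per codebook per frame

print(audio_token_count(5.0))  # (125, 1000): 5 s of audio = 125 masked frames
```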

Files

| File | Size | Description |
|---|---|---|
| `omnivoice_backbone_exported.pt2` | 2.3GB | torch.export'd backbone (fp32) |
| `omnivoice_decoder_exported.pt2` | 770MB | torch.export'd decoder (fp32) |
| `omnivoice_backbone_state.pt` | 2.3GB | Backbone state dict (fp32) |
| `omnivoice_decoder_state.pt` | 769MB | Decoder state dict (fp32) |
| `export_backbone.py` | 5.3KB | Backbone module definition + export script |
| `export_decoder.py` | 2.9KB | Decoder module definition + export script |
| `test_e2e_v2.py` | 9.7KB | Full E2E inference script (simulates Android) |

Performance (Jetson AGX Orin, FP16)

| Metric | Value |
|---|---|
| Load time | ~13s (cached) |
| Diffusion (32 steps, 5s audio) | ~4s |
| Decode | ~0.1-0.3s |
| RTF | 0.65-0.94 |
| GPU memory (backbone) | 2,031 MB |
| Peak inference memory | 2,095 MB |
| Activation overhead | 64 MB |
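RTF (real-time factor) is generation time divided by audio duration; plugging the table's timings into that definition reproduces the reported range:

```python
def rtf(diffusion_s: float, decode_s: float, audio_s: float) -> float:
    """Real-time factor: generation time / audio duration (< 1.0 is faster than real time)."""
    return (diffusion_s + decode_s) / audio_s

print(round(rtf(4.0, 0.2, 5.0), 2))  # 0.84, within the reported 0.65-0.94
```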

Android Estimate

| Config | Model Size | RAM Needed | RTF (est.) |
|---|---|---|---|
| FP16 + GPU delegate | ~1.6GB | ~2.5GB | 1-3x (16 steps) |
| FP16 + 16 steps | ~1.6GB | ~2.5GB | ~2x faster |
| INT8 | ~800MB | ~1.3GB | ⚠️ Quality degradation |

Recommendation: Use FP16 with GPU delegate on Android. INT8 causes audible quality loss due to error accumulation over 32 diffusion steps.

Inference Pipeline

The diffusion loop runs in app code (Kotlin/Python), not inside the model. Each step is a single forward pass:

from export_backbone import OmniVoiceBackbone
from export_decoder import OmniVoiceDecoder

# Load models (`model` here is the upstream OmniVoice checkpoint object)
backbone = OmniVoiceBackbone(model).to("cuda:0").half().eval()
decoder = OmniVoiceDecoder(model.audio_tokenizer).eval()

# Prepare input with special tokens
# <|denoise|><|lang_start|>None<|lang_end|><|instruct_start|>None<|instruct_end|>
# <|text_start|>{text}<|text_end|>
# [MASK tokens for target audio]

# Diffusion loop (16-32 steps)
for step in range(num_steps):
    logits = backbone(input_ids, audio_mask)  # [B, 8, S, 1025]
    # CFG: cond + guidance_scale * (cond - uncond)
    # Sample tokens, unmask top-k confident positions
    # Layer penalty: encourage lower codebooks first

# Decode to audio
waveform = decoder(generated_tokens)  # [1, 1, samples] @ 24kHz
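The comments inside the loop above can be fleshed out. Here is a minimal NumPy sketch of one iteration: CFG-combine conditional and unconditional logits, then commit only the k most confident masked positions (function and variable names are illustrative, not the repo's API, and the layer penalty is omitted for brevity):

```python
import numpy as np

MASK_ID = 1024  # index of the mask token in the 1025-entry vocab

def diffusion_step(cond_logits, uncond_logits, tokens, guidance_scale=2.0, k=16):
    """One unmasking step. Shapes: logits [8, S, 1025], tokens [8, S]."""
    # Classifier-free guidance: cond + scale * (cond - uncond)
    logits = cond_logits + guidance_scale * (cond_logits - uncond_logits)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)

    pred = probs.argmax(-1)            # greedy sample per position
    conf = probs.max(-1)               # confidence of that sample
    conf[tokens != MASK_ID] = -np.inf  # only still-masked slots are candidates

    # Unmask the k most confident positions across all codebooks.
    flat = np.argsort(conf, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, conf.shape)
    out = tokens.copy()
    out[rows, cols] = pred[rows, cols]
    return out
```

Running this `num_steps` times, with k scheduled so all positions are committed by the final step, mirrors the loop sketched above.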

Voice Cloning

For voice cloning, encode reference audio to tokens and prepend:

# Encode reference audio → [8, T_ref] tokens
ref_tokens = audio_tokenizer.encode(ref_audio)

# Input: [style_tokens | text_tokens | ref_audio_tokens | MASK_target]
# The model learns speaker characteristics from ref_tokens
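The bracketed layout in the comment above is plain concatenation; a toy sketch for a single codebook row (real codes are [8, T] arrays, and the token IDs below are placeholders, not the real tokenizer's):

```python
# Illustrative only: assemble the voice-cloning input layout.
MASK_ID = 1024  # placeholder mask-token ID

def build_clone_input(style_tokens, text_tokens, ref_tokens, target_frames):
    """[style | text | ref_audio | MASK x target_frames], per the layout above."""
    return style_tokens + text_tokens + ref_tokens + [MASK_ID] * target_frames

seq = build_clone_input([1], [10, 11, 12], [500, 501], target_frames=4)
print(seq)  # [1, 10, 11, 12, 500, 501, 1024, 1024, 1024, 1024]
```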

Critical: Bidirectional Attention

⚠️ The backbone LLM MUST use bidirectional (full) attention, NOT causal:

attention_mask = torch.ones(B, 1, S, S, dtype=torch.bool)  # ALL True

Using causal attention produces gibberish output.
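A NumPy stand-in makes the difference concrete: a causal mask zeroes out everything above the diagonal, starving each frame of the right-context that a parallel denoiser relies on:

```python
import numpy as np

S = 4  # toy sequence length
full_mask = np.ones((S, S), dtype=bool)             # what OmniVoice requires
causal_mask = np.tril(np.ones((S, S), dtype=bool))  # what a decoder LLM normally uses

# Under the causal mask, position 0 attends only to itself, so later
# audio frames could never inform earlier ones within a step.
print(int(full_mask.sum()), int(causal_mask.sum()))  # 16 10
```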

Supported Modes

| Mode | Input | Description |
|---|---|---|
| Auto voice | text only | Model picks a random voice |
| Voice cloning | text + ref_audio + ref_text | Clone voice from reference |
| Voice design | text + instruct | Describe voice: "male, low pitch, British accent" |

Quantization Notes

  • INT8 weight-only: Saves ~384MB GPU but causes audible quality degradation due to error accumulation across 32 diffusion steps. Not recommended.
  • FP16: Best quality-to-size ratio. Mobile GPUs natively support FP16.
  • FP32: Used for export. Convert to FP16 at load time.
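The sizes quoted throughout follow directly from the 812.9M parameter count (weights only, ignoring small buffers and metadata); a quick check:

```python
# Model size = params x bytes per weight.
PARAMS = 812.9e6  # backbone 612.6M + decoder 200.3M

for name, bytes_per_weight in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    gb = PARAMS * bytes_per_weight / 1e9
    print(f"{name}: {gb:.2f} GB")
# fp16 gives ~1.63 GB (the ~1.6GB above); int8 gives ~0.81 GB (the ~800MB above)
```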

Conversion to .tflite

The .pt2 files can be converted to LiteRT format. See CONVERSION_GUIDE.md for full step-by-step instructions.

Requirements: x86 Linux machine (litert-torch needs ai-edge-tensorflow, no aarch64 wheels).

# On x86 machine
import litert_torch
edge_model = litert_torch.convert(backbone.eval(), sample_inputs)
edge_model.export("omnivoice_backbone.tflite")

Alternative: Use the DigitalOcean Droplet or any x86 cloud VM. Conversion is lossless: identical output to PyTorch.
