Mega-ASR: LiteRT (TFLite) dynamic INT4, block 32

LiteRT (formerly TFLite) deployment of the LLM portion of zhifeixie/Mega-ASR, converted via litert-torch (formerly ai-edge-torch) with the dynamic_int4 recipe at block size 32.

Targets Android, ChromeOS, embedded Linux, and any LiteRT runtime. Apple platforms get better results from the CoreML variant; NVIDIA GPUs get better results from NVFP4. This artifact exists for the LiteRT ecosystem.

What's in this repo

File | Size | Role
litert/mega-asr-llm_embeds_q4_block32_ekv1024.tflite | 975 MB | Main ASR artifact. Takes pre-computed inputs_embeds (audio embeds scattered at <|audio_pad|> positions). Signatures: prefill_512 + decode. KV cache external, max length 1024.
litert/mega-asr-llm_q4_block32_ekv512.tflite | 972 MB | Pure-text Qwen3-1.7B variant. Takes int32 token IDs; embed_tokens baked into the graph. Use this for LiteRT-LM bundling or pure-text Qwen3 inference. Not directly usable for ASR: the audio path needs external embedding scatter.
onnx/audio_encoder_fp32.onnx | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, runs via onnxruntime). No LiteRT port: the op-coverage gap is wider for the audio encoder than for the LLM, and the runtime savings would be marginal.
tokenizer/* | - | Qwen3-ASR tokenizer (<|audio_pad|>, <asr_text>, …)
examples/*.wav | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench
litert_convert_embeds.py | - | The conversion script (litert-torch + dynamic_int4 block 32)
inference_bench.py | - | End-to-end ASR pipeline + 8-clip VITW bench (used to produce the numbers below)

Quality (bench)

8-clip Voices-in-the-Wild-Bench agreement (1 − WER), with the prompt forced to English and the same ONNX fp32 audio encoder as the other backends. The LiteRT runtime uses the XNNPACK delegate (CPU on stallion x86_64; Android/ARM would run the same .tflite via the same delegate at lower latency-per-watt):

Per-sample | LiteRT (this repo) | ONNX GPTQ | MLX 8/4 | CoreML 8/4 | NVFP4
distortion | 100% | 100% | 100% | 100% | 100%
dropout | 100% | 100% | 100% | 100% | 100%
echo (hard, reverb) | 58.8% | 82.4% | 64.7% | 64.7% | 64.7%
far_field | 100% | 100% | 100% | 100% | 100%
mixed | 100% | 100% | 100% | 100% | 100%
noise | 100% | 100% | 100% | 100% | 100%
obstructed | 94.1% | 100% | 94.1% | 100% | 100%
recording (hard, truncated) | 33.3% | 60.0% | 60.0% | 60.0% | 66.7%
AVERAGE | 85.8% | 92.7% | 92.2% | 90.6% | 91.4%

LiteRT lands ~5-7 pts behind the other 4-bit backends on average, with the losses concentrated on the two hard clips:

  • recording (truncated audio): 33.3%. Dynamic-range INT4 quantization with no calibration is the harshest setting here, and the truncated-audio decode pattern (where context-aware confidence really matters) is the most sensitive to weight precision loss. AWQ-style activation-aware scaling (NVFP4) or per-column GPTQ (ONNX) recover ~27-33 pts on this clip; pure dynamic-range quant doesn't have the headroom.
  • echo (heavy reverb): 58.8%. Same pattern, smaller gap (~6 pts vs MLX/CoreML/NVFP4, ~24 pts vs ONNX GPTQ).

Five of the six easier clips score 100% and obstructed scores 94.1% (the same as MLX), so the quantization is fine for the easy cases; it's the hard cases where the activation/weight precision interaction kicks in.
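The agreement figure used in the table is 1 − WER. As a minimal sketch of that metric, assuming plain whitespace tokenization and no text normalization (the bench script may lowercase or strip punctuation first; that is not shown here), word error rate is the word-level edit distance divided by the reference length. The function names and the floor at zero are illustrative choices, not taken from inference_bench.py.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """1 - WER, floored at 0 (assumption: WER above 1 is reported as 0% agreement)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

As a purely hypothetical illustration of the scale of these numbers, a 17-word reference with 7 word errors gives an agreement of 10/17, which rounds to 58.8%.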

How LiteRT dynamic INT4 block 32 works (quick)

  • Weights are stored as INT4 symmetric, one scale per block of 32 consecutive elements along the input dimension (so a Linear layer with in_features=2048 has 64 scales per output row).
  • Scales are FP32 (no second-level FP8 like NVFP4; LiteRT's runtime uses XNNPACK's INT4 GEMM kernel, which expects FP32 block scales).
  • At inference, the XNNPACK delegate dequantizes blocks on-the-fly into FP32 / FP16 (depending on accumulator) and runs the GEMM. Commodity Android SoCs have no native INT4 tensor cores; the win is memory bandwidth (4 bits/weight plus 1 bit/weight of FP32 scale overhead at block size 32, vs 16 bits/weight).
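The storage scheme in the bullets above can be sketched in NumPy. This is an illustration of symmetric block-wise INT4 quantization in general, not the litert-torch or XNNPACK implementation: the [-7, 7] code range, the max-abs scale rule, and the function names are assumptions made for the example.

```python
import numpy as np

BLOCK = 32  # block size along the input dimension, as in this repo's recipe


def quantize_blockwise_int4(w: np.ndarray, block: int = BLOCK):
    """Symmetric INT4 quantization with one FP32 scale per block of inputs."""
    out_features, in_features = w.shape
    assert in_features % block == 0, "in_features must be a multiple of block"
    blocks = w.astype(np.float32).reshape(out_features, in_features // block, block)
    # Map the largest magnitude in each block to code 7 (assumed range [-7, 7]).
    max_abs = np.abs(blocks).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)
    # int8 is only the container here; each code fits in 4 bits.
    q = np.clip(np.rint(blocks / scales), -7, 7).astype(np.int8)
    return q, scales


def dequantize_blockwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Rebuild an FP32 weight matrix from codes and per-block scales."""
    out_features, n_blocks, block = q.shape
    return (q.astype(np.float32) * scales).reshape(out_features, n_blocks * block)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 2048)).astype(np.float32)  # toy Linear weight
q, scales = quantize_blockwise_int4(w)
w_hat = dequantize_blockwise(q, scales)

scales_per_row = scales.shape[1]   # 2048 / 32 = 64
bits_per_weight = 4 + 32 / BLOCK   # 4-bit code + FP32 scale shared by 32 weights
```

With FP32 scales and a block size of 32, the effective storage cost works out to 5 bits per weight, and the per-element rounding error is bounded by half of that block's scale.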

"Dynamic" means activations stay FP32; only the weights are quantized. This trades some quality for simplicity (no calibration needed) and ops coverage (every Linear op the LiteRT runtime knows can take INT4 weights, no special calibrated activation observers required).

Inference

pip install ai-edge-litert onnxruntime transformers safetensors soundfile librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-litert
cd mega-asr-litert
python inference_bench.py \
    --tflite litert/mega-asr-llm_embeds_q4_block32_ekv1024.tflite \
    --encoder onnx/audio_encoder_fp32.onnx \
    --examples-dir examples

The bench script runs the full ASR pipeline:

  1. Audio file → mel-spectrogram (AutoFeatureExtractor).
  2. mel → ONNX audio encoder → audio embeddings of shape (T_audio, 2048).
  3. Build prompt with <|audio_pad|> expanded to T_audio positions; tokenize the chat-style prefix.
  4. Look up text token embeddings from the embed_tokens weight in the accompanying safetensors (by design, this is the embedding layer that is not baked into the inputs_embeds variant).
  5. Scatter audio embeds at the <|audio_pad|> positions in the embed sequence.
  6. Pad to 512 and call the prefill_512 signature; KV cache is external (28 layers × 2 tensors (K and V) per layer, float32, each of shape (1, 1024, 8, 128)).
  7. Greedy decode loop on the decode signature until EOS (151645).
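Steps 4 to 6 above can be sketched with NumPy alone. This is not the code in inference_bench.py: the helper names, the zero padding, and the toy vocabulary (including the placeholder AUDIO_PAD ID) are assumptions for illustration; only the shapes come from the steps listed here.

```python
import numpy as np

HIDDEN = 2048      # embedding width from step 2
PREFILL_LEN = 512  # fixed length of the prefill_512 signature
N_LAYERS, KV_LEN, KV_HEADS, HEAD_DIM = 28, 1024, 8, 128  # from step 6


def build_prefill_embeds(token_ids, embed_table, audio_embeds, audio_pad_id):
    """Embed the prompt, scatter audio embeds, and pad to PREFILL_LEN."""
    token_ids = np.asarray(token_ids)
    n_valid = len(token_ids)
    if n_valid > PREFILL_LEN:
        raise ValueError("prompt is longer than the prefill window")
    embeds = embed_table[token_ids].astype(np.float32)  # step 4: (T, HIDDEN)
    pad_positions = np.flatnonzero(token_ids == audio_pad_id)
    if len(pad_positions) != len(audio_embeds):
        raise ValueError("number of audio-pad tokens must equal T_audio")
    embeds[pad_positions] = audio_embeds                # step 5: scatter
    # Step 6: zero padding is an assumption; the real script may pad differently.
    padded = np.zeros((1, PREFILL_LEN, HIDDEN), dtype=np.float32)
    padded[0, :n_valid] = embeds
    return padded, n_valid


def empty_kv_cache():
    """One zeroed K tensor and one zeroed V tensor per layer."""
    shape = (1, KV_LEN, KV_HEADS, HEAD_DIM)
    return [(np.zeros(shape, np.float32), np.zeros(shape, np.float32))
            for _ in range(N_LAYERS)]


# Toy usage: 8-entry vocabulary; AUDIO_PAD is a placeholder, not the real token ID.
AUDIO_PAD = 3
rng = np.random.default_rng(0)
embed_table = rng.standard_normal((8, HIDDEN)).astype(np.float32)
audio = rng.standard_normal((4, HIDDEN)).astype(np.float32)
prompt = [1, 2, AUDIO_PAD, AUDIO_PAD, AUDIO_PAD, AUDIO_PAD, 5]

inputs_embeds, n_valid = build_prefill_embeds(prompt, embed_table, audio, AUDIO_PAD)
kv_bytes = sum(k.nbytes + v.nbytes for k, v in empty_kv_cache())
```

At these shapes the external KV cache comes to 56 float32 tensors of 4 MiB each, or 224 MiB in total, which is worth budgeting for on memory-constrained devices.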

Conversion details

import torch

from litert_torch._convert import interface as converter_utils
from litert_torch.generative.examples.qwen import qwen3
from litert_torch.generative.layers import kv_cache as kv_utils
from litert_torch.generative.quantize import quant_attrs, quant_recipes

model = qwen3.build_1_7b_model(
    checkpoint_path="Mega-ASR-LLM/",  # safetensors of the merged LLM portion
    mask_cache_size=1024,
)

class EmbedsForward(torch.nn.Module):
    """Bypass tok_embedding: the public forward takes inputs_embeds instead of tokens."""
    def __init__(self, inner):
        super().__init__()
        self.m, self.cfg = inner, inner.config
    def forward(self, inputs_embeds, input_pos, kv_cache):
        attn = self.cfg.block_config(0).attn_config
        n_elem = int(attn.rotary_percentage * attn.head_dim)
        rope = self.cfg.build_rope(input_pos, n_elem, attn.rotary_base)
        mask = self.m.mask_cache.index_select(2, input_pos)
        mask = mask[:, :, :, :kv_cache.get_max_seq_len()]
        return self.m._forward_with_embeds(
            inputs_embeds, rope, mask, input_pos, kv_cache, None, None
        )

wrapper = EmbedsForward(model).eval()
qcfg = quant_recipes.full_dynamic_recipe(
    mcfg=model.config,
    weight_dtype=quant_attrs.Dtype.INT4,
    granularity=quant_attrs.Granularity.BLOCKWISE_32,
)

converter = converter_utils.Converter()
converter.add_signature("prefill_512", wrapper, sample_kwargs={
    "inputs_embeds": torch.zeros(1, 512, 2048), "input_pos": torch.arange(512, dtype=torch.int),
    "kv_cache": kv_utils.KVCache.from_model_config(1024, model.config),
})
converter.add_signature("decode", wrapper, sample_kwargs={
    "inputs_embeds": torch.zeros(1, 1, 2048), "input_pos": torch.tensor([0], dtype=torch.int),
    "kv_cache": kv_utils.KVCache.from_model_config(1024, model.config),
})
litert_model = converter.convert(quant_config=qcfg)
litert_model.export("mega-asr-llm_embeds_q4_block32_ekv1024.tflite")

Conversion takes ~2 minutes on an x86 box but requires ~30 GB RAM peak (we added 24 GB swap on top of the host's 32 GB physical RAM to survive the "Write Model to Bytes" step; the intermediate fp32 model size is ~6.4 GB and the flatbuffer writer needs ~3-4× that in headroom).

Why two .tflite files?

The default litert-torch convert_v3_to_tflite script produces a tflite whose prefill signature takes int32 token IDs, so embed_tokens lives inside the graph. That's fine for pure text generation (and is what LiteRT-LM expects). But ASR needs to inject audio embeddings at the <|audio_pad|> positions before the LLM, which requires bypassing the internal embed layer.

So we ship both:

  • mega-asr-llm_embeds_q4_block32_ekv1024.tflite: the ASR artifact. Takes pre-computed inputs_embeds (float32, shape (1, 512, 2048) for prefill, (1, 1, 2048) for decode). The bench above runs against this file.
  • mega-asr-llm_q4_block32_ekv512.tflite: the pure-text artifact. Same Mega-ASR LLM weights, same INT4 quantization, but takes int32 token IDs. Use this if you want to drop the LLM into a LiteRT-LM bundle or run Qwen3-style chat generation without the audio encoder.
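One practical consequence of the embeds variant is that the decode loop also has to do its own embedding lookup: each generated token ID is mapped through the embed_tokens table before the next decode call. The sketch below shows that loop shape under stated assumptions: `decode_step` is a hypothetical callable standing in for the decode signature (with KV-cache handling hidden inside it), and the scripted stub exists only so the example runs without the model.

```python
import numpy as np

EOS_ID = 151645  # EOS token ID quoted in the inference steps


def greedy_decode(decode_step, embed_table, first_token, start_pos,
                  eos_id=EOS_ID, max_len=1024):
    """Embed the last token, run one decode step, take the argmax, repeat.

    decode_step(inputs_embeds, input_pos) is a hypothetical stand-in for the
    decode signature and must return logits of shape (vocab,).
    """
    tokens, token, pos = [], first_token, start_pos
    while token != eos_id and pos < max_len:
        tokens.append(token)
        # The decode signature expects float32 embeddings of shape (1, 1, hidden).
        embeds = embed_table[token].reshape(1, 1, -1).astype(np.float32)
        logits = decode_step(embeds, np.array([pos], dtype=np.int32))
        token = int(np.argmax(logits))
        pos += 1
    return tokens


def make_scripted_step(script, vocab):
    """Test stub: emits one-hot logits for a fixed token script."""
    it = iter(script)

    def step(inputs_embeds, input_pos):
        assert inputs_embeds.shape[:2] == (1, 1)
        logits = np.zeros(vocab, dtype=np.float32)
        logits[next(it)] = 1.0
        return logits

    return step


# Toy run: vocabulary of 8, token 0 plays the role of EOS.
toy_table = np.eye(8, 16, dtype=np.float32)
decoded = greedy_decode(make_scripted_step([5, 6, 0], vocab=8), toy_table,
                        first_token=4, start_pos=7, eos_id=0)
```

In a real pipeline `first_token` would come from the argmax of the prefill logits and `start_pos` from the unpadded prompt length; both are left to the caller here.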

Companion repos

Credits
