+---
+license: apache-2.0
+language:
+- en
+- zh
+- ja
+- ko
+- multilingual
+library_name: ai-edge-litert
+tags:
+- litert
+- ai-edge-litert
+- litert-torch
+- tflite
+- quantized
+- dynamic-int4
+- block32
+- qwen3
+- qwen3-asr
+- mega-asr
+- automatic-speech-recognition
+- asr
+- speech-recognition
+- robust-asr
+- android
+- mobile
+- edge
+base_model: zhifeixie/Mega-ASR
+base_model_relation: quantized
+---
+# Mega-ASR — LiteRT (TFLite) dynamic INT4, block 32
+[LiteRT](https://ai.google.dev/edge/litert) (formerly TFLite) deployment of
+the LLM portion of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
+converted via [`litert-torch`](https://github.com/google-ai-edge/litert-torch)
+(formerly `ai-edge-torch`) with the `dynamic_int4` recipe at block size 32.
+**Targets Android, ChromeOS, embedded Linux, and any LiteRT runtime.**
+Apple platforms get better results from the CoreML variant; NVIDIA GPUs get
+better results from NVFP4 — this artifact exists for the LiteRT ecosystem.
+## What's in this repo
+| File | Size | Role |
+| --- | ---: | --- |
+| `litert/mega-asr-llm_embeds_q4_block32_ekv1024.tflite` | **975 MB** | Main ASR artifact. Takes pre-computed `inputs_embeds` (audio embeds scattered at `<\|audio_pad\|>` positions). Signatures: `prefill_512` + `decode`. KV cache external, max length 1024. |
+| `litert/mega-asr-llm_q4_block32_ekv512.tflite` | 972 MB | Pure-text Qwen3-1.7B variant. Takes int32 token IDs; embed_tokens baked into the graph. Use this for LiteRT-LM bundling or pure-text Qwen3 inference. **Not directly usable for ASR** — the audio path needs external embedding scatter. |
+| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, runs via onnxruntime). No LiteRT port: LiteRT's op-coverage gap is wider for the audio encoder than for the LLM, and the runtime savings would be marginal. |
+| `tokenizer/*` | — | Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, …) |
+| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
+| `litert_convert_embeds.py` | — | The conversion script (litert-torch + dynamic_int4 block 32) |
+| `inference_bench.py` | — | End-to-end ASR pipeline + 8-clip VITW bench (used to produce the numbers below) |
+## Quality (bench)
+Agreement (1 − WER) on 8 clips from
+[Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench),
+with the prompt forced to `language English` and the same ONNX fp32 audio
+encoder as the other backends. LiteRT ran through the XNNPACK delegate (CPU on
+stallion x86_64; Android/ARM devices run the same `.tflite` through the same
+delegate, likely at better performance per watt):
+| Per-sample | LiteRT (this repo) | ONNX GPTQ | MLX 8/4 | CoreML 8/4 | NVFP4 |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| distortion | 100% | 100% | 100% | 100% | 100% |
+| dropout | 100% | 100% | 100% | 100% | 100% |
+| echo (hard, reverb) | 58.8% | 82.4% | 64.7% | 64.7% | 64.7% |
+| far_field | 100% | 100% | 100% | 100% | 100% |
+| mixed | 100% | 100% | 100% | 100% | 100% |
+| noise | 100% | 100% | 100% | 100% | 100% |
+| obstructed | 94.1% | 100% | 94.1% | 100% | 100% |
+| recording (hard, truncated) | 33.3% | 60.0% | 60.0% | 60.0% | **66.7%** |
+| **AVERAGE** | **85.8%** | **92.7%** | **92.2%** | **90.6%** | **91.4%** |
+LiteRT lands ~5-7 pts behind the other 4-bit backends on average, with most of
+the loss concentrated on the two hard clips:
+- `recording` (truncated audio): 33.3%. Uncalibrated dynamic-range INT4 is the
+  harshest quantization setting compared here, and the truncated-audio decode
+  pattern (where context-aware confidence really matters) is the most
+  sensitive to weight precision loss. Per-column GPTQ (ONNX) and AWQ-style
+  activation-aware scaling (NVFP4) recover ~27-33 pts on this clip; pure
+  dynamic-range quantization doesn't have the headroom.
+- `echo` (heavy reverb): 58.8%. Same pattern, smaller gap (~6 pts vs
+  MLX/CoreML/NVFP4, ~24 pts vs ONNX GPTQ).
+The remaining clips are at or near 100% (`obstructed` drops to 94.1%), so the
+quantization holds up on the easier cases; the hard cases are where the
+activation/weight precision interaction shows.
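The agreement metric used in the table above is 1 − WER. A minimal sketch of that computation follows, assuming word-level Levenshtein distance over lowercased, whitespace-split text. The normalization actually applied by `inference_bench.py` is not shown here and may differ (for example in punctuation handling), and the function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 1) or match (cost 0)
            )
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """1 - WER, clamped at 0 because WER can exceed 1 when there are many insertions."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

Under these assumptions, a 17-word reference with one substituted word scores 16/17 ≈ 94.1%, and a clip falls to 0% once the hypothesis has at least as many errors as the reference has words.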
+## How LiteRT dynamic INT4 block 32 works (quick)
+- Weights are stored as **INT4 symmetric**, one scale per **block of 32**
+  consecutive elements along the input dimension (so a Linear layer with
+  in_features=2048 has 64 scales per output row).
+- Scales are **FP32** (no second-level FP8 like NVFP4; LiteRT's runtime
+  uses XNNPACK's INT4 GEMM kernel, which expects FP32 block scales).
+- At inference, the XNNPACK delegate dequantizes blocks on-the-fly into
+  FP32 / FP16 (depending on accumulator) and runs the GEMM. Commodity Android
+  SoCs have no native INT4 tensor cores, so the win is **memory bandwidth**:
+  4 bits/weight plus 1 bit/weight of scale overhead (one FP32 scale per 32
+  weights) vs 16 bits/weight.
+"Dynamic" means activations stay FP32; only the weights are quantized. This
+trades some quality for simplicity (no calibration data needed) and op coverage
+(any Linear op the LiteRT runtime supports can take INT4 weights, with no
+calibrated activation observers required).
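To make the block layout above concrete, here is a NumPy sketch of symmetric INT4 quantization with one FP32 scale per block of 32 inputs. It is an illustration, not the LiteRT quantizer: mapping each block's largest magnitude to 7 and the function names are assumptions, and the two-values-per-byte packing is omitted.

```python
import numpy as np

BLOCK = 32  # block size along the input dimension


def quantize_int4_blockwise(w: np.ndarray, block: int = BLOCK):
    """Symmetric INT4 quantization with one FP32 scale per block of `block` inputs."""
    out_features, in_features = w.shape
    assert in_features % block == 0, "in_features must be a multiple of the block size"
    blocks = w.reshape(out_features, in_features // block, block)
    # Assumption: the largest magnitude in each block maps to 7 (so -8 goes unused).
    max_abs = np.abs(blocks).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.squeeze(-1)


def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Rebuild an FP32 weight matrix from INT4 values and per-block scales."""
    return (q.astype(np.float32) * scales[..., None]).reshape(q.shape[0], -1)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 2048)).astype(np.float32)
q, scales = quantize_int4_blockwise(w)
w_hat = dequantize(q, scales)

# A 4-bit value plus a 32-bit scale shared by 32 weights.
bits_per_weight = 4 + 32 / BLOCK
```

For `in_features=2048` this yields 64 scales per output row, and with FP32 scales the storage cost works out to 5 bits per weight, about 3.2× smaller than FP16. The per-element reconstruction error is bounded by half of the block's scale.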
+## Inference
+```bash
+pip install ai-edge-litert onnxruntime transformers safetensors soundfile librosa numpy
+git clone https://huggingface.co/Reza2kn/mega-asr-litert
+cd mega-asr-litert
+python inference_bench.py \
+    --tflite litert/mega-asr-llm_embeds_q4_block32_ekv1024.tflite \
+    --encoder onnx/audio_encoder_fp32.onnx \
+    --examples-dir examples
+```
+The bench script runs the full ASR pipeline:
+1. Audio file → mel-spectrogram (`AutoFeatureExtractor`).
+2. mel → ONNX audio encoder → audio embeddings of shape `(T_audio, 2048)`.
+3. Build prompt with `<|audio_pad|>` expanded to `T_audio` positions; tokenize
+   the chat-style prefix.
+4. Look up text token embeddings from the `embed_tokens` weight in the
+   accompanying safetensors (this is the embedding layer that's **not** baked
+   into the inputs_embeds variant — by design).
+5. Scatter audio embeds at the `<|audio_pad|>` positions in the embed sequence.
+6. Pad to 512 and call the `prefill_512` signature; KV cache is external
+   (28 layers × 2 K/V × float32, shape `(1, 1024, 8, 128)`).
+7. Greedy decode loop on the `decode` signature until EOS (`151645`).
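Steps 4 to 7 can be sketched in NumPy as follows. This is not the code in `inference_bench.py`: `AUDIO_PAD_ID` is a placeholder (read the real `<|audio_pad|>` ID from the tokenizer), zero-padding the unused prefill positions is an assumption, and `decode_step` is a hypothetical callable standing in for the `decode` signature together with the embedding lookup and KV cache update.

```python
import numpy as np

HIDDEN, PREFILL_LEN, EOS_ID = 2048, 512, 151645
AUDIO_PAD_ID = -1  # placeholder, NOT the real token ID; look it up in the tokenizer


def build_prefill_embeds(token_ids, embed_table, audio_embeds):
    """Steps 4-6: look up text embeddings, scatter audio embeddings, pad to 512."""
    token_ids = np.asarray(token_ids)
    is_audio = token_ids == AUDIO_PAD_ID
    if is_audio.sum() != len(audio_embeds):
        raise ValueError("need exactly one <|audio_pad|> slot per audio frame")
    if len(token_ids) > PREFILL_LEN:
        raise ValueError("prompt is longer than the prefill_512 signature allows")
    embeds = np.zeros((1, PREFILL_LEN, HIDDEN), dtype=np.float32)
    prompt = embeds[0, : len(token_ids)]  # view into the padded buffer
    prompt[~is_audio] = embed_table[token_ids[~is_audio]]  # text positions
    prompt[is_audio] = audio_embeds                        # audio positions
    return embeds, len(token_ids)


def greedy_decode(first_logits, decode_step, start_pos, max_len=1024):
    """Step 7: take the argmax token until EOS or until the KV cache is full.

    `first_logits` is assumed to be the prefill logits at the last real prompt
    position; `decode_step(token_id, pos)` returns the logits for the next token.
    """
    tokens, logits, pos = [], first_logits, start_pos
    while pos < max_len:
        token = int(np.argmax(logits))
        if token == EOS_ID:
            break
        tokens.append(token)
        logits = decode_step(token, pos)
        pos += 1
    return tokens


def kv_cache_bytes(layers=28, max_len=1024, kv_heads=8, head_dim=128, itemsize=4):
    """Size of the external cache: one K and one V tensor per layer."""
    return layers * 2 * max_len * kv_heads * head_dim * itemsize
```

With the shapes from step 6, each K or V tensor is 4 MiB in float32, so the external KV cache occupies 28 × 2 × 4 MiB = 224 MiB.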
+## Conversion details
+```python
+import torch
+from litert_torch._convert import interface as converter_utils
+from litert_torch.generative.examples.qwen import qwen3
+from litert_torch.generative.layers import kv_cache as kv_utils
+from litert_torch.generative.quantize import quant_attrs, quant_recipes
+model = qwen3.build_1_7b_model(
+    checkpoint_path="Mega-ASR-LLM/",  # safetensors of the merged LLM portion
+    mask_cache_size=1024,
+)
+class EmbedsForward(torch.nn.Module):
+    """Bypass tok_embedding — public forward takes inputs_embeds instead of tokens."""
+    def __init__(self, inner):
+        super().__init__()
+        self.m, self.cfg = inner, inner.config
+    def forward(self, inputs_embeds, input_pos, kv_cache):
+        attn = self.cfg.block_config(0).attn_config
+        n_elem = int(attn.rotary_percentage * attn.head_dim)
+        rope = self.cfg.build_rope(input_pos, n_elem, attn.rotary_base)
+        mask = self.m.mask_cache.index_select(2, input_pos)
+        mask = mask[:, :, :, :kv_cache.get_max_seq_len()]
+        return self.m._forward_with_embeds(
+            inputs_embeds, rope, mask, input_pos, kv_cache, None, None
+        )
+wrapper = EmbedsForward(model).eval()
+qcfg = quant_recipes.full_dynamic_recipe(
+    mcfg=model.config,
+    weight_dtype=quant_attrs.Dtype.INT4,
+    granularity=quant_attrs.Granularity.BLOCKWISE_32,
+)
+converter = converter_utils.Converter()
+converter.add_signature("prefill_512", wrapper, sample_kwargs={
+    "inputs_embeds": torch.zeros(1, 512, 2048), "input_pos": torch.arange(512, dtype=torch.int),
+    "kv_cache": kv_utils.KVCache.from_model_config(1024, model.config),
+})
+converter.add_signature("decode", wrapper, sample_kwargs={
+    "inputs_embeds": torch.zeros(1, 1, 2048), "input_pos": torch.tensor([0], dtype=torch.int),
+    "kv_cache": kv_utils.KVCache.from_model_config(1024, model.config),
+})
+litert_model = converter.convert(quant_config=qcfg)
+litert_model.export("mega-asr-llm_embeds_q4_block32_ekv1024.tflite")
+```
+Conversion takes ~2 minutes on an x86 box but requires ~30 GB RAM peak (we
+added 24 GB swap on top of the host's 32 GB physical RAM to survive the
+"Write Model to Bytes" step — the intermediate fp32 model size is ~6.4 GB
+and the flatbuffer writer needs ~3-4× that in headroom).
+## Why two .tflite files?
+The default `litert-torch` `convert_v3_to_tflite` script produces a tflite
+whose `prefill` signature takes int32 **token IDs** — `embed_tokens` lives
+inside the graph. That's fine for pure text generation (and is what LiteRT-LM
+expects). But ASR needs to inject **audio embeddings** at the `<|audio_pad|>`
+positions **before** the LLM, which requires bypassing the internal embed
+layer.
+So we ship both:
+- `mega-asr-llm_embeds_q4_block32_ekv1024.tflite` — the ASR artifact. Takes
+  pre-computed `inputs_embeds` (float32, shape `(1, 512, 2048)` for prefill,
+  `(1, 1, 2048)` for decode). The bench above runs against this file.
+- `mega-asr-llm_q4_block32_ekv512.tflite` — the pure-text artifact. Same
+  Mega-ASR LLM weights, same INT4 quantization, but takes int32 token IDs.
+  Use this if you want to drop the LLM into a LiteRT-LM bundle or run
+  Qwen3-style chat generation without the audio encoder.
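Because the two files differ in what their signatures accept, a quick check on the signature inputs can prevent loading the wrong one. The sketch below assumes the exported input name matches the `inputs_embeds` key used at conversion time. `load_signatures` relies on `Interpreter.get_signature_list()` from `ai-edge-litert`, and the example dictionaries are made up for illustration.

```python
def takes_embeddings(signature_list: dict) -> bool:
    """True if any signature exposes an `inputs_embeds` input (the ASR artifact)."""
    return any(
        "inputs_embeds" in sig.get("inputs", []) for sig in signature_list.values()
    )


def load_signatures(path: str) -> dict:
    """Return {signature_name: {"inputs": [...], "outputs": [...]}} for a .tflite file."""
    # Imported lazily so takes_embeddings() works without ai-edge-litert installed.
    from ai_edge_litert.interpreter import Interpreter

    return Interpreter(model_path=path).get_signature_list()


# Made-up signature listings, shaped like get_signature_list() output.
asr_like = {"prefill_512": {"inputs": ["inputs_embeds", "input_pos"], "outputs": ["logits"]}}
text_like = {"prefill": {"inputs": ["tokens", "input_pos"], "outputs": ["logits"]}}
```

In practice one would call `takes_embeddings(load_signatures(path))` before wiring a file into the ASR pipeline.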
+## Companion repos
+- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx) — full ONNX pipeline (GPTQ-INT4, 92.7%)
+- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) — MLX 4-bit (mixed 8/4, 92.2%)
+- [Reza2kn/mega-asr-nvfp4](https://huggingface.co/Reza2kn/mega-asr-nvfp4) — NVFP4 AWQ-Lite (Blackwell, 91.4%)
+- [Reza2kn/mega-asr-coreml](https://huggingface.co/Reza2kn/mega-asr-coreml) — CoreML 4-bit (mixed 8/4, 90.6%)
+- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench) — browser demo (WebGPU)
+## Credits
+- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
+- Conversion: [google-ai-edge/litert-torch](https://github.com/google-ai-edge/litert-torch) v0.6 (`dynamic_int4` block 32)
+- Runtime: [ai-edge-litert](https://pypi.org/project/ai-edge-litert/)
+- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)