Mega-ASR: LiteRT (TFLite) dynamic INT4, block 32
LiteRT (formerly TFLite) deployment of
the LLM portion of zhifeixie/Mega-ASR,
converted via litert-torch
(formerly ai-edge-torch) with the dynamic_int4 recipe at block size 32.
Targets Android, ChromeOS, embedded Linux, and any LiteRT runtime. Apple platforms get better results from the CoreML variant; NVIDIA GPUs get better results from NVFP4. This artifact exists for the LiteRT ecosystem.
What's in this repo
| File | Size | Role |
|---|---|---|
| `litert/mega-asr-llm_embeds_q4_block32_ekv1024.tflite` | 975 MB | Main ASR artifact. Takes pre-computed `inputs_embeds` (audio embeds scattered at `<\|audio_pad\|>` positions). Signatures: `prefill_512` + `decode`. KV cache external, max length 1024. |
| `litert/mega-asr-llm_q4_block32_ekv512.tflite` | 972 MB | Pure-text Qwen3-1.7B variant. Takes int32 token IDs; `embed_tokens` baked into the graph. Use this for LiteRT-LM bundling or pure-text Qwen3 inference. Not directly usable for ASR: the audio path needs external embedding scatter. |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, runs via onnxruntime). LiteRT port not done: the op coverage gap on the audio encoder is wider than on the LLM, and the runtime savings are marginal. |
| `tokenizer/*` | - | Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, …) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `litert_convert_embeds.py` | - | The conversion script (litert-torch + dynamic_int4 block 32) |
| `inference_bench.py` | - | End-to-end ASR pipeline + 8-clip VITW bench (used to produce the numbers below) |
Quality (bench)
8-clip Voices-in-the-Wild-Bench agreement (1 − WER), prompt forced to language English, run on the same ONNX fp32 audio encoder as the other backends. LiteRT runtime via XNNPACK delegate (CPU on stallion x86_64; Android/ARM would run the same .tflite via the same delegate at lower latency-per-watt):
| Per-sample | LiteRT (this repo) | ONNX GPTQ | MLX 8/4 | CoreML 8/4 | NVFP4 |
|---|---|---|---|---|---|
| distortion | 100% | 100% | 100% | 100% | 100% |
| dropout | 100% | 100% | 100% | 100% | 100% |
| echo (hard, reverb) | 58.8% | 82.4% | 64.7% | 64.7% | 64.7% |
| far_field | 100% | 100% | 100% | 100% | 100% |
| mixed | 100% | 100% | 100% | 100% | 100% |
| noise | 100% | 100% | 100% | 100% | 100% |
| obstructed | 94.1% | 100% | 94.1% | 100% | 100% |
| recording (hard, truncated) | 33.3% | 60.0% | 60.0% | 60.0% | 66.7% |
| AVERAGE | 85.8% | 92.7% | 92.2% | 90.6% | 91.4% |
LiteRT lands ~5-7 pts behind the other 4-bit backends, with the losses concentrated on the two hard clips:
- recording (truncated audio): 33.3%. Dynamic-range INT4 quantization with no calibration is the harshest setting here, and the truncated-audio decode pattern (where context-aware confidence really matters) is the most sensitive to weight precision loss. AWQ-style activation-aware scaling (NVFP4) or per-column GPTQ (ONNX) recover ~27-33 pts on this clip; pure dynamic-range quant doesn't have the headroom.
- echo (heavy reverb): 58.8%. Same pattern, smaller gap (~6 pts vs MLX/CoreML/NVFP4, ~24 pts vs ONNX GPTQ).

The remaining clips score 100%, apart from obstructed at 94.1% (the same as MLX). The quantization is fine for the easy cases; it's the hard cases where the activation/weight precision interaction kicks in.
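The agreement score in the table is 1 − WER. As a rough illustration of how such a score can be computed, here is a minimal sketch using word-level edit distance. The normalization (lowercasing and whitespace splitting) and the function names `word_errors` and `agreement` are assumptions made for this example; `inference_bench.py` may normalize text differently.

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def agreement(reference, hypothesis):
    """Return 1 - WER. Assumed normalization: lowercase, split on whitespace."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(ref, hyp) / len(ref)


# Illustrative transcripts, not taken from the benchmark clips.
score = agreement("turn the lights off in the kitchen",
                  "turn the light off in kitchen")
print(f"agreement: {score:.1%}")
```

Note that 1 − WER can go negative when the hypothesis contains many insertions; this sketch does not clamp the value.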
How LiteRT dynamic INT4 block 32 works (quick)
- Weights are stored as INT4 symmetric, one scale per block of 32 consecutive elements along the input dimension (so a Linear layer with in_features=2048 has 64 scales per output row).
- Scales are FP32 (no second-level FP8 like NVFP4; LiteRT's runtime uses XNNPACK's INT4 GEMM kernel, which expects FP32 block scales).
- At inference, the XNNPACK delegate dequantizes blocks on-the-fly into FP32 / FP16 (depending on accumulator) and runs the GEMM. There are no native INT4 tensor cores on commodity Android SoCs, so the win is memory bandwidth (4 bits/weight + tiny scale overhead vs 16 bits/weight).
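To make the storage layout concrete, the sketch below quantizes one weight row with a symmetric INT4 code and one scale per block of 32, then dequantizes it again. It is an illustration only: the choice to map the largest magnitude in each block to 7, the rounding mode, and the helper names are assumptions, not the exact arithmetic used by litert-torch or XNNPACK.

```python
import random

BLOCK = 32  # block size along the input dimension, as described above


def quantize_row(row, block=BLOCK):
    """Symmetric INT4 quantization of one weight row, one scale per block."""
    assert len(row) % block == 0, "row length must be a multiple of the block size"
    codes, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        max_abs = max(abs(w) for w in chunk)
        # Assumption: the largest magnitude maps to 7, so codes stay in [-7, 7].
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in chunk)
    return codes, scales


def dequantize_row(codes, scales, block=BLOCK):
    """Reconstruct approximate weights: code times the scale of its block."""
    return [q * scales[i // block] for i, q in enumerate(codes)]


def bits_per_weight(block=BLOCK, weight_bits=4, scale_bits=32):
    """Effective storage cost: 4-bit code plus an FP32 scale shared by a block."""
    return weight_bits + scale_bits / block


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(2048)]  # synthetic weights, in_features=2048
codes, scales = quantize_row(row)
restored = dequantize_row(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))

print("scales per row:", len(scales))
print("bits per weight:", bits_per_weight())
print("max abs reconstruction error:", max_err)
```

Under these assumptions the per-weight cost is 4 + 32/32 = 5 bits, roughly 3.2× smaller than 16-bit weights, and the reconstruction error of each weight is bounded by half the scale of its block.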
"Dynamic" means activations stay FP32; only the weights are quantized. This trades some quality for simplicity (no calibration needed) and ops coverage (every Linear op the LiteRT runtime knows can take INT4 weights, no special calibrated activation observers required).
Inference
pip install ai-edge-litert onnxruntime transformers safetensors soundfile librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-litert
cd mega-asr-litert
python inference_bench.py \
--tflite litert/mega-asr-llm_embeds_q4_block32_ekv1024.tflite \
--encoder onnx/audio_encoder_fp32.onnx \
--examples-dir examples
The bench script runs the full ASR pipeline:
- Audio file → mel-spectrogram (`AutoFeatureExtractor`).
- mel → ONNX audio encoder → audio embeddings of shape `(T_audio, 2048)`.
- Build the prompt with `<|audio_pad|>` expanded to `T_audio` positions; tokenize the chat-style prefix.
- Look up text token embeddings from the `embed_tokens` weight in the accompanying safetensors (by design, this embedding layer is not baked into the inputs_embeds variant).
- Scatter audio embeds at the `<|audio_pad|>` positions in the embed sequence.
- Pad to 512 and call the `prefill_512` signature; the KV cache is external (28 layers × 2 K/V tensors, float32, each of shape `(1, 1024, 8, 128)`).
- Greedy decode loop on the `decode` signature until EOS (151645).
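The scatter and padding steps are the part that differs from ordinary text generation, so here is a small pure-Python sketch of them. It uses a toy hidden size of 4 instead of 2048, a placeholder `AUDIO_PAD_ID`, and zero-padding; all three are assumptions for illustration, and the helper names are hypothetical rather than taken from `inference_bench.py`.

```python
AUDIO_PAD_ID = -1   # hypothetical placeholder; look up the real <|audio_pad|> ID in the tokenizer
PREFILL_LEN = 512   # fixed sequence length of the prefill_512 signature


def scatter_audio_embeds(token_ids, text_embeds, audio_embeds, pad_id=AUDIO_PAD_ID):
    """Replace the embedding at each audio-pad position with the next audio frame."""
    n_slots = sum(1 for t in token_ids if t == pad_id)
    if n_slots != len(audio_embeds):
        raise ValueError(f"{n_slots} pad slots but {len(audio_embeds)} audio frames")
    frames = iter(audio_embeds)
    return [next(frames) if t == pad_id else emb
            for t, emb in zip(token_ids, text_embeds)]


def pad_to_length(embeds, length=PREFILL_LEN):
    """Right-pad the sequence with zero vectors (assumed padding value)."""
    if len(embeds) > length:
        raise ValueError(f"sequence of {len(embeds)} exceeds prefill length {length}")
    hidden = len(embeds[0])
    return embeds + [[0.0] * hidden for _ in range(length - len(embeds))]


HIDDEN = 4  # toy size; the real model uses 2048
token_ids = [10, AUDIO_PAD_ID, AUDIO_PAD_ID, AUDIO_PAD_ID, 11]
text_embeds = [[float(t)] * HIDDEN for t in token_ids]   # stand-in for the embed_tokens lookup
audio_embeds = [[0.1] * HIDDEN, [0.2] * HIDDEN, [0.3] * HIDDEN]  # stand-in for encoder output

merged = scatter_audio_embeds(token_ids, text_embeds, audio_embeds)
padded = pad_to_length(merged)
print(len(merged), len(padded))
```

A caller would still need to remember the unpadded length (5 here) so that decoding continues from the right position; how the bench script tracks that is not shown in this sketch.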
Conversion details
import torch

from litert_torch._convert import interface as converter_utils
from litert_torch.generative.examples.qwen import qwen3
from litert_torch.generative.layers import kv_cache as kv_utils
from litert_torch.generative.quantize import quant_attrs, quant_recipes

model = qwen3.build_1_7b_model(
    checkpoint_path="Mega-ASR-LLM/",  # safetensors of the merged LLM portion
    mask_cache_size=1024,
)


class EmbedsForward(torch.nn.Module):
    """Bypass tok_embedding: public forward takes inputs_embeds instead of tokens."""

    def __init__(self, inner):
        super().__init__()
        self.m, self.cfg = inner, inner.config

    def forward(self, inputs_embeds, input_pos, kv_cache):
        # Rebuild RoPE and the attention mask the same way the token-ID forward does.
        attn = self.cfg.block_config(0).attn_config
        n_elem = int(attn.rotary_percentage * attn.head_dim)
        rope = self.cfg.build_rope(input_pos, n_elem, attn.rotary_base)
        mask = self.m.mask_cache.index_select(2, input_pos)
        mask = mask[:, :, :, :kv_cache.get_max_seq_len()]
        return self.m._forward_with_embeds(
            inputs_embeds, rope, mask, input_pos, kv_cache, None, None
        )


wrapper = EmbedsForward(model).eval()

# Weight-only INT4 with one scale per block of 32.
qcfg = quant_recipes.full_dynamic_recipe(
    mcfg=model.config,
    weight_dtype=quant_attrs.Dtype.INT4,
    granularity=quant_attrs.Granularity.BLOCKWISE_32,
)

converter = converter_utils.Converter()
converter.add_signature("prefill_512", wrapper, sample_kwargs={
    "inputs_embeds": torch.zeros(1, 512, 2048),
    "input_pos": torch.arange(512, dtype=torch.int),
    "kv_cache": kv_utils.KVCache.from_model_config(1024, model.config),
})
converter.add_signature("decode", wrapper, sample_kwargs={
    "inputs_embeds": torch.zeros(1, 1, 2048),
    "input_pos": torch.tensor([0], dtype=torch.int),
    "kv_cache": kv_utils.KVCache.from_model_config(1024, model.config),
})

litert_model = converter.convert(quant_config=qcfg)
litert_model.export("mega-asr-llm_embeds_q4_block32_ekv1024.tflite")
Conversion takes ~2 minutes on an x86 box but requires ~30 GB RAM peak (we added 24 GB swap on top of the host's 32 GB physical RAM to survive the "Write Model to Bytes" step; the intermediate fp32 model size is ~6.4 GB and the flatbuffer writer needs ~3-4× that in headroom).
Why two .tflite files?
The default litert-torch convert_v3_to_tflite script produces a tflite whose prefill signature takes int32 token IDs, with embed_tokens living inside the graph. That's fine for pure text generation (and is what LiteRT-LM expects). But ASR needs to inject audio embeddings at the <|audio_pad|> positions before the LLM, which requires bypassing the internal embed layer.
So we ship both:
- `mega-asr-llm_embeds_q4_block32_ekv1024.tflite`: the ASR artifact. Takes pre-computed `inputs_embeds` (float32, shape `(1, 512, 2048)` for prefill, `(1, 1, 2048)` for decode). The bench above runs against this file.
- `mega-asr-llm_q4_block32_ekv512.tflite`: the pure-text artifact. Same Mega-ASR LLM weights, same INT4 quantization, but takes int32 token IDs. Use this if you want to drop the LLM into a LiteRT-LM bundle or run Qwen3-style chat generation without the audio encoder.
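Because the embeds variant keeps both the input embeddings and the KV cache outside the graph, the caller owns those buffers. The following back-of-the-envelope sketch estimates their size from the shapes quoted in this README (28 layers, one K and one V tensor per layer, float32). It only counts these tensors; any scratch memory the runtime allocates on top is not modeled.

```python
from math import prod

FLOAT32_BYTES = 4
MIB = 1024 * 1024

PREFILL_EMBEDS = (1, 512, 2048)   # inputs_embeds for prefill_512
DECODE_EMBEDS = (1, 1, 2048)      # inputs_embeds for decode
KV_ENTRY = (1, 1024, 8, 128)      # one K or V tensor for one layer
NUM_LAYERS = 28


def tensor_bytes(shape, itemsize=FLOAT32_BYTES):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * itemsize


prefill_bytes = tensor_bytes(PREFILL_EMBEDS)
decode_bytes = tensor_bytes(DECODE_EMBEDS)
kv_bytes = NUM_LAYERS * 2 * tensor_bytes(KV_ENTRY)  # K and V for every layer

print(f"prefill inputs_embeds: {prefill_bytes / MIB:.1f} MiB")
print(f"decode inputs_embeds:  {decode_bytes / 1024:.1f} KiB")
print(f"external KV cache:     {kv_bytes / MIB:.1f} MiB")
```

With these shapes the KV cache comes to 224 MiB, which is small next to the 975 MB model file but still worth budgeting for on memory-constrained devices.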
Companion repos
- Reza2kn/mega-asr-onnx: full ONNX pipeline (GPTQ-INT4, 92.7%)
- Reza2kn/mega-asr-mlx: MLX 4-bit (mixed 8/4, 92.2%)
- Reza2kn/mega-asr-nvfp4: NVFP4 AWQ-Lite (Blackwell, 91.4%)
- Reza2kn/mega-asr-coreml: CoreML 4-bit (mixed 8/4, 90.6%)
- Reza2kn/mega-asr-bench: browser demo (WebGPU)
Credits
- Original model: zhifeixie/Mega-ASR (1.7B, Apache-2.0)
- Conversion: google-ai-edge/litert-torch v0.6 (dynamic_int4 block 32)
- Runtime: ai-edge-litert
- Benchmark: Voices-in-the-Wild-Bench