How to use littlebearlabs/Mega-ASR-MLX-int8 with MLX:
# Download the model from the Hub
pip install huggingface_hub[hf_xet]
huggingface-cli download --local-dir Mega-ASR-MLX-int8 littlebearlabs/Mega-ASR-MLX-int8
Mega-ASR-MLX-int8
An int8 affine-quantized MLX build of Mega-ASR, derived from
mlx-community/Mega-ASR-MLX-bf16.
Produced for the witness native loader (mlx-mega-asr), which loads the packed
int8 weights directly, with no runtime quantization, so the smaller weights are
also the smaller download (~3.8 GB → ~2.2 GB).
Mega-ASR is a robustness layer over Qwen3-ASR-1.7B: a tiny audio-quality router classifies each utterance as clean or degraded and switches a dense LoRA adapter in/out of the base weights at inference.
What is and isn't quantized
The 344 linear projections of the audio encoder + text decoder (q/k/v/o_proj,
fc1/fc2, mlp.{gate,up,down}_proj, conv_out, proj1/proj2, and the tied
embed_tokens) are affine-quantized to int8 with group_size 64. Each
<name>.weight is a packed uint32 tensor plus <name>.scales / .biases,
and config.json carries a quantization block.
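To make the storage layout concrete, here is a minimal pure-Python sketch of group-wise affine quantization with four 8-bit codes packed into each 32-bit word. It is an illustration under assumptions, not the loader's code: the rounding rule, the low-byte-first packing order, and the 16-bit width of each scale and bias are all assumed, and every function name is hypothetical.

```python
# Illustrative sketch only: group-wise affine int8 quantization in pure Python.
# Rounding, bit order, and 16-bit scales/biases are assumptions, not the loader's code.

GROUP_SIZE = 64  # group_size from the model card
BITS = 8         # int8


def quantize_group(values):
    """Map one group of floats to 8-bit codes plus a (scale, bias) pair."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2**BITS - 1) or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def pack_uint32(codes):
    """Pack four 8-bit codes into each 32-bit word, lowest byte first (assumed order)."""
    words = []
    for i in range(0, len(codes), 4):
        word = 0
        for j, code in enumerate(codes[i:i + 4]):
            word |= code << (8 * j)
        words.append(word)
    return words


def unpack_uint32(words):
    """Inverse of pack_uint32."""
    return [(w >> (8 * j)) & 0xFF for w in words for j in range(4)]


def dequantize_group(words, scale, bias):
    """Recover approximate floats: w ~= scale * code + bias."""
    return [scale * code + bias for code in unpack_uint32(words)]


def bits_per_weight(group_size=GROUP_SIZE, bits=BITS, meta_bits=16):
    """Effective cost per weight, assuming one 16-bit scale and bias per group."""
    return bits + 2 * meta_bits / group_size


weights = [((i * 37) % 101 - 50) / 50.0 for i in range(GROUP_SIZE)]  # toy data
codes, scale, bias = quantize_group(weights)
packed = pack_uint32(codes)  # 64 codes -> 16 words
restored = dequantize_group(packed, scale, bias)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Under these assumptions each quantized weight costs 8.5 bits instead of 16, so the quantized tensors alone shrink by roughly 1.9×. The whole-model ratio is somewhat lower because some tensors stay dense.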
Everything that must stay precise stays dense bf16: the conv2d subsampling
frontend, all layer norms / biases, the per-head q/k norms, and, critically,
the router and LoRA adapter in extras/. The runtime applies the fp32
LoRA deltas on top of the dequantized base, so the per-utterance router/LoRA
robustness switching is fully preserved. int8 is the deliberate default on
Apple Silicon: batch-1 decode is memory-bandwidth-bound, so int8 is faster
and ~1.8× smaller while staying WER-neutral.
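The router-gated LoRA path described above can be sketched as follows. This is a toy illustration under assumptions: the card does not say whether the runtime merges the delta into the weights or adds it to the activations, and the function names, the shapes, and the conventional alpha / rank scaling are hypothetical.

```python
# Sketch under assumptions: names, shapes, and alpha / rank scaling are illustrative.

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, base_weight, lora_a, lora_b, alpha, degraded):
    """Return W x, plus (alpha / r) * B (A x) only when the router flags the utterance."""
    y = matvec(base_weight, x)  # base_weight stands in for a dequantized int8 projection
    if not degraded:
        return y  # clean audio: base model only
    rank = len(lora_a)
    delta = matvec(lora_b, matvec(lora_a, x))  # low-rank update kept in full precision
    return [yi + (alpha / rank) * di for yi, di in zip(y, delta)]


base_weight = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 toy projection
lora_a = [[1.0, 1.0]]                   # shape (rank, in_features), rank 1
lora_b = [[0.5], [-0.5]]                # shape (out_features, rank)
x = [2.0, 4.0]

clean_out = lora_linear(x, base_weight, lora_a, lora_b, alpha=1.0, degraded=False)
degraded_out = lora_linear(x, base_weight, lora_a, lora_b, alpha=1.0, degraded=True)
```

Because the delta is computed separately from the quantized base, switching the adapter per utterance never requires requantizing the base weights.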
Validation
WER parity vs the bf16 reference is gated by
mlx-mega-asr/examples/int8_prepack_parity.rs (LibriSpeech test-clean): the
pre-packed int8 path must match runtime int8 transcript-for-transcript and stay
within ~0.3% WER of dense bf16. Measured (20 files, Apple Silicon): bf16 1.81% /
int8 1.59% WER, 0/20 transcripts differ from runtime int8, RTF 0.043 → 0.030.
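For readers who want to see what such a gate checks, here is a minimal Python sketch of a corpus-level WER computation and the two parity conditions. It is not the Rust example named above. The function names are hypothetical, the toy transcripts are made up, and reading "within ~0.3% WER" as an absolute difference of 0.003 is an assumption.

```python
# Hypothetical sketch of a WER parity gate; not the mlx-mega-asr implementation.

def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1], len(ref)


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    pairs = [word_errors(r, h) for r, h in zip(references, hypotheses)]
    return sum(e for e, _ in pairs) / sum(n for _, n in pairs)


def parity_ok(refs, bf16, int8_runtime, int8_prepacked, tolerance=0.003):
    """Pre-packed int8 must equal runtime int8 and stay close to bf16 in WER."""
    same_transcripts = int8_prepacked == int8_runtime
    wer_gap = abs(corpus_wer(refs, int8_prepacked) - corpus_wer(refs, bf16))
    return same_transcripts and wer_gap <= tolerance


refs = ["the cat sat on the mat", "hello world"]   # toy data, not LibriSpeech
bf16 = ["the cat sat on a mat", "hello world"]
int8 = ["the cat sat on the mat", "hello word"]
gate_passes = parity_ok(refs, bf16, int8, int8)
```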
The repo ships vocab.json + merges.txt (the Qwen2 BPE tokenizer is built
from them at load; there is no tokenizer.json).
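As a rough illustration of how those two files work together, the sketch below applies a ranked merge list (the role of merges.txt) and then looks up ids in a vocabulary (the role of vocab.json). It is a toy under assumptions: the merges and vocabulary are invented, the function name is hypothetical, and the byte-level pre-tokenization that a real Qwen2 tokenizer performs is omitted.

```python
# Toy BPE sketch; the merges and vocab are invented, and byte-level handling is omitted.

def bpe_encode(word, merge_ranks, vocab):
    """Repeatedly apply the lowest-ranked merge, then map symbols to ids."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merge left
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [vocab[s] for s in symbols]


merges = ["l o", "lo w", "e r"]  # stand-in for lines of merges.txt
merge_ranks = {tuple(line.split()): rank for rank, line in enumerate(merges)}
vocab = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}

token_ids = bpe_encode("lower", merge_ranks, vocab)  # "low" + "er"
```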
Model tree for littlebearlabs/Mega-ASR-MLX-int8
Base model
Qwen/Qwen3-ASR-1.7B