---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- 8bit
- mixed-precision
- lut
- palettize
- on-device
- apple-silicon
- ios
- macos
- qwen3
- qwen3-asr
- mega-asr
- anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
# Mega-ASR: CoreML mixed 8/4 (end-to-end ASR)
CoreML deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR)
(Qwen3-ASR-1.7B base) with an **`input_embeds`-aware decoder** so audio
embeddings can be scattered at `<|audio_pad|>` positions to do real ASR,
not just text generation.
Converted via [ANEMLL](https://github.com/Anemll/Anemll) with a custom
`convert_embeds_mixed.py` that:
1. Monkey-patches `QwenModel.forward` + `QwenForCausalLM.forward` to accept
pre-embedded `hidden_states` (skipping the internal `embed_tokens`
lookup) so audio scatter works at inference.
2. Enumerates the MIL program's const-weight ops by name pattern and applies
**LUT-8 palettization to attention projections** (q/k/v/o_proj) and
**LUT-4 to MLP projections** (gate/up/down_proj), mirroring the MLX
`mixed8_4` recipe that closed the gap to GPTQ on the LLM portion.
3. Runs with `compute_precision=FLOAT32`, because fp16 compute precision produces
all-NaN logits on Qwen3-ASR's RMSNorm/attention (this matches the aoiandroid
community finding for the same base model).
## What's in this repo
| File | Size | Role |
| --- | ---: | --- |
| `coreml/mega-asr-llm-embeds_mixed8_4.mlpackage/` | **1.87 GB** | **Recommended.** Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, 8-bit attn + 4-bit MLP, ~5.0 bpw avg. |
| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` | 826 MB | Smaller variant. Uniform LUT-4 weights. Agreement is 3.7 points lower than mixed 8/4. |
| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Standalone Qwen3 text LLM with `input_ids` input (no audio scatter). |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
| `tokenizer/*` | - | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `inference_asr.py` | - | End-to-end ASR pipeline (ONNX encoder + CoreML LLM) |
| `convert_embeds.py` / `convert_embeds_mixed.py` | - | The custom converters |
## Quality (bench)
Agreement (1 − WER) on 8 clips from
[Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench),
with the prompt forced to `language English`, the ONNX fp32 audio encoder,
and the CoreML LLM run with `compute_units=ALL` (which executes on the Metal
GPU, since ANE compilation fails on this model size + stateful KV cache):
| Per-sample | Mixed 8/4 (recommended) | Uniform LUT-4 |
| --- | ---: | ---: |
| distortion | 100% | 100% |
| dropout | 100% | 100% |
| echo (hard, heavy reverb) | **64.7%** | 47.1% |
| far_field | 100% | 100% |
| mixed | 100% | 100% |
| noise | 100% | 100% |
| obstructed | **100%** | 88.2% |
| recording (hard, truncated audio) | 60.0% | 60.0% |
| **AVERAGE** | **90.6%** | 86.9% |
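As a rough illustration of the metric, the sketch below computes agreement as 1 − WER using a plain word-level edit distance, then recomputes the AVERAGE row from the per-sample scores above. The lowercase-and-split normalization and the example transcripts are assumptions; the benchmark's own scoring script may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (ref word missing in hyp)
                cur[j - 1] + 1,          # insertion (extra hyp word)
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    return 1.0 - word_error_rate(reference, hypothesis)


# Made-up transcripts: one substitution plus one deletion over 5 reference words.
score = agreement("the quick brown fox jumps", "the quick brown dog")
print(f"agreement = {score:.1%}")

# Per-sample scores copied from the table; the AVERAGE row is their mean.
mixed_8_4 = [100, 100, 64.7, 100, 100, 100, 100, 60.0]
uniform_lut4 = [100, 100, 47.1, 100, 100, 100, 88.2, 60.0]
print(round(sum(mixed_8_4) / 8, 1), round(sum(uniform_lut4) / 8, 1))
```

With only 8 clips, a single wrong word on a short clip moves the average by a visible amount, so small differences between variants should be read with that in mind.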
Mixed 8/4 lifts CoreML agreement from 86.9% to 90.6% (+3.7 points) by
assigning the 4 attention projections in each layer to LUT-8 (256 unique
values per group of 8 channels) while keeping the 3 MLP projections at LUT-4
(16 unique values per group of 8 channels). The attention projections in
Qwen3 are quality-critical, the same result we found in the MLX port.
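The "~5.0 bpw avg" figure can be sanity-checked with a little arithmetic. The sketch below assumes the publicly documented Qwen3-1.7B layer shapes (hidden size 2048, MLP width 6144, 16 query heads and 8 key/value heads of dimension 128); these are not stated in this README, and the calculation ignores lookup-table overhead, embeddings, and norms.

```python
# Assumed Qwen3-1.7B shapes (from the public base-model config, not this README).
HIDDEN = 2048          # matches the (F, 2048) audio embeddings
INTERMEDIATE = 6144    # MLP inner width
Q_DIM = 16 * 128       # 16 query heads x head_dim 128
KV_DIM = 8 * 128       # 8 key/value heads x head_dim 128 (GQA)

# Weights per transformer layer.
attn_params = HIDDEN * Q_DIM * 2 + HIDDEN * KV_DIM * 2   # q_proj, o_proj + k_proj, v_proj
mlp_params = HIDDEN * INTERMEDIATE * 3                   # gate_proj, up_proj, down_proj

# 8 bits for attention weights, 4 bits for MLP weights.
total_bits = attn_params * 8 + mlp_params * 4
bits_per_weight = total_bits / (attn_params + mlp_params)
print(f"{bits_per_weight:.2f} bits per weight")
```

Under these assumptions the MLP holds three quarters of the per-layer weights, which is why moving only the attention projections to 8 bits raises the average by just one bit.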
Cross-backend leaderboard (same 8 samples, same audio encoder):
| Backend | Agreement |
| --- | ---: |
| ONNX recommended (GPTQ INT4) | 92.7% |
| MLX recommended (mixed 8/4) | 92.2% |
| **CoreML recommended (mixed 8/4)** | **90.6%** |
| ONNX RTN INT4 baseline | 87.8% |
| CoreML LUT-4 baseline | 86.9% |
The remaining gap of about 2 points to ONNX/MLX comes from the LUT-vs-GPTQ
scheme difference (k-means clustering vs activation-aware Hessian
redistribution). The two hard samples (`echo`, `recording`) are
audio-quality-limited and stay around 60-65% across the 4-bit backends,
apart from uniform LUT-4, which drops to 47.1% on `echo`.
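To make the LUT side of that comparison concrete, here is a minimal NumPy sketch of k-means palettization on one 1-D weight vector. It is a plain Lloyd iteration with a linear initialization, chosen for illustration only: it is not coremltools' implementation, it uses a single table rather than per-group tables, and the weights are random stand-ins.

```python
import numpy as np


def palettize(weights, nbits, iters=10):
    """Cluster a 1-D weight array into 2**nbits centroids (plain Lloyd k-means)."""
    k = 2 ** nbits
    lut = np.linspace(weights.min(), weights.max(), k)   # initial lookup table
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        idx = np.abs(weights[:, None] - lut[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = weights[idx == c]
            if members.size:
                lut[c] = members.mean()
    idx = np.abs(weights[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.astype(np.uint8)


rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)   # stand-in for one weight group

luts, errors = {}, {}
for nbits in (4, 8):
    lut, idx = palettize(w, nbits)
    luts[nbits] = lut
    # Reconstruction: each weight is replaced by its table entry.
    errors[nbits] = float(np.mean((w - lut[idx]) ** 2))
    print(f"LUT-{nbits}: {len(lut)} table entries, MSE {errors[nbits]:.2e}")
```

Only the indices and the small table are stored, so the cost per weight is `nbits` plus a negligible table overhead. Unlike GPTQ, nothing here looks at activations, which is the scheme difference the paragraph above refers to.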
## Inference
```bash
pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
  --mlpackage coreml/mega-asr-llm-embeds_mixed8_4.mlpackage \
  --encoder-path onnx/audio_encoder_fp32.onnx \
  --examples-dir examples \
  --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir> \
  --compute-unit ALL
```
The pipeline:
1. **Mel features** via Qwen3-ASR's `WhisperFeatureExtractor`.
2. **Audio encoder** (ONNX fp32) → audio embeddings `(F, 2048)`.
3. **Prompt + scatter**: build the Qwen3-ASR chat template with English
forcing, expand the single `<|audio_pad|>` placeholder to F slots,
look up text embeddings via the HF model's `embed_tokens` weight, and
scatter the audio embeddings into the placeholder positions.
4. **CoreML prefill**: feed each token's embedding one at a time to
populate the in-model KV cache state.
5. **CoreML decode**: greedy step-by-step until `<|im_end|>`.
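Step 3 can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the code in `inference_asr.py`: the function names, the toy sizes, and the `PAD_ID` value are hypothetical, and the real placeholder id comes from the tokenizer.

```python
import numpy as np


def expand_audio_pad(token_ids, pad_id, num_frames):
    """Replace the single placeholder token with `num_frames` copies of itself."""
    out = []
    for t in token_ids:
        out.extend([pad_id] * num_frames if t == pad_id else [t])
    return out


def build_inputs_embeds(token_ids, embed_table, audio_embeds, pad_id):
    """Look up text embeddings, then overwrite placeholder rows with audio."""
    ids = np.asarray(expand_audio_pad(token_ids, pad_id, len(audio_embeds)))
    embeds = embed_table[ids]            # fancy indexing returns a fresh (T, hidden) array
    mask = ids == pad_id
    if mask.sum() != len(audio_embeds):
        raise ValueError("prompt must contain exactly one audio placeholder")
    embeds[mask] = audio_embeds          # scatter, preserving frame order
    return ids, embeds


# Toy sizes: vocabulary of 10 and hidden size 4 (the real hidden size is 2048).
PAD_ID = 9                               # hypothetical id for <|audio_pad|>
rng = np.random.default_rng(0)
table = rng.normal(size=(10, 4)).astype(np.float32)
audio = rng.normal(size=(3, 4)).astype(np.float32)   # F = 3 frames

ids, embeds = build_inputs_embeds([1, PAD_ID, 2], table, audio, PAD_ID)
print(ids.tolist(), embeds.shape)
```

The text rows still come from the embedding table, so the LLM sees one uniform sequence of vectors and needs no knowledge of which positions were audio.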
The KV cache lives inside the CoreML model as `state`. Call
`model.make_state()` once per request, then thread the same state object
through every `predict()` call.
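The prefill and decode steps share one loop shape, sketched below with the model abstracted behind a hypothetical `step(embedding) -> logits` callable. In the real pipeline that callable would wrap `model.predict(..., state=state)`; the CoreML input and output feature names are not given here, so the example drives the loop with a toy stand-in instead.

```python
import numpy as np


def greedy_generate(step, prompt_embeds, embed_table, eos_id, max_new_tokens=256):
    """Prefill one position at a time, then decode greedily until `eos_id`.

    `step(embedding) -> logits` is a hypothetical callable that runs one
    position through the LLM and updates the KV cache it owns.
    """
    if len(prompt_embeds) == 0:
        raise ValueError("prompt must contain at least one position")
    for emb in prompt_embeds:              # prefill: populate the KV cache
        logits = step(emb)
    generated = []
    for _ in range(max_new_tokens):        # decode: feed back the argmax token
        next_id = int(np.argmax(logits))
        if next_id == eos_id:
            break
        generated.append(next_id)
        logits = step(embed_table[next_id])
    return generated


# With coremltools, `step` would close over one state object per request:
#   state = model.make_state()
#   step = lambda emb: model.predict({...}, state=state)[...]
# The feature names inside {...} and [...] are model-specific and elided here.


def make_fake_step(script, prompt_len, vocab_size):
    """Toy stand-in for the model: replays a fixed script of token ids."""
    calls = {"n": 0}

    def step(_embedding):
        calls["n"] += 1
        pos = min(max(calls["n"] - prompt_len, 0), len(script) - 1)
        logits = np.zeros(vocab_size)
        logits[script[pos]] = 1.0
        return logits

    return step


EOS_ID = 7                                  # hypothetical id for <|im_end|>
table = np.eye(8, dtype=np.float32)         # toy embedding table
fake_step = make_fake_step([3, 5, EOS_ID], prompt_len=2, vocab_size=8)
tokens = greedy_generate(fake_step, table[[0, 1]], table, EOS_ID)
print(tokens)
```

Creating a fresh state per request matters because the cache is mutated in place: reusing a state across clips would let one transcript's context leak into the next.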
## Conversion details
```python
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# Apply per-op-name palettization: attention at LUT-8, MLP at LUT-4.
# `mlmodel` is the already-converted CoreML model (ML program).
attn_ops, mlp_ops = [], []
prog = mlmodel._mil_program
for op in prog.functions["main"].operations:
    if op.op_type != "const":
        continue
    n = op.name.lower()
    if "self_attn" in n and any(p in n for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        attn_ops.append(op.name)
    elif "mlp" in n and any(p in n for p in ("gate_proj", "up_proj", "down_proj")):
        mlp_ops.append(op.name)

# Map each collected op name to its own palettizer config.
config = OptimizationConfig(op_name_configs={
    **{n: OpPalettizerConfig(nbits=8, group_size=8) for n in attn_ops},
    **{n: OpPalettizerConfig(nbits=4, group_size=8) for n in mlp_ops},
})
mlmodel = palettize_weights(mlmodel, config)
```
The model exposes 84 attention weight ops (28 layers × 3 attention
projections, counted after the GQA-shared k/v gets clustered into k+v ops)
and 84 MLP weight ops (28 layers × 3 MLP projections).
`compute_precision=FLOAT32` is mandatory: fp16 compute on Qwen3-ASR
produces all-NaN logits (RMSNorm + attention score overflow).
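The overflow mechanism is easy to reproduce in isolation. The toy below is not a trace of the actual model: it uses made-up query/key values and omits the usual score scaling, and only shows how one fp16 overflow (anything above 65504 becomes `inf`) turns a whole softmax row into NaN while fp32 stays finite.

```python
import numpy as np


def softmax(scores):
    """Standard max-subtracted softmax."""
    shifted = scores - scores.max()      # inf - inf = nan once a score overflows
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_probs(q, keys, dtype):
    """Dot-product scores for one query against each key, computed in `dtype`."""
    q = q.astype(dtype)
    keys = keys.astype(dtype)
    scores = (keys * q).sum(axis=-1)     # 300 * 300 = 90000 exceeds the fp16 max
    return softmax(scores)


q = np.full(4, 300.0)                                     # made-up large activations
keys = np.stack([np.full(4, 300.0), np.full(4, 1.0)])

with np.errstate(over="ignore", invalid="ignore"):
    probs_fp16 = attention_probs(q, keys, np.float16)
probs_fp32 = attention_probs(q, keys, np.float32)

print("fp16:", probs_fp16)
print("fp32:", probs_fp32)
```

Once a NaN appears in one attention row it propagates through every later layer, which is consistent with the all-NaN logits reported above.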
A local `coremltools` patch was needed in `_cast`
(`coremltools/converters/mil/frontend/torch/ops.py`): NumPy arrays of size 1
must be coerced to a scalar via `.flatten()[0].item()` before the dtype
call. See the setup notes in `convert_embeds_mixed.py`.
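The coercion itself looks roughly like the helper below. This is a standalone sketch of the idea rather than the actual patch: `to_python_scalar` and `cast_constant` are hypothetical names, and the surrounding `_cast` logic in coremltools is not reproduced.

```python
import numpy as np


def to_python_scalar(value):
    """Unwrap a size-1 NumPy array (any shape) into a plain Python scalar."""
    if isinstance(value, np.ndarray):
        if value.size != 1:
            raise ValueError(f"expected a single element, got shape {value.shape}")
        return value.flatten()[0].item()
    return value


def cast_constant(value, dtype):
    """Apply a Python type such as int or float after unwrapping the array."""
    return dtype(to_python_scalar(value))


as_int = cast_constant(np.array([[3.7]]), int)      # shape (1, 1) array
as_float = cast_constant(np.array([2]), float)      # shape (1,) array
print(as_int, as_float)
```

Unwrapping first avoids relying on NumPy's implicit array-to-scalar conversion, which recent NumPy releases deprecate for arrays with one or more dimensions.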
## Known limitations
1. **ANE rejected**. CoreML's ANE compiler fails (`MILCompilerForANE
error: failed to compile ANE model using ANEF`), likely due to model
size + stateful KV cache. `CPU_AND_NE` fails to load. `ALL` runs on the
**Metal GPU** (correct output and ~3-4× faster than `CPU_ONLY`), so it is
the recommended setting.
2. **Audio encoder is ONNX**. The 24-layer Whisper-style encoder isn't
ported to CoreML yet (ANEMLL is LLM-only). End-to-end inference runs the
encoder via `onnxruntime` and the LLM via `coremltools`.
3. **Quality below ONNX/MLX** by about 2 points at 4-bit, due to LUT k-means
being weaker than GPTQ on this architecture. The uniform LUT-4 variant is
smaller (826 MB) if size is critical; the mixed 8/4 variant (1.87 GB) is
recommended for best quality.
## Companion repos
- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx): full ONNX pipeline (GPTQ-INT4, 92.7%)
- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx): MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench): browser demo (WebGPU)
## Credits
- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) with custom input_embeds + mixed-precision patches
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)