---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- 8bit
- mixed-precision
- lut
- palettize
- on-device
- apple-silicon
- ios
- macos
- qwen3
- qwen3-asr
- mega-asr
- anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---

# Mega-ASR: CoreML mixed 8/4 (end-to-end ASR)

CoreML deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR)
(Qwen3-ASR-1.7B base) with an **`input_embeds`-aware decoder**, so audio
embeddings can be scattered at the `<|audio_pad|>` positions to do real ASR,
not just text generation.

Converted via [ANEMLL](https://github.com/Anemll/Anemll) with a custom
`convert_embeds_mixed.py` that:
1. Monkey-patches `QwenModel.forward` and `QwenForCausalLM.forward` to accept
   pre-embedded `hidden_states` (skipping the internal `embed_tokens`
   lookup) so that audio scatter works at inference.
2. Enumerates the MIL program's const-weight ops by name pattern and applies
   **LUT-8 palettization to attention projections** (q/k/v/o_proj) and
   **LUT-4 to MLP projections** (gate/up/down_proj), mirroring the MLX
   `mixed8_4` recipe that closed the gap to GPTQ on the LLM portion.
3. Converts with `compute_precision=FLOAT32`, because fp16 compute precision
   produces all-NaN logits on Qwen3-ASR's RMSNorm/attention (this matches the
   aoiandroid community finding for the same base model).

## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `coreml/mega-asr-llm-embeds_mixed8_4.mlpackage/` | **1.87 GB** | **Recommended.** Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, 8-bit attention + 4-bit MLP, ~5.0 bpw average. |
| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` | 826 MB | Smaller variant. Uniform LUT-4 weights. 3.7 points lower agreement than mixed 8/4. |
| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Standalone Qwen3 text LLM with `input_ids` input (no audio scatter). |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending). |
| `tokenizer/*` | - | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.). |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench. |
| `inference_asr.py` | - | End-to-end ASR pipeline (ONNX encoder + CoreML LLM). |
| `convert_embeds.py` / `convert_embeds_mixed.py` | - | The custom converters. |

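
The ~5.0 bpw average quoted for the mixed 8/4 package can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the layer dimensions are assumptions taken from the public Qwen3-1.7B configuration (they are not read from this repo), and the calculation covers just the seven palettized projections per layer, ignoring lookup-table overhead, embeddings, norms and the LM head.

```python
# Dimensions below are ASSUMPTIONS based on the public Qwen3-1.7B config,
# not values read from the files in this repo.
HIDDEN = 2048
INTERMEDIATE = 6144
NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM = 16, 8, 128


def projection_params() -> dict:
    """Weight counts of the seven palettized projections in one decoder layer."""
    q_out = NUM_Q_HEADS * HEAD_DIM
    kv_out = NUM_KV_HEADS * HEAD_DIM
    return {
        "q_proj": HIDDEN * q_out,
        "k_proj": HIDDEN * kv_out,
        "v_proj": HIDDEN * kv_out,
        "o_proj": q_out * HIDDEN,
        "gate_proj": HIDDEN * INTERMEDIATE,
        "up_proj": HIDDEN * INTERMEDIATE,
        "down_proj": INTERMEDIATE * HIDDEN,
    }


def average_bits_per_weight(attn_bits: int, mlp_bits: int) -> float:
    """Parameter-weighted average index width over the projections."""
    params = projection_params()
    attn = sum(params[n] for n in ("q_proj", "k_proj", "v_proj", "o_proj"))
    mlp = sum(params[n] for n in ("gate_proj", "up_proj", "down_proj"))
    return (attn_bits * attn + mlp_bits * mlp) / (attn + mlp)


mixed_bpw = average_bits_per_weight(attn_bits=8, mlp_bits=4)
uniform_bpw = average_bits_per_weight(attn_bits=4, mlp_bits=4)
print(f"mixed 8/4: {mixed_bpw:.2f} bpw, uniform LUT-4: {uniform_bpw:.2f} bpw")
```

Under these assumed shapes the attention projections hold a quarter of the projection weights, which is why raising them to 8 bits costs only about one extra bit per weight on average.
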
## Quality (bench)

Agreement (1 − WER) on 8 clips from
[Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench),
with the prompt forced to `language English`, using the ONNX fp32 audio
encoder plus the CoreML LLM, run with `compute_units=ALL` (Metal GPU, since
ANE compilation fails on this model size + stateful KV cache):

| Sample | Mixed 8/4 (recommended) | Uniform LUT-4 |
| --- | ---: | ---: |
| distortion | 100% | 100% |
| dropout | 100% | 100% |
| echo (hard, heavy reverb) | **64.7%** | 47.1% |
| far_field | 100% | 100% |
| mixed | 100% | 100% |
| noise | 100% | 100% |
| obstructed | **100%** | 88.2% |
| recording (hard, truncated audio) | 60.0% | 60.0% |
| **AVERAGE** | **90.6%** | 86.9% |

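
For readers who want to reproduce the agreement metric, here is a minimal sketch of 1 − WER using a word-level edit distance. It is an illustration rather than the benchmark's scoring script: any text normalization the benchmark applies (casing, punctuation) is not described here, so this version simply splits on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the tables: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)


# One substituted word out of four.
print(agreement("the quick brown fox", "the quick brown dog"))  # 0.75
```
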
Mixed 8/4 lifts CoreML agreement from 86.9% to 90.6% (+3.7 points) by
assigning the 4 attention projections per layer to LUT-8 (256 unique values
per group of 8 channels) while keeping the 3 MLP projections at LUT-4
(16 unique values per group of 8 channels). Attention layers in Qwen3 are
quality-critical, the same result we found in the MLX port.

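
To make the LUT idea concrete, the following toy sketch palettizes a flat list of weights with 1-D k-means: each weight is replaced by an index into a lookup table of at most `2**nbits` centroids. It is a simplified stand-in, not the coremltools implementation; it treats all weights as one group and does not model the per-group-of-channels granularity, and the function names are made up for illustration.

```python
import random


def nearest(lut, w):
    """Index of the LUT entry closest to weight w."""
    return min(range(len(lut)), key=lambda c: abs(w - lut[c]))


def palettize(weights, nbits, iters=10, seed=0):
    """Lloyd's k-means in 1-D. Returns (lut, indices) with len(lut) <= 2**nbits."""
    distinct = sorted(set(weights))
    k = min(2 ** nbits, len(distinct))
    lut = sorted(random.Random(seed).sample(distinct, k))
    for _ in range(iters):
        indices = [nearest(lut, w) for w in weights]
        for c in range(k):
            members = [w for w, i in zip(weights, indices) if i == c]
            if members:  # keep the old centroid if a cluster is empty
                lut[c] = sum(members) / len(members)
    return lut, [nearest(lut, w) for w in weights]


def mse(weights, lut, indices):
    """Mean squared reconstruction error of the palettized weights."""
    return sum((w - lut[i]) ** 2 for w, i in zip(weights, indices)) / len(weights)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(512)]
lut4, idx4 = palettize(weights, nbits=4)
lut8, idx8 = palettize(weights, nbits=8)
print(len(lut4), len(lut8))  # 16 256
print(mse(weights, lut4, idx4) > mse(weights, lut8, idx8))
```

The larger table leaves less reconstruction error per weight, which is the budget the mixed recipe spends on the attention projections.
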
Cross-backend leaderboard (same 8 samples, same audio encoder):

| Backend | Agreement |
| --- | ---: |
| ONNX recommended (GPTQ INT4) | 92.7% |
| MLX recommended (mixed 8/4) | 92.2% |
| **CoreML recommended (mixed 8/4)** | **90.6%** |
| ONNX RTN INT4 baseline | 87.8% |
| CoreML LUT-4 baseline | 86.9% |

The remaining ~2-point gap to ONNX/MLX is the LUT-vs-GPTQ scheme difference
(k-means clustering vs activation-aware Hessian redistribution). The two
hard samples (`echo`, `recording`) are audio-quality-limited and stay
around 60-65% across all 4-bit backends.

## Inference

```bash
pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
  --mlpackage coreml/mega-asr-llm-embeds_mixed8_4.mlpackage \
  --encoder-path onnx/audio_encoder_fp32.onnx \
  --examples-dir examples \
  --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir> \
  --compute-unit ALL
```

The pipeline:
1. **Mel features** via Qwen3-ASR's `WhisperFeatureExtractor`.
2. **Audio encoder** (ONNX fp32) → audio embeddings `(F, 2048)`.
3. **Prompt + scatter**: build the Qwen3-ASR chat template with English
   forcing, expand the single `<|audio_pad|>` placeholder to F slots,
   look up text embeddings via the HF model's `embed_tokens` weight, and
   scatter the audio embeddings at the placeholder positions.
4. **CoreML prefill**: feed each token's embedding one at a time to
   populate the in-model KV cache state.
5. **CoreML decode**: greedy step-by-step until `<|im_end|>`.

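
Step 3 of the pipeline can be sketched in a few lines of plain Python. This is a toy illustration, not the code in `inference_asr.py`: the function names, the token ids and the 2-dimensional embeddings are all hypothetical, and a real run would use the tokenizer's `<|audio_pad|>` id and 2048-dimensional vectors.

```python
def expand_placeholder(token_ids, pad_id, num_frames):
    """Replace the single placeholder token with num_frames copies of it."""
    if token_ids.count(pad_id) != 1:
        raise ValueError("expected exactly one placeholder token in the prompt")
    at = token_ids.index(pad_id)
    return token_ids[:at] + [pad_id] * num_frames + token_ids[at + 1:]


def scatter_audio(token_ids, pad_id, embed_table, audio_embeds):
    """Look up text embeddings, then overwrite placeholder slots with audio frames."""
    slots = [i for i, t in enumerate(token_ids) if t == pad_id]
    if len(slots) != len(audio_embeds):
        raise ValueError("number of placeholder slots must equal number of audio frames")
    sequence = [embed_table[t] for t in token_ids]
    for slot, frame in zip(slots, audio_embeds):
        sequence[slot] = frame
    return sequence


# Toy data: hypothetical ids and 2-d embeddings.
PAD_ID = 9
embed_table = {1: [0.1, 0.1], 2: [0.2, 0.2], PAD_ID: [0.0, 0.0]}
audio_embeds = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # F = 3 frames

prompt_ids = expand_placeholder([1, PAD_ID, 2], PAD_ID, num_frames=len(audio_embeds))
inputs_embeds = scatter_audio(prompt_ids, PAD_ID, embed_table, audio_embeds)
print(prompt_ids)     # [1, 9, 9, 9, 2]
print(inputs_embeds)  # [[0.1, 0.1], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.2, 0.2]]
```
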
The KV cache lives inside the CoreML model as `state`. Call
`model.make_state()` once per request, then thread the same state object
through every `predict()` call.

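
The prefill and greedy-decode steps can be written against a generic `step` callable, which keeps the loop independent of CoreML. This is a sketch under assumptions, not the repo's `inference_asr.py`: `greedy_transcribe`, `step` and `embed_token` are hypothetical names, and the CoreML input and output names in the trailing comment are placeholders that should be checked against the model's spec.

```python
def greedy_transcribe(step, prompt_embeds, embed_token, eos_id, max_new_tokens=256):
    """Prefill one position at a time, then decode greedily until eos_id.

    step(embedding) -> logits for the next token; it is expected to update a
    KV cache that it owns (for example a CoreML state object).
    embed_token(token_id) -> embedding of a generated token.
    """
    if not prompt_embeds:
        raise ValueError("prompt must contain at least one position")
    logits = None
    for embedding in prompt_embeds:  # prefill: only the last logits are used
        logits = step(embedding)
    generated = []
    for _ in range(max_new_tokens):
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        generated.append(next_id)
        logits = step(embed_token(next_id))
    return generated


# With coremltools, `step` could close over a single state object, e.g.:
#   state = model.make_state()
#   def step(embedding):
#       out = model.predict({"inputs_embeds": embedding}, state=state)
#       return out["logits"]
# The names "inputs_embeds" and "logits" are hypothetical, and a real model
# may need further inputs (position ids, masks).
```

Creating the state once and reusing it inside `step` is what keeps the cache from being reset between tokens.
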
## Conversion details

```python
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# `mlmodel` is the already-converted CoreML model (weights not yet palettized).
# Collect const-weight op names: attention goes to LUT-8, MLP to LUT-4.
attn_ops, mlp_ops = [], []
prog = mlmodel._mil_program  # private attribute of the loaded MLModel
for op in prog.functions["main"].operations:
    if op.op_type != "const":
        continue
    n = op.name.lower()
    if "self_attn" in n and any(p in n for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        attn_ops.append(op.name)
    elif "mlp" in n and any(p in n for p in ("gate_proj", "up_proj", "down_proj")):
        mlp_ops.append(op.name)

# Per-op-name palettization configs, then apply them in one pass.
config = OptimizationConfig(op_name_configs={
    **{n: OpPalettizerConfig(nbits=8, group_size=8) for n in attn_ops},
    **{n: OpPalettizerConfig(nbits=4, group_size=8) for n in mlp_ops},
})
mlmodel = palettize_weights(mlmodel, config)
```

The model exposes 84 attention weight ops (28 layers × 3 attention
projections after the GQA-shared k/v gets clustered into k+v ops) and
84 MLP weight ops (28 layers × 3 MLP projections).

`compute_precision=FLOAT32` is mandatory: fp16 compute on Qwen3-ASR
produces all-NaN logits (RMSNorm + attention score overflow).

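
The mechanism by which an fp16 overflow becomes NaN can be shown with the standard library alone. The sketch below is a toy illustration, not a trace of this model: the vector length and magnitudes are made-up values chosen so that a dot product accumulated in half precision exceeds the fp16 maximum (65504), and it says nothing about which specific op overflows in Qwen3-ASR.

```python
import math
import struct


def fp16(x: float) -> float:
    """Round x to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def dot(q, k, half: bool) -> float:
    """Dot product, optionally rounding the accumulator to fp16 at each step."""
    acc = 0.0
    for a, b in zip(q, k):
        acc = acc + a * b
        if half:
            acc = fp16(acc)
    return acc


def softmax(scores):
    """Max-subtracted softmax; inf - inf yields NaN, as in real kernels."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical large activations: 128 * 64 * 64 = 524288, beyond the fp16 range.
q = [64.0] * 128
keys = [[64.0] * 128, [64.0] * 128]

half_scores = [dot(q, k, half=True) for k in keys]
full_scores = [dot(q, k, half=False) for k in keys]
print(half_scores, softmax(half_scores))  # [inf, inf] [nan, nan]
print(full_scores, softmax(full_scores))  # [524288.0, 524288.0] [0.5, 0.5]
```

Once a single score saturates to infinity the softmax row turns into NaN, and NaN then propagates through every later layer to the logits.
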
A local `coremltools` patch was needed in the `_cast` function of
`coremltools/converters/mil/frontend/torch/ops.py`: numpy arrays
of size 1 need to be coerced to a scalar via `.flatten()[0].item()` before
the dtype call. See the setup notes in `convert_embeds_mixed.py`.

## Known limitations

1. **ANE rejected**. CoreML's ANE compiler fails (`MILCompilerForANE
   error: failed to compile ANE model using ANEF`), likely due to model
   size + stateful KV cache. `CPU_AND_NE` fails to load. `ALL` runs on the
   **Metal GPU** (correct output and ~3-4× faster than `CPU_ONLY`), which
   is the recommended setting.
2. **Audio encoder is ONNX**. The 24-layer Whisper-style encoder isn't
   ported to CoreML yet (ANEMLL is LLM-only). The end-to-end pipeline runs
   the encoder via `onnxruntime` and the LLM via `coremltools`.
3. **Quality below ONNX/MLX** by ~2 points at 4-bit, due to LUT k-means
   being weaker than GPTQ on this architecture. The uniform LUT-4 variant is
   smaller (826 MB) if size is critical; the mixed 8/4 variant (1.87 GB) is
   recommended for best quality.

## Companion repos

- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx): full ONNX pipeline (GPTQ-INT4, 92.7%)
- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx): MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench): browser demo (WebGPU)

## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) with custom `input_embeds` + mixed-precision patches
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)