---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- 8bit
- mixed-precision
- lut
- palettize
- on-device
- apple-silicon
- ios
- macos
- qwen3
- qwen3-asr
- mega-asr
- anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
Mega-ASR: CoreML mixed 8/4 (end-to-end ASR)
CoreML deployment of zhifeixie/Mega-ASR
(Qwen3-ASR-1.7B base) with an input_embeds-aware decoder, so audio
embeddings can be scattered at <|audio_pad|> positions to do real ASR,
not just text generation.
Converted via ANEMLL with a custom
convert_embeds_mixed.py that:
- Monkey-patches QwenModel.forward + QwenForCausalLM.forward to accept pre-embedded hidden_states (skipping the internal embed_tokens lookup) so audio scatter works at inference.
- Enumerates the MIL program's const-weight ops by name pattern and applies LUT-8 palettization to attention projections (q/k/v/o_proj) and LUT-4 to MLP projections (gate/up/down_proj), mirroring the MLX mixed8_4 recipe that closed the gap to GPTQ on the LLM portion.
- Runs with compute_precision=FLOAT32, because fp16 compute precision produces all-NaN logits on Qwen3-ASR's RMSNorm/attention (this matches the aoiandroid community finding for the same base model).
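The first point, a forward pass that accepts pre-embedded inputs, can be illustrated with a toy decoder. This is only a sketch of the monkey-patching idea: ToyDecoder, run_layers and forward_with_embeds are hypothetical names, and the real patch in convert_embeds_mixed.py targets ANEMLL's QwenModel and QwenForCausalLM classes.

```python
# Toy stand-in for a decoder; class and argument names are illustrative,
# not ANEMLL's actual API.
class ToyDecoder:
    def __init__(self, embedding_table):
        self.embed_tokens = embedding_table  # one vector per token id

    def forward(self, input_ids):
        # Original behaviour: always look up embeddings from token ids.
        hidden_states = [self.embed_tokens[i] for i in input_ids]
        return self.run_layers(hidden_states)

    def run_layers(self, hidden_states):
        # Placeholder for the transformer stack: it just sums each vector.
        return [sum(vec) for vec in hidden_states]


def forward_with_embeds(self, input_ids=None, hidden_states=None):
    """Replacement forward: use pre-embedded inputs when they are supplied."""
    if hidden_states is None:
        hidden_states = [self.embed_tokens[i] for i in input_ids]
    return self.run_layers(hidden_states)


# Monkey-patch: swap the method on the class, leaving the weights untouched.
ToyDecoder.forward = forward_with_embeds

decoder = ToyDecoder([[0.0, 1.0], [2.0, 3.0]])
from_ids = decoder.forward(input_ids=[1, 0])
# The second vector is not in the embedding table, as an audio frame would be.
from_embeds = decoder.forward(hidden_states=[[2.0, 3.0], [9.0, 9.0]])
```

Because the lookup is skipped whenever hidden_states is given, any vector of the right width can be fed in, which is what makes the audio scatter possible.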
What's in this repo
| File | Size | Role |
|---|---|---|
| coreml/mega-asr-llm-embeds_mixed8_4.mlpackage/ | 1.87 GB | Recommended. Qwen3 1.7B LLM, inputs_embeds input, fp32 compute, 8-bit attn + 4-bit MLP, ~5.0 bpw avg. |
| coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/ | 826 MB | Smaller variant. Uniform LUT-4 weights. 3.7 points lower agreement than mixed. |
| coreml/mega-asr-llm_lut4.mlpackage/ | 974 MB | Standalone Qwen3 text LLM with input_ids input (no audio scatter). |
| onnx/audio_encoder_fp32.onnx | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) |
| tokenizer/* | n/a | Original Qwen3-ASR tokenizer (<\|audio_pad\|>, <asr_text>, etc.) |
| examples/*.wav | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| inference_asr.py | n/a | End-to-end ASR pipeline (ONNX encoder + CoreML LLM) |
| convert_embeds.py / convert_embeds_mixed.py | n/a | The custom converters |
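The "~5.0 bpw avg" figure for the mixed variant can be sanity-checked with a little arithmetic. The sketch below assumes the usual Qwen3-1.7B decoder dimensions (hidden size 2048, MLP width 6144, 16 query heads, 8 key/value heads, head dimension 128); none of these are stated in this card, and the calculation ignores lookup tables, embeddings and norm weights.

```python
# Assumed Qwen3-1.7B decoder dimensions (not stated in this card).
HIDDEN = 2048
INTERMEDIATE = 6144
NUM_HEADS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 128


def average_bits_per_weight(attn_bits, mlp_bits):
    """Average index bits per projection weight in one decoder layer."""
    q_out = NUM_HEADS * HEAD_DIM
    kv_out = NUM_KV_HEADS * HEAD_DIM
    # q_proj + k_proj + v_proj + o_proj
    attn_params = HIDDEN * q_out + 2 * HIDDEN * kv_out + q_out * HIDDEN
    # gate_proj + up_proj + down_proj
    mlp_params = 3 * HIDDEN * INTERMEDIATE
    total_bits = attn_params * attn_bits + mlp_params * mlp_bits
    return total_bits / (attn_params + mlp_params)


mixed_bpw = average_bits_per_weight(attn_bits=8, mlp_bits=4)
uniform_bpw = average_bits_per_weight(attn_bits=4, mlp_bits=4)
print(f"mixed 8/4: {mixed_bpw:.2f} bpw, uniform LUT-4: {uniform_bpw:.2f} bpw")
```

Every layer has the same shape, so the layer count cancels out of the average. Under these assumptions the MLP holds three quarters of the projection weights, which is why moving only attention to 8 bits raises the average by a single bit.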
Quality (bench)
8-clip Voices-in-the-Wild-Bench
agreement (1 - WER), with the prompt forced to English, the ONNX fp32
audio encoder + the CoreML LLM, run with compute_units=ALL (Metal GPU,
since ANE compilation fails on this model size + stateful KV cache):
| Per-sample | Mixed 8/4 (recommended) | Uniform LUT-4 |
|---|---|---|
| distortion | 100% | 100% |
| dropout | 100% | 100% |
| echo (hard, heavy reverb) | 64.7% | 47.1% |
| far_field | 100% | 100% |
| mixed | 100% | 100% |
| noise | 100% | 100% |
| obstructed | 100% | 88.2% |
| recording (hard, truncated audio) | 60.0% | 60.0% |
| AVERAGE | 90.6% | 86.9% |
Mixed 8/4 lifts CoreML from 86.9% to 90.6% (+3.7 points) by allocating the 4 attention projections per layer to LUT-8 (256 unique values per group of 8 channels) while keeping the 3 MLP projections at LUT-4 (16 unique values per group of 8 channels). Attention layers in Qwen3 are quality-critical, the same result we found in the MLX port.
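For reference, the agreement metric in the table above (1 - WER) can be sketched as a word-level edit distance. The lowercasing, whitespace tokenization and clamping at zero are assumptions; the text normalization actually used for the benchmark is not described in this card.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def agreement(reference, hypothesis):
    """Agreement = 1 - WER, clamped at 0 (the clamp is an assumption)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# One substitution and one deletion against a 5-word reference.
example = agreement("the quick brown fox jumps", "the quick brown box")

# Per-sample scores copied from the table, averaged as in the AVERAGE row.
mixed = [100, 100, 64.7, 100, 100, 100, 100, 60.0]
lut4 = [100, 100, 47.1, 100, 100, 100, 88.2, 60.0]
mixed_avg = round(sum(mixed) / len(mixed), 1)
lut4_avg = round(sum(lut4) / len(lut4), 1)
```

The averages are plain means over the 8 clips, so a single hard clip moves the headline number by several points.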
Cross-backend leaderboard (same 8 samples, same audio encoder):
| Backend | Agreement |
|---|---|
| ONNX recommended (GPTQ INT4) | 92.7% |
| MLX recommended (mixed 8/4) | 92.2% |
| CoreML recommended (mixed 8/4) | 90.6% |
| CoreML LUT-4 baseline | 86.9% |
| ONNX RTN INT4 baseline | 87.8% |
The remaining ~2-point gap to ONNX/MLX is the LUT-vs-GPTQ scheme difference
(k-means clustering vs. activation-aware Hessian redistribution). The two
hard samples (echo, recording) are audio-quality-limited and stay
around 60-65% across all 4-bit backends.
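To make the k-means side of that comparison concrete, here is a dependency-free sketch of LUT palettization on a flat list of weights: cluster the values into 2^nbits centroids and store only the centroid index per weight. It illustrates the general idea, not the implementation inside coremltools, and it does not attempt GPTQ's activation-aware error correction.

```python
def nearest(value, lut):
    """Index of the centroid closest to value."""
    return min(range(len(lut)), key=lambda c: abs(value - lut[c]))


def palettize(weights, nbits, iterations=20):
    """Cluster weights into 2**nbits centroids; return (lut, indices)."""
    k = 2 ** nbits
    ordered = sorted(weights)
    # Deterministic init: spread the starting centroids over the sorted values.
    lut = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        indices = [nearest(w, lut) for w in weights]  # assignment step
        for c in range(k):  # update step
            members = [w for w, idx in zip(weights, indices) if idx == c]
            if members:
                lut[c] = sum(members) / len(members)
    return lut, [nearest(w, lut) for w in weights]


def dequantize(lut, indices):
    return [lut[i] for i in indices]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [-1.0, -0.9, -0.1, 0.0, 0.1, 0.9, 1.0, 1.1]
lut2, idx2 = palettize(weights, nbits=2)
lut1, idx1 = palettize(weights, nbits=1)
err2 = mse(weights, dequantize(lut2, idx2))
err1 = mse(weights, dequantize(lut1, idx1))
```

The clustering only minimizes weight reconstruction error; it never looks at activations, which is the difference from GPTQ that the paragraph above points to.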
Inference
pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
--mlpackage coreml/mega-asr-llm-embeds_mixed8_4.mlpackage \
--encoder-path onnx/audio_encoder_fp32.onnx \
--examples-dir examples \
--qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir> \
--compute-unit ALL
The pipeline:
- Mel features via Qwen3-ASR's WhisperFeatureExtractor.
- Audio encoder (ONNX fp32) → audio embeddings (F, 2048).
- Prompt + scatter: build the Qwen3-ASR chat template with English forcing, expand the single <|audio_pad|> placeholder to F slots, look up text embeds via the HF model's embed_tokens weight, scatter audio embeds at the placeholder positions.
- CoreML prefill: feed each token's embedding one-at-a-time to populate the in-model KV cache state.
- CoreML decode: greedy step-by-step until <|im_end|>.
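The prompt + scatter step can be sketched without any ML libraries. The token ids, the 2-wide vectors and both function names below are made up for illustration; inference_asr.py works on real tokenizer ids and 2048-wide embeddings.

```python
def expand_audio_placeholder(token_ids, audio_pad_id, num_frames):
    """Replace the single audio placeholder with num_frames copies."""
    expanded = []
    for tok in token_ids:
        if tok == audio_pad_id:
            expanded.extend([audio_pad_id] * num_frames)
        else:
            expanded.append(tok)
    return expanded


def scatter_audio_embeds(token_ids, embed_table, audio_embeds, audio_pad_id):
    """Text positions use the embedding table; placeholders take audio frames."""
    if token_ids.count(audio_pad_id) != len(audio_embeds):
        raise ValueError("placeholder count must match the number of audio frames")
    frames = iter(audio_embeds)
    return [
        next(frames) if tok == audio_pad_id else embed_table[tok]
        for tok in token_ids
    ]


AUDIO_PAD = 9  # hypothetical id standing in for <|audio_pad|>
embed_table = {1: [0.1, 0.1], 2: [0.2, 0.2], AUDIO_PAD: [0.0, 0.0]}
audio_embeds = [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0]]  # F = 3 frames

prompt_ids = expand_audio_placeholder([1, AUDIO_PAD, 2], AUDIO_PAD, len(audio_embeds))
inputs_embeds = scatter_audio_embeds(prompt_ids, embed_table, audio_embeds, AUDIO_PAD)
```

The resulting sequence has one vector per prompt position, and it is this sequence that is fed to the LLM during prefill.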
The KV cache lives inside the CoreML model as state. Call
model.make_state() once per request, then thread the same state object
through every predict() call.
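A minimal sketch of that state threading follows, with prefill and greedy decode sharing one state object. The input and output names ("hidden_states", "position", "logits") and the embed_fn helper are assumptions, and FakeStatefulModel only mimics the make_state() / predict(..., state=...) call shape so the loop can run without a CoreML model.

```python
def transcribe(model, prompt_embeds, embed_fn, eos_id, max_new_tokens=64):
    """Prefill one embedding at a time, then decode greedily until eos_id."""
    if not prompt_embeds:
        raise ValueError("prompt_embeds must not be empty")
    state = model.make_state()  # one fresh KV cache per request
    for position, embed in enumerate(prompt_embeds):
        inputs = {"hidden_states": embed, "position": position}
        logits = model.predict(inputs, state=state)["logits"]
    generated = []
    position = len(prompt_embeds)
    while len(generated) < max_new_tokens:
        token = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        if token == eos_id:
            break
        generated.append(token)
        inputs = {"hidden_states": embed_fn(token), "position": position}
        logits = model.predict(inputs, state=state)["logits"]
        position += 1
    return generated


class FakeStatefulModel:
    """Stand-in that emits a scripted token after each predict() call."""

    def __init__(self, script, vocab_size=4):
        self.script = script
        self.vocab_size = vocab_size

    def make_state(self):
        return {"calls": 0}

    def predict(self, inputs, state):
        token = self.script[min(state["calls"], len(self.script) - 1)]
        state["calls"] += 1  # the state object is mutated in place
        logits = [0.0] * self.vocab_size
        logits[token] = 1.0
        return {"logits": logits}


fake = FakeStatefulModel(script=[0, 1, 2, 3])
tokens = transcribe(fake, prompt_embeds=[[0.0], [0.0]], embed_fn=lambda t: [float(t)], eos_id=3)
```

Creating the state inside transcribe keeps requests isolated: a second call starts from an empty cache instead of inheriting the previous utterance's context.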
Conversion details
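# Apply per-op-name palettization: attention at LUT-8, MLP at LUT-4.
# `mlmodel` is the already-converted ct.models.MLModel.
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)
attn_ops, mlp_ops = [], []
# Collect the const (weight) ops whose names match each projection family.
prog = mlmodel._mil_program
for op in prog.functions["main"].operations:
    if op.op_type != "const":
        continue
    n = op.name.lower()
    if "self_attn" in n and any(p in n for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        attn_ops.append(op.name)
    elif "mlp" in n and any(p in n for p in ("gate_proj", "up_proj", "down_proj")):
        mlp_ops.append(op.name)
# Per-op configs: 8-bit palettes for attention, 4-bit palettes for MLP.
config = OptimizationConfig(op_name_configs={
    **{n: OpPalettizerConfig(nbits=8, group_size=8) for n in attn_ops},
    **{n: OpPalettizerConfig(nbits=4, group_size=8) for n in mlp_ops},
})
mlmodel = palettize_weights(mlmodel, config)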
The model exposes 84 attention weight ops (28 layers × 3 attention projections after the GQA-shared k/v gets clustered into k+v ops) and 84 MLP weight ops (28 layers × 3 MLP projections).
compute_precision=FLOAT32 is mandatory: fp16 compute on Qwen3-ASR
produces all-NaN logits (RMSNorm + attention score overflow).
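How an fp16 overflow turns into NaN can be shown with the standard library alone. The numbers are invented for illustration (an activation of 300 whose square exceeds the fp16 maximum of 65504); where exactly the real model overflows is not measured here.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses out-of-range values; fp16 hardware would return +/-inf.
        return math.copysign(math.inf, x)


def softmax(scores, cast=lambda x: x):
    """Numerically stable softmax; cast is applied to every intermediate."""
    peak = max(scores)
    exps = [cast(math.exp(s - peak)) for s in scores]  # inf - inf gives NaN
    total = cast(sum(exps))
    return [cast(e / total) for e in exps]


activation = to_fp16(300.0)
squared_fp16 = to_fp16(activation * activation)  # 90000 > 65504, so +inf
squared_fp32 = activation * activation  # stays finite at higher precision

probs_fp16 = softmax([squared_fp16, to_fp16(1.0)], cast=to_fp16)
probs_fp32 = softmax([squared_fp32, 1.0])
```

Once a single score becomes infinite, subtracting the maximum yields inf - inf, and the NaN then spreads through the normalization to every output.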
A coremltools local patch was needed in
coremltools/converters/mil/frontend/torch/ops.py _cast: numpy arrays
of size 1 need to be coerced to scalar via .flatten()[0].item() before
the dtype call; see convert_embeds_mixed.py setup notes.
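The coercion itself is small. The helper below is a hypothetical illustration of the .flatten()[0].item() step described above, not the actual diff applied to coremltools; the setup notes in convert_embeds_mixed.py remain the reference for the real patch.

```python
import numpy as np


def coerce_size_one(value):
    """Turn a size-1 numpy array into a plain Python scalar; pass others through."""
    if isinstance(value, np.ndarray) and value.size == 1:
        return value.flatten()[0].item()
    return value


as_int = coerce_size_one(np.array([[7]]))  # nested shape (1, 1)
as_float = coerce_size_one(np.array([1.5]))
unchanged = coerce_size_one(np.array([1, 2]))  # more than one element: left alone
```

Flattening first makes the coercion independent of how many singleton dimensions the array carries.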
Known limitations
- ANE rejected. CoreML's ANE compiler fails (MILCompilerForANE error: failed to compile ANE model using ANEF), likely due to model size + stateful KV cache. CPU_AND_NE fails to load. ALL runs on Metal GPU (correct + ~3-4× faster than CPU_ONLY), which is the recommended setting.
- Audio encoder is ONNX. The 24-layer Whisper-style encoder isn't ported to CoreML yet (ANEMLL is LLM-only). End-to-end runs the encoder via onnxruntime and the LLM via coremltools.
- Quality below ONNX/MLX by ~2 points at 4-bit, due to LUT k-means being weaker than GPTQ on this architecture. The uniform LUT-4 variant is smaller (826 MB) if size is critical; the mixed 8/4 (1.87 GB) is recommended for best quality.
Companion repos
- Reza2kn/mega-asr-onnx: full ONNX pipeline (GPTQ-INT4, 92.7%)
- Reza2kn/mega-asr-mlx: MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
- Reza2kn/mega-asr-bench: browser demo (WebGPU)
Credits
- Original model: zhifeixie/Mega-ASR (1.7B, Apache-2.0)
- CoreML conversion via ANEMLL with custom input_embeds + mixed-precision patches
- Benchmark: Voices-in-the-Wild-Bench