license: apache-2.0
language:
  - en
  - zh
  - ja
  - ko
  - multilingual
library_name: coremltools
tags:
  - coreml
  - ane
  - apple-neural-engine
  - automatic-speech-recognition
  - asr
  - speech-recognition
  - robust-asr
  - quantized
  - int4
  - 4bit
  - 8bit
  - mixed-precision
  - lut
  - palettize
  - on-device
  - apple-silicon
  - ios
  - macos
  - qwen3
  - qwen3-asr
  - mega-asr
  - anemll
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized

Mega-ASR — CoreML mixed 8/4 (end-to-end ASR)

CoreML deployment of zhifeixie/Mega-ASR (Qwen3-ASR-1.7B base) with an input_embeds-aware decoder so audio embeddings can be scattered at <|audio_pad|> positions to do real ASR — not just text generation.

Converted via ANEMLL with a custom convert_embeds_mixed.py that:

Monkey-patches QwenModel.forward + QwenForCausalLM.forward to accept pre-embedded hidden_states (skipping the internal embed_tokens lookup) so audio scatter works at inference.
Enumerates the MIL program's const-weight ops by name pattern and applies LUT-8 palettization to attention projections (q/k/v/o_proj) and LUT-4 to MLP projections (gate/up/down_proj) — mirroring the MLX mixed8_4 recipe that closed the gap to GPTQ on the LLM portion.
Runs compute_precision=FLOAT32 — fp16 compute precision produces all-NaN logits on Qwen3-ASR's RMSNorm/attention (matches the aoiandroid community finding for the same base model).
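The first step, letting forward accept pre-embedded hidden states, can be pictured with a toy model. This is only a sketch of the monkey-patching pattern: `ToyModel`, its `layers` method, and `forward_with_embeds` are hypothetical stand-ins, not the ANEMLL classes or the actual patch in convert_embeds_mixed.py.

```python
class ToyModel:
    """Hypothetical stand-in for a decoder whose forward() only takes token ids."""

    def __init__(self, embed_table):
        self.embed_table = embed_table  # token id -> embedding vector

    def embed_tokens(self, input_ids):
        return [self.embed_table[i] for i in input_ids]

    def layers(self, hidden_states):
        # Placeholder for the transformer stack: doubles every value.
        return [[2.0 * v for v in row] for row in hidden_states]

    def forward(self, input_ids):
        return self.layers(self.embed_tokens(input_ids))


_original_forward = ToyModel.forward


def forward_with_embeds(self, input_ids=None, hidden_states=None):
    """Patched forward: skip the embedding lookup when hidden_states is given."""
    if hidden_states is None:
        return _original_forward(self, input_ids)
    return self.layers(hidden_states)


# Monkey-patch the class so existing instances pick up the new signature.
ToyModel.forward = forward_with_embeds
```

With the patch applied, a caller can build the embedding sequence itself (text embeddings with audio embeddings scattered in) and hand it straight to the decoder stack.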

What's in this repo

| File | Size | Role |
| --- | --- | --- |
| `coreml/mega-asr-llm-embeds_mixed8_4.mlpackage/` | 1.87 GB | Recommended. Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, 8-bit attention + 4-bit MLP, ~5.0 bpw average. |
| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` | 826 MB | Smaller variant. Uniform LUT-4 weights. Agreement is 3.7 points lower than mixed 8/4. |
| `coreml/mega-asr-llm_lut4.mlpackage/` | 974 MB | Standalone Qwen3 text LLM with `input_ids` input (no audio scatter). |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending). |
| `tokenizer/*` | — | Original Qwen3-ASR tokenizer (`<\|audio_pad\|>`, `<asr_text>`, etc.). |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench. |
| `inference_asr.py` | — | End-to-end ASR pipeline (ONNX encoder + CoreML LLM). |
| `convert_embeds.py` / `convert_embeds_mixed.py` | — | The custom converters. |
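The "~5.0 bpw" figure for the mixed variant can be sanity-checked with a little arithmetic. The sketch below assumes the usual Qwen3-1.7B projection shapes (hidden size 2048, which matches the (F, 2048) audio embeddings; intermediate size 6144; 16 query heads and 8 KV heads of dimension 128). Those shapes are assumptions not stated in this card, and the calculation ignores lookup-table overhead and any unquantized tensors.

```python
# Assumed Qwen3-1.7B decoder shapes (not taken from this card).
HIDDEN = 2048
INTERMEDIATE = 6144
N_HEADS, N_KV_HEADS, HEAD_DIM = 16, 8, 128


def average_bpw(attn_bits, mlp_bits):
    """Average bits per weight over one layer's attention + MLP projections."""
    q = HIDDEN * N_HEADS * HEAD_DIM
    k = v = HIDDEN * N_KV_HEADS * HEAD_DIM  # GQA: fewer KV heads than query heads
    o = N_HEADS * HEAD_DIM * HIDDEN
    attn_params = q + k + v + o
    mlp_params = 3 * HIDDEN * INTERMEDIATE  # gate, up, down
    total_bits = attn_params * attn_bits + mlp_params * mlp_bits
    return total_bits / (attn_params + mlp_params)


print(average_bpw(8, 4))  # mixed 8/4
print(average_bpw(4, 4))  # uniform LUT-4
```

Under these assumptions the attention projections hold one quarter of the per-layer weights and the MLP three quarters, so the mixed recipe averages (8 + 3 × 4) / 4 = 5 bits per weight.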

Quality (bench)

Agreement (1 − WER) on 8 clips from Voices-in-the-Wild-Bench, with the prompt forced to English and the ONNX fp32 audio encoder feeding the CoreML LLM. Run with compute_units=ALL, which executes on the Metal GPU because ANE compilation fails on this model size with a stateful KV cache:

| Sample | Mixed 8/4 (recommended) | Uniform LUT-4 |
| --- | --- | --- |
| distortion | 100% | 100% |
| dropout | 100% | 100% |
| echo (hard, heavy reverb) | 64.7% | 47.1% |
| far_field | 100% | 100% |
| mixed | 100% | 100% |
| noise | 100% | 100% |
| obstructed | 100% | 88.2% |
| recording (hard, truncated audio) | 60.0% | 60.0% |
| AVERAGE | 90.6% | 86.9% |
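For readers who want to reproduce the metric, here is a minimal sketch of agreement as 1 − WER with a word-level edit distance, plus the averaging behind the AVERAGE row. The whitespace tokenization is an assumption; the benchmark's actual text normalization is not described in this card. The per-sample numbers are copied from the table above.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def mean_agreement(percentages):
    """Plain arithmetic mean of per-sample agreement percentages."""
    return sum(percentages) / len(percentages)


# Per-sample agreement (%) in table order.
MIXED_8_4 = [100, 100, 64.7, 100, 100, 100, 100, 60.0]
UNIFORM_LUT4 = [100, 100, 47.1, 100, 100, 100, 88.2, 60.0]

print(round(mean_agreement(MIXED_8_4), 1), round(mean_agreement(UNIFORM_LUT4), 1))
```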

Mixed 8/4 lifts CoreML agreement from 86.9% to 90.6% (+3.7 points) by palettizing the 4 attention projections per layer with LUT-8 (256 unique values per group of 8 channels) while keeping the 3 MLP projections at LUT-4 (16 unique values per group of 8 channels). Attention layers in Qwen3 are quality-critical, the same result we found in the MLX port.
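To make the LUT idea concrete, the toy below palettizes one flat group of weights with 1-D k-means: an n-bit lookup table holds at most 2^n values, and each weight is stored as an index into it. This is a simplified illustration, not coremltools' implementation; the function names, the quantile initialization, and the synthetic Gaussian weights are all assumptions made for the example.

```python
import random


def nearest(palette, w):
    """Index of the palette entry closest to weight w."""
    return min(range(len(palette)), key=lambda c: abs(w - palette[c]))


def palettize(weights, nbits, iters=20):
    """Toy 1-D k-means palettization. Returns (palette, indices)."""
    k = 2 ** nbits
    uniq = sorted(set(weights))
    if len(uniq) <= k:
        palette = uniq  # enough entries to store every value exactly
    else:
        # Start centroids at evenly spaced quantiles, then run Lloyd iterations.
        palette = [uniq[(i * (len(uniq) - 1)) // (k - 1)] for i in range(k)]
        for _ in range(iters):
            buckets = [[] for _ in palette]
            for w in weights:
                buckets[nearest(palette, w)].append(w)
            palette = [sum(b) / len(b) if b else palette[i]
                       for i, b in enumerate(buckets)]
    return palette, [nearest(palette, w) for w in weights]


def dequantize(palette, indices):
    return [palette[i] for i in indices]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
group = [rng.gauss(0.0, 0.02) for _ in range(200)]  # synthetic weight group
for nbits in (4, 8):
    palette, idx = palettize(group, nbits)
    print(nbits, len(palette), mse(group, dequantize(palette, idx)))
```

In this toy, the 8-bit table has more entries than the group has weights, so reconstruction is exact, while the 4-bit table has to merge nearby weights into 16 centroids. Real layers have far more weights per group than table entries, so both settings are lossy; LUT-8 simply loses much less.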

Cross-backend leaderboard (same 8 samples, same audio encoder):

| Backend | Agreement |
| --- | --- |
| ONNX recommended (GPTQ INT4) | 92.7% |
| MLX recommended (mixed 8/4) | 92.2% |
| CoreML recommended (mixed 8/4) | 90.6% |
| ONNX RTN INT4 baseline | 87.8% |
| CoreML LUT-4 baseline | 86.9% |

The remaining ~2% gap to ONNX/MLX is the LUT-vs-GPTQ scheme difference (k-means clustering vs activation-aware Hessian redistribution). The two hard samples (echo, recording) are audio-quality-limited and stay around 60-65% across all 4-bit backends.

Inference

pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-coreml
cd mega-asr-coreml
python inference_asr.py \
    --mlpackage coreml/mega-asr-llm-embeds_mixed8_4.mlpackage \
    --encoder-path onnx/audio_encoder_fp32.onnx \
    --examples-dir examples \
    --qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir> \
    --compute-unit ALL

The pipeline:

Mel features via Qwen3-ASR's WhisperFeatureExtractor.
Audio encoder (ONNX fp32) → audio embeddings (F, 2048).
Prompt + scatter: build the Qwen3-ASR chat template with English forcing, expand the single <|audio_pad|> placeholder to F slots, look up text embeddings via the HF model's embed_tokens weight, and scatter the audio embeddings at the placeholder positions.
CoreML prefill: feed each token's embedding one at a time to populate the in-model KV cache state.
CoreML decode: greedy step-by-step until <|im_end|>.
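The prompt + scatter step above can be sketched with plain Python lists. The token ids, the `AUDIO_PAD_ID` value, and both helper names are hypothetical and exist only for this illustration; inference_asr.py is the reference for the real pipeline.

```python
AUDIO_PAD_ID = 9  # hypothetical id standing in for <|audio_pad|>


def expand_audio_pad(token_ids, n_frames, pad_id=AUDIO_PAD_ID):
    """Replace the single placeholder with n_frames placeholder slots."""
    out = []
    for t in token_ids:
        out.extend([pad_id] * n_frames if t == pad_id else [t])
    return out


def scatter_audio(token_ids, text_embed, audio_embeds, pad_id=AUDIO_PAD_ID):
    """Build the embedding sequence: text rows from the lookup table,
    audio rows written into the placeholder positions, in order."""
    n_slots = sum(1 for t in token_ids if t == pad_id)
    if n_slots != len(audio_embeds):
        raise ValueError(f"{n_slots} placeholder slots but {len(audio_embeds)} audio frames")
    audio_iter = iter(audio_embeds)
    return [next(audio_iter) if t == pad_id else text_embed[t] for t in token_ids]


# Toy example: prompt [1, <audio_pad>, 2] with F = 3 audio frames of width 1.
text_embed = {1: [0.1], 2: [0.2], AUDIO_PAD_ID: [0.0]}
audio = [[1.0], [2.0], [3.0]]
ids = expand_audio_pad([1, AUDIO_PAD_ID, 2], n_frames=len(audio))
embeds = scatter_audio(ids, text_embed, audio)
```

The resulting sequence is what gets fed to the CoreML LLM position by position during prefill.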

The KV cache lives inside the CoreML model as state. Call model.make_state() once per request, then thread the same state object through every predict() call.
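A minimal sketch of that state-threading pattern follows. It relies only on the `make_state()` / `predict(..., state=...)` shape described above. The input and output names (`hidden_states`, `position`, `logits`) are assumptions rather than the actual feature names of the .mlpackage, and `FakeStatefulModel` is a stand-in so the loop can run without CoreML.

```python
def transcribe(model, prompt_embeds, embed_lookup, eos_id, max_new_tokens=64):
    """Prefill one position at a time, then decode greedily with one shared state."""
    state = model.make_state()  # once per request
    out = None
    for pos, emb in enumerate(prompt_embeds):  # prefill
        out = model.predict({"hidden_states": emb, "position": pos}, state=state)
    tokens, pos = [], len(prompt_embeds)
    for _ in range(max_new_tokens):  # greedy decode
        logits = out["logits"]
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
        out = model.predict(
            {"hidden_states": embed_lookup(next_id), "position": pos}, state=state
        )
        pos += 1
    return tokens, state


class FakeStatefulModel:
    """Stand-in with the same call shape; its 'KV cache' just counts positions."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def make_state(self):
        return {"cached": 0}

    def predict(self, inputs, state):
        state["cached"] += 1
        logits = [0.0] * self.vocab_size
        logits[state["cached"] % self.vocab_size] = 1.0
        return {"logits": logits}


tokens, state = transcribe(
    FakeStatefulModel(vocab_size=5),
    prompt_embeds=[[0.0], [0.0], [0.0]],
    embed_lookup=lambda token_id: [float(token_id)],
    eos_id=0,
)
```

Creating a fresh state inside the loop would discard the cached keys and values at every step, which is why the same object is passed to every call.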

Conversion details

# Apply per-op-name palettization: attention at LUT-8, MLP at LUT-4.
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

ATTN_PROJ = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_PROJ = ("gate_proj", "up_proj", "down_proj")

# mlmodel is the already-converted MLModel; collect const (weight) op names by pattern.
attn_ops, mlp_ops = [], []
prog = mlmodel._mil_program
for op in prog.functions["main"].operations:
    if op.op_type != "const":
        continue
    n = op.name.lower()
    if "self_attn" in n and any(p in n for p in ATTN_PROJ):
        attn_ops.append(op.name)
    elif "mlp" in n and any(p in n for p in MLP_PROJ):
        mlp_ops.append(op.name)

# One palettizer config per matched op name.
config = OptimizationConfig(op_name_configs={
    **{n: OpPalettizerConfig(nbits=8, group_size=8) for n in attn_ops},
    **{n: OpPalettizerConfig(nbits=4, group_size=8) for n in mlp_ops},
})
mlmodel = palettize_weights(mlmodel, config)

The model exposes 84 attention weight ops (28 layers × 3 attention projections after the GQA-shared k/v gets clustered into k+v ops) and 84 MLP weight ops (28 layers × 3 MLP projections).

compute_precision=FLOAT32 is mandatory — fp16 compute on Qwen3-ASR produces all-NaN logits (RMSNorm + attention score overflow).
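The overflow mechanism can be illustrated without any ML framework. The sketch below emulates fp16 rounding with the standard library and shows how an attention score that exceeds the fp16 maximum (65504) turns into inf and then into NaN inside softmax. The activation magnitudes are made up for the demonstration and are not measured from Qwen3-ASR; `wide` simply uses Python floats as a stand-in for fp32.

```python
import math
import struct


def fp16(x):
    """Round to IEEE half precision; values beyond the fp16 range become ±inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dot(q, k, cast):
    """Dot product with every intermediate result cast to the compute precision."""
    acc = 0.0
    for a, b in zip(q, k):
        acc = cast(acc + cast(a * b))
    return acc


def softmax(scores, cast):
    m = max(scores)
    exps = [cast(math.exp(cast(s - m))) for s in scores]  # inf - inf -> nan
    total = cast(sum(exps))
    return [cast(e / total) for e in exps]


query = [300.0] * 4                  # hypothetical large activations
keys = [[300.0] * 4, [250.0] * 4]

half = softmax([dot(query, k, fp16) for k in keys], fp16)
wide = softmax([dot(query, k, float) for k in keys], float)
print(half, wide)
```

Once a single score saturates to inf, subtracting the row maximum yields inf − inf = NaN, and the NaN then propagates through every later layer.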

A coremltools local patch was needed in coremltools/converters/mil/frontend/torch/ops.py _cast: numpy arrays of size 1 need to be coerced to scalar via .flatten()[0].item() before the dtype call — see convert_embeds_mixed.py setup notes.

Known limitations

ANE rejected. CoreML's ANE compiler fails (MILCompilerForANE error: failed to compile ANE model using ANEF), likely due to model size plus the stateful KV cache. CPU_AND_NE fails to load. ALL runs on the Metal GPU (correct output, ~3-4× faster than CPU_ONLY) and is the recommended setting.
Audio encoder is ONNX. The 24-layer Whisper-style encoder isn't ported to CoreML yet (ANEMLL is LLM-only). End-to-end runs the encoder via onnxruntime and the LLM via coremltools.
Quality is ~2 points below ONNX/MLX at 4-bit, due to LUT k-means being weaker than GPTQ on this architecture. The uniform LUT-4 variant (826 MB) is the smaller option if size is critical; mixed 8/4 (1.87 GB) is recommended for best quality.

Companion repos

Reza2kn/mega-asr-onnx — full ONNX pipeline (GPTQ-INT4, 92.7%)
Reza2kn/mega-asr-mlx — MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
Reza2kn/mega-asr-bench — browser demo (WebGPU)

Credits

Original model: zhifeixie/Mega-ASR (1.7B, Apache-2.0)
CoreML conversion via ANEMLL with custom input_embeds + mixed-precision patches
Benchmark: Voices-in-the-Wild-Bench