README.md · Reza2kn/mega-asr-coreml at main

mega-asr-coreml / README.md

Reza2kn

Add mixed 8/4 CoreML (90.6% on VITW) — new recommended variant

005e85f verified about 3 hours ago

preview code

raw

history blame contribute delete

8.06 kB

	---
	license: apache-2.0
	language:
	- en
	- zh
	- ja
	- ko
	- multilingual
	library_name: coremltools
	tags:
	- coreml
	- ane
	- apple-neural-engine
	- automatic-speech-recognition
	- asr
	- speech-recognition
	- robust-asr
	- quantized
	- int4
	- 4bit
	- 8bit
	- mixed-precision
	- lut
	- palettize
	- on-device
	- apple-silicon
	- ios
	- macos
	- qwen3
	- qwen3-asr
	- mega-asr
	- anemll
	pipeline_tag: automatic-speech-recognition
	base_model: zhifeixie/Mega-ASR
	base_model_relation: quantized
	---

	# Mega-ASR — CoreML mixed 8/4 (end-to-end ASR)

	CoreML deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR)
	(Qwen3-ASR-1.7B base) with an `input_embeds`-aware decoder so audio
	embeddings can be scattered at `<\|audio_pad\|>` positions to do real ASR —
	not just text generation.

	Converted via [ANEMLL](https://github.com/Anemll/Anemll) with a custom
	`convert_embeds_mixed.py` that:
	1. Monkey-patches `QwenModel.forward` + `QwenForCausalLM.forward` to accept
	pre-embedded `hidden_states` (skipping the internal `embed_tokens`
	lookup) so audio scatter works at inference.
	2. Enumerates the MIL program's const-weight ops by name pattern and applies
	LUT-8 palettization to attention projections (q/k/v/o_proj) and
	LUT-4 to MLP projections (gate/up/down_proj) — mirroring the MLX
	`mixed8_4` recipe that closed the gap to GPTQ on the LLM portion.
	3. Runs `compute_precision=FLOAT32` — fp16 compute precision produces
	all-NaN logits on Qwen3-ASR's RMSNorm/attention (matches the aoiandroid
	community finding for the same base model).

	## What's in this repo

	\| File \| Size \| Role \|
	\| --- \| ---: \| --- \|
	\| `coreml/mega-asr-llm-embeds_mixed8_4.mlpackage/` \| 1.87 GB \| Recommended. Qwen3 1.7B LLM, `inputs_embeds` input, fp32 compute, 8-bit attn + 4-bit MLP, ~5.0 bpw avg. \|
	\| `coreml/mega-asr-llm-embeds_fp32compute_lut4.mlpackage/` \| 826 MB \| Smaller variant. Uniform LUT-4 weights. -3.7% agreement vs mixed. \|
	\| `coreml/mega-asr-llm_lut4.mlpackage/` \| 974 MB \| Standalone Qwen3 text LLM with `input_ids` input (no audio scatter). \|
	\| `onnx/audio_encoder_fp32.onnx` \| 1.27 GB \| 24-layer Whisper-style audio encoder (ONNX, runs via onnxruntime; CoreML port pending) \|
	\| `tokenizer/*` \| — \| Original Qwen3-ASR tokenizer (`<\\|audio_pad\\|>`, `<asr_text>`, etc.) \|
	\| `examples/*.wav` \| ~3 MB \| 8 noisy benchmark clips from Voices-in-the-Wild-Bench \|
	\| `inference_asr.py` \| — \| End-to-end ASR pipeline (ONNX encoder + CoreML LLM) \|
	\| `convert_embeds.py` / `convert_embeds_mixed.py` \| — \| The custom converters \|

	## Quality (bench)

	8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
	agreement (1 − WER), prompt forced to `language English`, ONNX fp32
	audio encoder + the CoreML LLM, ran with `compute_units=ALL` (Metal GPU
	since ANE compilation fails on this model size + stateful KV cache):

	\| Per-sample \| Mixed 8/4 (recommended) \| Uniform LUT-4 \|
	\| --- \| ---: \| ---: \|
	\| distortion \| 100% \| 100% \|
	\| dropout \| 100% \| 100% \|
	\| echo (hard, heavy reverb) \| 64.7% \| 47.1% \|
	\| far_field \| 100% \| 100% \|
	\| mixed \| 100% \| 100% \|
	\| noise \| 100% \| 100% \|
	\| obstructed \| 100% \| 88.2% \|
	\| recording (hard, truncated audio) \| 60.0% \| 60.0% \|
	\| AVERAGE \| 90.6% \| 86.9% \|

	Mixed 8/4 lifts CoreML from 86.9% → 90.6% (+3.7) by allocating the 4
	attention projections per layer to LUT-8 (16 unique values for every 8
	channels) while keeping the 3 MLP projections at LUT-4 (16 unique values
	per 8 channels). Attention layers in Qwen3 are quality-critical — same
	result we found in the MLX port.

	Cross-backend leaderboard (same 8 samples, same audio encoder):

	\| Backend \| Agreement \|
	\| --- \| ---: \|
	\| ONNX recommended (GPTQ INT4) \| 92.7% \|
	\| MLX recommended (mixed 8/4) \| 92.2% \|
	\| CoreML recommended (mixed 8/4) \| 90.6% \|
	\| CoreML LUT-4 baseline \| 86.9% \|
	\| ONNX RTN INT4 baseline \| 87.8% \|

	The remaining ~2% gap to ONNX/MLX is the LUT-vs-GPTQ scheme difference
	(k-means clustering vs activation-aware Hessian redistribution). The two
	hard samples (`echo`, `recording`) are audio-quality-limited and stay
	around 60-65% across all 4-bit backends.

	## Inference

	```bash
	pip install coremltools onnxruntime soundfile transformers safetensors librosa numpy
	git clone https://huggingface.co/Reza2kn/mega-asr-coreml
	cd mega-asr-coreml
	python inference_asr.py \
	--mlpackage coreml/mega-asr-llm-embeds_mixed8_4.mlpackage \
	--encoder-path onnx/audio_encoder_fp32.onnx \
	--examples-dir examples \
	--qwen-asr-dir <local path to Qwen3-ASR-1.7B HF dir> \
	--compute-unit ALL
	```

	The pipeline:
	1. Mel features via Qwen3-ASR's `WhisperFeatureExtractor`.
	2. Audio encoder (ONNX fp32) → audio embeddings `(F, 2048)`.
	3. Prompt + scatter: build the Qwen3-ASR chat template with English
	forcing, expand the single `<\|audio_pad\|>` placeholder to F slots,
	lookup text embeds via the HF model's `embed_tokens` weight, scatter
	audio embeds at the placeholder positions.
	4. CoreML prefill: feed each token's embedding one-at-a-time to
	populate the in-model KV cache state.
	5. CoreML decode: greedy step-by-step until `<\|im_end\|>`.

	The KV cache lives inside the CoreML model as `state`. Call
	`model.make_state()` once per request, then thread the same state object
	through every `predict()` call.

	## Conversion details

	```python
	# Apply per-op-name palettize: attention at LUT-8, MLP at LUT-4.
	prog = mlmodel._mil_program
	for op in prog.functions["main"].operations:
	if op.op_type != "const": continue
	n = op.name.lower()
	if "self_attn" in n and any(p in n for p in ("q_proj","k_proj","v_proj","o_proj")):
	attn_ops.append(op.name)
	elif "mlp" in n and any(p in n for p in ("gate_proj","up_proj","down_proj")):
	mlp_ops.append(op.name)

	config = OptimizationConfig(op_name_configs={
	**{n: OpPalettizerConfig(nbits=8, group_size=8) for n in attn_ops},
	**{n: OpPalettizerConfig(nbits=4, group_size=8) for n in mlp_ops},
	})
	mlmodel = palettize_weights(mlmodel, config)
	```

	The model exposes 84 attention weight ops (28 layers × 3 attention
	projections after the GQA-shared k/v gets clustered into k+v ops) and
	84 MLP weight ops (28 layers × 3 MLP projections).

	`compute_precision=FLOAT32` is mandatory — fp16 compute on Qwen3-ASR
	produces all-NaN logits (RMSNorm + attention score overflow).

	A `coremltools` local patch was needed in
	`coremltools/converters/mil/frontend/torch/ops.py` `_cast`: numpy arrays
	of size 1 need to be coerced to scalar via `.flatten()[0].item()` before
	the dtype call — see `convert_embeds_mixed.py` setup notes.

	## Known limitations

	1. ANE rejected. CoreML's ANE compiler fails (`MILCompilerForANE
	error: failed to compile ANE model using ANEF`) — likely due to model
	size + stateful KV cache. `CPU_AND_NE` fails to load. `ALL` runs on
	Metal GPU (correct + ~3-4× faster than `CPU_ONLY`), which is the
	recommended setting.
	2. Audio encoder is ONNX. The 24-layer Whisper-style encoder isn't
	ported to CoreML yet (ANEMLL is LLM-only). End-to-end runs the
	encoder via `onnxruntime` and the LLM via `coremltools`.
	3. Quality below ONNX/MLX by ~2% at 4-bit, due to LUT k-means being
	weaker than GPTQ on this architecture. The uniform LUT-4 variant is
	smaller (826 MB) if size is critical; the mixed 8/4 (1.87 GB) is
	recommended for best quality.

	## Companion repos

	- [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx) — full ONNX pipeline (GPTQ-INT4, 92.7%)
	- [Reza2kn/mega-asr-mlx](https://huggingface.co/Reza2kn/mega-asr-mlx) — MLX 4-bit (mixed 8/4 attn/MLP, 92.2%)
	- [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench) — browser demo (WebGPU)

	## Credits

	- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B, Apache-2.0)
	- CoreML conversion via [ANEMLL](https://github.com/Anemll/Anemll) with custom input_embeds + mixed-precision patches
	- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)