aufklarer posted an update 2 days ago
We benchmarked https://github.com/soniqo/speech-swift, our open-source Swift library for on-device speech AI, against Whisper Large v3 (FP16) on LibriSpeech test-clean.

Three models beat it, spanning two architectural approaches:

Qwen3-ASR (LALM — Qwen3 LLM as ASR decoder, AuT encoder pretrained on ~40M hours) hits 2.35% WER at 1.7B 8-bit, running at 43x real-time on MLX. Greedy decoding matches beam search — the LLM decoder is strong enough that the greedy path is nearly always optimal.
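The greedy-matches-beam point can be illustrated with a toy autoregressive decoder (names and numbers are hypothetical, not the Qwen3-ASR or speech-swift API): when the per-step distributions are sharply peaked, deviating from the argmax at any step costs more than any downstream gain, so greedy and beam search land on the same sequence.

```python
# Toy sketch: a "model" whose logits depend on the prefix, with sharply
# peaked per-step distributions. Hypothetical stand-in, not a real ASR model.
def logits(prefix, vocab=3):
    last = prefix[-1] if prefix else 0
    # Strongly prefer token (last + 1) % vocab at every step.
    return [0.0 if t == (last + 1) % vocab else -5.0 for t in range(vocab)]

def greedy(steps):
    seq = []
    for _ in range(steps):
        row = logits(seq)
        seq.append(max(range(len(row)), key=row.__getitem__))
    return seq

def beam(steps, width=3):
    beams = [([], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(steps):
        cands = []
        for seq, sc in beams:
            row = logits(seq)
            cands += [(seq + [t], sc + row[t]) for t in range(len(row))]
        beams = sorted(cands, key=lambda c: c[1], reverse=True)[:width]
    return beams[0][0]

print(greedy(4) == beam(4))  # True: peaked distributions make the paths coincide
```

With flatter distributions the two would diverge; a strong LLM decoder keeps them peaked, which is why greedy is nearly always optimal here.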

Parakeet TDT (non-autoregressive transducer — FastConformer + TDT joint network) hits 2.74% WER in 634 MB as a CoreML INT8 model on the Neural Engine. No generative hallucination by design. Leaves GPU completely free.
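The TDT idea in one sketch: the joint network predicts a (token, duration) pair at each step, so the decoder jumps ahead by the predicted duration instead of visiting every encoder frame. The `predict` function below is a hypothetical stand-in, not NVIDIA's implementation.

```python
# Hypothetical TDT-style decode loop: each step emits a token (or blank)
# plus a duration, and the frame index advances by that duration, so the
# number of decode steps is far smaller than the number of frames.
BLANK = None

def predict(frame):
    # Toy rule: a token on every 4th frame, blank otherwise; duration 4
    # lets the decoder skip straight over the in-between frames.
    token = frame // 4 if frame % 4 == 0 else BLANK
    return token, 4

def tdt_decode(num_frames):
    tokens, steps, t = [], 0, 0
    while t < num_frames:
        token, duration = predict(t)
        if token is not BLANK:
            tokens.append(token)
        t += max(1, duration)  # duration-conditioned frame skipping
        steps += 1
    return tokens, steps

tokens, steps = tdt_decode(32)
print(tokens, steps)  # 8 tokens in 8 decode steps for 32 frames
```

Because output is tied to input frames rather than free-running generation, there is no mechanism for the model to "keep talking" past the audio, which is the hallucination-by-design point.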

Two findings worth flagging:
- 4-bit quantization is catastrophic for non-English: Korean 6.89% → 19.95% WER on FLEURS. Use 8-bit for multilingual.
- On CoreML, INT8 is 3.3x *faster* than INT4, the opposite of GPU behavior: the ANE has native INT8 MAC units, while INT4 weights go through lookup-table indirection.
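Why 4-bit falls off a cliff is easy to see with toy symmetric per-tensor quantization: INT4 has only 16 levels against INT8's 256, so round-trip error grows by roughly an order of magnitude. This is illustrative only; real 4-bit schemes (group-wise scales, NF4, etc.) mitigate the gap but, per the FLEURS numbers above, not enough for low-resource multilingual speech.

```python
# Toy symmetric per-tensor quantize/dequantize round trip.
# Illustrative weights, not real model tensors.
def quantize_roundtrip(weights, bits):
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

weights = [(-1) ** i * (i % 97) / 97 for i in range(1000)]
err8 = mean_abs_error(weights, quantize_roundtrip(weights, 8))
err4 = mean_abs_error(weights, quantize_roundtrip(weights, 4))
print(err8, err4)  # err4 is an order of magnitude larger than err8
```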

All numbers reproducible in 15 minutes.

Full article: https://blog.ivan.digital/we-beat-whisper-large-v3-with-a-600m-model-running-entirely-on-your-mac-20e6ce191174

Library: https://github.com/soniqo/speech-swift

Models: Qwen/Qwen3-ASR-0.6B, Qwen/Qwen3-ASR-1.7B, nvidia/parakeet-tdt-0.6b-v2