# Qwen3-ASR-1.7B CoreML

CoreML conversion of Qwen/Qwen3-ASR-1.7B for on-device speech-to-text on macOS/iOS.
Optimized with FP32 compute + ANEMLL RMSNorm + mixed-precision INT8 to prevent decoder overflow while keeping the model size small.
## Files

| File | Description | Size |
|---|---|---|
| `qwen3_asr_encoder_int8.mlpackage/` | Audio encoder (Conv2D + 24-layer Transformer + projection). FP16 compute, INT8 per-channel weights. | 304 MB |
| `qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/` | Decoder (28-layer causal LLM + LM head). FP32 compute, ANEMLL RMSNorm, mixed-precision INT8. | 2.8 GB |
| `qwen3_asr_embeddings.bin` | Token embedding table (FP16, shape 151936 x 2048). Loaded separately to avoid quantizing embeddings. | 594 MB |
## Architecture

    Audio (16 kHz) → Mel spectrogram (128 bins)
      → Encoder (100-frame chunks → 13 tokens each)
      → Embedding lookup + prompt construction
      → Decoder (autoregressive, RoPE as inputs, KV cache)
      → Token IDs → Text
### Encoder

- 3x Conv2D (stride 2) + 24-layer Transformer + 2-layer projection
- Input: `[1, 1, 128, 100]` mel chunk → Output: `[1, 13, 2048]` audio features
- Bidirectional attention, sinusoidal positional embeddings (reset per chunk)
- INT8 per-channel quantization (first conv layer kept at FP16)
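For illustration, linear-symmetric per-channel INT8 quantization (the scheme named in the conversion details) can be sketched in a few lines of numpy. The function names and toy shapes here are illustrative, not taken from the actual conversion scripts:

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-channel INT8: one scale per output channel (row)."""
    # Choose each scale so the channel's max magnitude maps to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)  # toy weight matrix
q, scales = quantize_int8_per_channel(w)
max_err = np.abs(w - dequantize(q, scales)).max()
# Rounding error is bounded by half a quantization step per channel.
assert max_err <= scales.max() / 2 + 1e-6
```

Per-channel scales are what keep the error small when channel magnitudes differ widely, which is why the first conv layer (with its unusual dynamic range) is left at FP16 instead.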
### Decoder

- 28-layer causal Transformer with GQA (16 Q heads, 8 KV heads, head_dim=128)
- SwiGLU MLP (intermediate_size=11008)
- RoPE theta=1,000,000 (NeoX/split-half style, passed as cos/sin inputs)
- ANEMLL-style RMSNorm: `LayerNorm([x, -x])` trick for numerical stability
- Mixed-precision INT8: first/last layer, norms, LM head kept at FP16
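Since RoPE is passed in as cos/sin inputs, the host app must precompute the tables. A minimal numpy sketch assuming the standard NeoX split-half convention with the parameters above (head_dim=128, theta=1e6); function names are illustrative:

```python
import numpy as np

def rope_tables(positions: np.ndarray, head_dim: int = 128, theta: float = 1_000_000.0):
    """Precompute RoPE cos/sin tables, NeoX/split-half convention.

    Frequencies span the first half of the head dim and are duplicated so
    the tables have shape [num_positions, head_dim].
    """
    inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)  # [head_dim/2]
    angles = np.outer(positions, inv_freq)                            # [T, head_dim/2]
    angles = np.concatenate([angles, angles], axis=-1)                # [T, head_dim]
    return np.cos(angles).astype(np.float32), np.sin(angles).astype(np.float32)

def apply_rope(x: np.ndarray, cos: np.ndarray, sin: np.ndarray) -> np.ndarray:
    """Split-half rotation: rotate_half(x) = [-x2, x1]."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return x * cos + np.concatenate([-x2, x1], axis=-1) * sin

cos, sin = rope_tables(np.arange(16))
q = np.ones((16, 128), dtype=np.float32)
q_rot = apply_rope(q, cos, sin)
```

Position 0 is the identity rotation, and rotation preserves per-head vector norms, which makes the tables easy to sanity-check before feeding them to the decoder.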
## Key Quality Improvements

- FP32 compute for decoder → hidden-state magnitudes reach 10,876 at layer 25; x^2 ≈ 118M >> FP16 max 65,504. FP32 is mandatory.
- ANEMLL RMSNorm → uses the native `layer_norm` op via `[x, -x]` concatenation for GPU/ANE precision.
- Mixed-precision INT8 → skips 72 sensitive ops (embeddings, first/last layer, all norms, LM head).
- RoPE as inputs → precomputed sin/cos passed in to avoid in-model precision loss.
- Embeddings extracted → the 151,936-token embedding table is stored as raw FP16 binary, not quantized.
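The first two points can be checked numerically. A toy numpy sketch (eps value and shapes are assumptions): squaring a layer-25-scale activation overflows FP16, and LayerNorm over the `[x, -x]` concatenation reproduces RMSNorm exactly, because the doubled vector has zero mean and variance equal to mean(x^2):

```python
import numpy as np

# (1) FP16 overflow: 10,876^2 ≈ 1.18e8, far beyond the FP16 max of 65,504.
h16 = np.float16(10876.0)
assert np.isinf(h16 * h16)  # overflows to inf in FP16; fine in FP32

# (2) ANEMLL RMSNorm via the native LayerNorm op.
def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

x = np.random.default_rng(1).normal(size=(4, 2048)).astype(np.float32)
# mean([x, -x]) = 0 and var([x, -x]) = mean(x^2), so LayerNorm over the
# concatenation equals RMSNorm on each half.
y = layer_norm(np.concatenate([x, -x], axis=-1))[..., :2048]
assert np.allclose(y, rms_norm(x), atol=1e-4)
```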
## Quality Benchmarks

### English (JFK speech, 11s)
| Pipeline | Transcription | Match |
|---|---|---|
| PyTorch FP32 | "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." | Reference |
| CoreML INT8 | Same text (minor punctuation difference: "." vs ";") | MATCH |
### Taiwan Mandarin (Taiwan-Tongues test set, 20 samples, 60s total)
| Metric | Value |
|---|---|
| PyTorch vs CoreML match rate | 19/20 (95%) |
| Average CER (PyTorch) | 0.2009 |
| Average CER (CoreML INT8) | 0.2064 |
| CER difference | +0.0056 |
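CER here is character-level edit distance divided by reference length. A minimal sketch of the metric (not the evaluation script behind the table above):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # single-row dynamic programming table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

# One substituted character out of four -> CER 0.25.
print(cer("你好世界", "你好试界"))
```

Character-level (rather than word-level) scoring is the standard choice for Mandarin, where word segmentation is ambiguous.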
## Usage Notes

- macOS 14+ required (minimum deployment target)
- Decoder runs on GPU only (FP32 compute prevents ANE execution)
- KV cache must be initialized with 1 dummy slot (CoreML crashes on size-0 dynamic dims); mask the dummy position with `-inf`
- Encoder processes mel in 100-frame chunks → split mel, run encoder per chunk, concatenate outputs
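The chunking and dummy-slot notes above can be sketched in numpy; shapes follow the encoder spec earlier in this card, and the function name is illustrative:

```python
import numpy as np

CHUNK = 100  # encoder expects [1, 1, 128, 100] mel chunks

def chunk_mel(mel: np.ndarray) -> np.ndarray:
    """Split a [128, T] mel spectrogram into 100-frame chunks, zero-padding the last."""
    n_mels, t = mel.shape
    n_chunks = -(-t // CHUNK)  # ceiling division
    padded = np.zeros((n_mels, n_chunks * CHUNK), dtype=mel.dtype)
    padded[:, :t] = mel
    # [n_chunks, 1, 1, 128, 100]: each slice along axis 0 is one encoder input.
    return padded.reshape(n_mels, n_chunks, CHUNK).transpose(1, 0, 2)[:, None, None]

mel = np.random.default_rng(2).random((128, 250)).astype(np.float32)
chunks = chunk_mel(mel)  # 3 chunks; run the encoder on each, then concatenate features

# KV-cache dummy slot: start with one placeholder entry and mask it with -inf
# so attention never reads it.
kv_len = 1
attn_mask = np.zeros((1, 1, 1, kv_len + 1), dtype=np.float32)
attn_mask[..., 0] = -np.inf  # dummy position
```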
## Conversion Details

Converted using coremltools 9.0 with PyTorch 2.10.0. Conversion scripts are available at the source repository.

    compute_precision: FLOAT32 (decoder), FLOAT16 (encoder)
    minimum_deployment_target: macOS14
    quantization: linear_symmetric INT8 per-channel (mixed-precision for decoder)
## License

Apache 2.0 (same as the base model)