aoiandroid committed on
Commit 3fa6650 · verified · 1 parent: 597fc62

Clone from weiren119/Qwen3-ASR-1.7B-CoreML
README.md ADDED
---
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
tags:
- coreml
- asr
- speech-to-text
- qwen3
- on-device
- apple
- macos
language:
- zh
- en
- ja
- ko
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: coremltools
---

# Qwen3-ASR-1.7B CoreML

CoreML conversion of [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) for on-device speech-to-text on macOS/iOS.

Optimized with **FP32 compute + ANEMLL RMSNorm + mixed-precision INT8** to prevent decoder overflow while keeping the model size small.

## Files

| File | Description | Size |
|------|-------------|------|
| `qwen3_asr_encoder_int8.mlpackage/` | Audio encoder (Conv2D + 24-layer Transformer + projection). FP16 compute, INT8 per-channel weights. | 304 MB |
| `qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/` | Decoder (28-layer causal LLM + LM head). FP32 compute, ANEMLL RMSNorm, mixed-precision INT8. | 2.8 GB |
| `qwen3_asr_embeddings.bin` | Token embedding table (FP16, shape 151936 x 2048). Loaded separately to avoid quantizing the embeddings. | 594 MB |
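Because `qwen3_asr_embeddings.bin` is a raw FP16 dump (151,936 × 2,048 × 2 bytes = 622,329,856 bytes, matching the LFS pointer size), it can be memory-mapped and indexed directly instead of being loaded through CoreML. A minimal sketch with NumPy — the tiny synthetic table here stands in for the real file, and the loading convention (row-major, no header) is an assumption:

```python
import numpy as np

VOCAB, DIM = 151936, 2048  # real table: 151936 * 2048 * 2 bytes = 622,329,856

def load_embeddings(path, vocab=VOCAB, dim=DIM):
    """Memory-map the raw FP16 embedding table without copying it into RAM."""
    return np.memmap(path, dtype=np.float16, mode="r", shape=(vocab, dim))

def embed(table, token_ids):
    """Gather rows for a list of token ids -> [len(ids), dim] FP16 matrix."""
    return np.asarray(table[token_ids])

# Demo with a small synthetic table (same layout, smaller shape):
demo = np.arange(8 * 4, dtype=np.float16).reshape(8, 4)
demo.tofile("demo_embeddings.bin")
table = load_embeddings("demo_embeddings.bin", vocab=8, dim=4)
vecs = embed(table, [0, 3, 7])
```

`np.memmap` keeps the 594 MB table out of resident memory until rows are actually touched, which is why extracting the embeddings from the CoreML package is cheap at inference time.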

## Architecture

```
Audio (16kHz) → Mel Spectrogram (128 bins)
→ Encoder (100-frame chunks → 13 tokens each)
→ Embedding lookup + prompt construction
→ Decoder (autoregressive, RoPE as inputs, KV cache)
→ Token IDs → Text
```

### Encoder

- 3x Conv2D (stride 2) + 24-layer Transformer + 2-layer projection
- Input: `[1, 1, 128, 100]` mel chunk → Output: `[1, 13, 2048]` audio features
- Bidirectional attention, sinusoidal positional embeddings (reset per chunk)
- INT8 per-channel quantization (first conv layer kept at FP16)
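The chunked encoder pass above can be sketched as a simple loop: split the mel spectrogram along time, pad the final partial chunk, run the encoder once per chunk, and concatenate the 13-token outputs. The `encode_chunk` stub below stands in for the CoreML encoder call, and zero-padding the tail is an assumption about how partial chunks are handled:

```python
import numpy as np

CHUNK = 100            # mel frames per encoder call
TOKENS_PER_CHUNK = 13  # audio tokens produced per chunk
DIM = 2048

def encode_chunk(mel_chunk):
    """Stand-in for the CoreML encoder: [1, 1, 128, 100] -> [1, 13, 2048]."""
    assert mel_chunk.shape == (1, 1, 128, CHUNK)
    return np.zeros((1, TOKENS_PER_CHUNK, DIM), dtype=np.float32)

def encode(mel):
    """Split mel [1, 1, 128, T] into 100-frame chunks, zero-pad the tail,
    run the encoder per chunk, and concatenate along the token axis."""
    T = mel.shape[-1]
    outs = []
    for start in range(0, T, CHUNK):
        chunk = mel[..., start:start + CHUNK]
        if chunk.shape[-1] < CHUNK:  # pad the final partial chunk
            pad = CHUNK - chunk.shape[-1]
            chunk = np.pad(chunk, ((0, 0), (0, 0), (0, 0), (0, pad)))
        outs.append(encode_chunk(chunk))
    return np.concatenate(outs, axis=1)  # [1, 13 * n_chunks, 2048]

features = encode(np.random.randn(1, 1, 128, 250).astype(np.float32))
```

With 250 mel frames this runs three encoder calls and yields a `[1, 39, 2048]` feature tensor, consistent with the fixed `[1, 1, 128, 100]` → `[1, 13, 2048]` encoder signature.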

### Decoder

- 28-layer causal Transformer with GQA (16 Q heads, 8 KV heads, head_dim=128)
- SwiGLU MLP (intermediate_size=11008)
- RoPE theta=1,000,000 (NeoX/split-half style, passed as cos/sin inputs)
- ANEMLL-style RMSNorm: `LayerNorm([x, -x])` trick for numerical stability
- Mixed-precision INT8: first/last layer, norms, and LM head kept at FP16
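The `LayerNorm([x, -x])` identity behind the ANEMLL-style RMSNorm is easy to verify: concatenating `x` with `-x` forces the mean to zero, so LayerNorm's variance reduces to `mean(x²)` and the first half of its output equals RMSNorm. A NumPy check (learnable weights omitted for clarity; inside the model this maps onto CoreML's native `layer_norm` op):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Reference RMSNorm (no learnable weight)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def layer_norm(x, eps=1e-6):
    """Plain LayerNorm (no affine), the op CoreML provides natively."""
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.mean((x - mu) ** 2, axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def anemll_rms_norm(x, eps=1e-6):
    """RMSNorm via LayerNorm over [x, -x]: the concatenation has zero mean,
    so its variance is exactly mean(x**2)."""
    doubled = np.concatenate([x, -x], axis=-1)
    return layer_norm(doubled, eps)[..., : x.shape[-1]]

x = np.random.randn(2, 8)
```

Routing RMSNorm through the hardware `layer_norm` kernel is what buys the GPU/ANE precision mentioned under Key Quality Improvements, at the cost of doubling the normalized width.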

### Key Quality Improvements

1. **FP32 compute for decoder** — Hidden-state values reach ~10,876 at layer 25, and squaring them (≈1.2 × 10⁸) far exceeds the FP16 maximum of 65,504, so FP32 is mandatory.
2. **ANEMLL RMSNorm** — Uses the native `layer_norm` op via `[x, -x]` concatenation for GPU/ANE precision.
3. **Mixed-precision INT8** — Skips 72 sensitive ops (embeddings, first/last layer, all norms, LM head).
4. **RoPE as inputs** — Precomputed sin/cos are passed in to avoid in-model precision loss.
5. **Embeddings extracted** — The 151,936-token embedding table is stored as a raw FP16 binary, not quantized.
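Precomputing RoPE on the host is straightforward: with theta = 1,000,000 and the NeoX/split-half layout, the cos/sin tables depend only on position and head_dim, so they can be built once per step and fed to the decoder as inputs. A sketch of the tables and the split-half rotation, following the usual Qwen/NeoX convention (the exact input names expected by the CoreML decoder are not shown here):

```python
import numpy as np

def rope_tables(positions, head_dim=128, theta=1_000_000.0):
    """cos/sin tables of shape [len(positions), head_dim], NeoX split-half
    layout: head_dim/2 frequencies, tiled across both halves."""
    inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)
    freqs = np.outer(positions, inv_freq)          # [T, head_dim/2]
    emb = np.concatenate([freqs, freqs], axis=-1)  # [T, head_dim]
    return np.cos(emb), np.sin(emb)

def rotate_half(x):
    """NeoX split-half rotation: (x1, x2) -> (-x2, x1)."""
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def apply_rope(x, cos, sin):
    """Rotate query/key vectors by the precomputed angles."""
    return x * cos + rotate_half(x) * sin

cos, sin = rope_tables(np.arange(4), head_dim=8)
q = np.random.randn(4, 8)
q_rot = apply_rope(q, cos, sin)
```

Since the rotation is a pure function of position, computing it in FP64/FP32 on the host and passing the results in sidesteps any precision loss that an in-graph `pow`/`sin`/`cos` chain would incur.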

## Quality Benchmarks

### English (JFK speech, 11s)

| Pipeline | Transcription | Match |
|----------|---------------|-------|
| PyTorch FP32 | "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." | Reference |
| CoreML INT8 | Same text (minor punctuation difference: "." vs ";") | **MATCH** |

### Taiwan Mandarin (Taiwan-Tongues test set, 20 samples, 60s total)

| Metric | Value |
|--------|-------|
| PyTorch vs CoreML match rate | **19/20 (95%)** |
| Average CER (PyTorch) | 0.2009 |
| Average CER (CoreML INT8) | 0.2064 |
| CER difference | **+0.0056** |
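For reference, CER as used above is the character-level edit distance divided by the reference length. A minimal Levenshtein-based implementation — illustrative only, not necessarily the exact scorer behind the table:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit_distance(ref, hyp) / len(ref)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # DP row: distance from ref[:0] to hyp[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)

score = cer("你好世界", "你好实界")  # one substitution over four characters
```

A +0.0056 CER gap on a ~0.20 baseline corresponds to roughly one extra character error per 180 reference characters, which is consistent with the 19/20 exact-match rate.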

## Usage Notes

- **macOS 14+** required (minimum deployment target)
- Decoder runs on **GPU only** (FP32 compute prevents ANE execution)
- KV cache must be initialized with **1 dummy slot** (CoreML crashes on size-0 dynamic dims); mask the dummy position with `-inf`
- Encoder processes mel in **100-frame chunks** — split the mel, run the encoder per chunk, and concatenate the outputs
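The dummy-slot workaround amounts to allocating the cache with one extra KV position and writing `-inf` into the attention mask at that position, so softmax assigns it exactly zero weight. A NumPy illustration of the masking step (shapes simplified; the real cache and mask live inside the CoreML decoder's inputs):

```python
import numpy as np

def masked_attention_weights(scores, dummy_slots=1):
    """scores: [q_len, kv_len] attention logits, where the first
    `dummy_slots` KV positions are placeholder cache entries.
    Masking them with -inf makes softmax give them zero weight."""
    masked = scores.copy()
    masked[:, :dummy_slots] = -np.inf
    masked -= masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)                            # exp(-inf) -> 0 at dummy slots
    return w / w.sum(axis=-1, keepdims=True)

scores = np.random.randn(1, 5)  # 1 query, 5 KV slots (slot 0 is the dummy)
w = masked_attention_weights(scores)
```

Because the dummy position contributes exactly zero probability mass, the extra slot never leaks into the output; it exists only to keep the dynamic cache dimension nonzero.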

## Conversion Details

Converted with `coremltools 9.0` and PyTorch 2.10.0. Conversion scripts are available in the source repository.

```
compute_precision: FLOAT32 (decoder), FLOAT16 (encoder)
minimum_deployment_target: macOS14
quantization: linear_symmetric INT8 per-channel (mixed-precision for decoder)
```

## License

Apache 2.0 (same as the base model)
qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:17daab141527f827aaa4f70723f6cd7505a57d0a48489538f9e6b37e74440488
size 652295
qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7c7d30e1d6dbfb198aef2b5538b639367e2e82bfaba491235b27c124725b4b9f
size 2959247104
qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/Manifest.json ADDED
{
  "fileFormatVersion": "1.0.0",
  "itemInfoEntries": {
    "83EA2015-A461-4646-84E7-8CE8482C4DBD": {
      "author": "com.apple.CoreML",
      "description": "CoreML Model Weights",
      "name": "weights",
      "path": "com.apple.CoreML/weights"
    },
    "CBD34A99-4072-43F9-845E-39103D8364CC": {
      "author": "com.apple.CoreML",
      "description": "CoreML Model Specification",
      "name": "model.mlmodel",
      "path": "com.apple.CoreML/model.mlmodel"
    }
  },
  "rootModelIdentifier": "CBD34A99-4072-43F9-845E-39103D8364CC"
}
qwen3_asr_embeddings.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1489075dbd08fcd6b87bc69c5feca278014d29fa5693bd4aa61e47c1a3a160d4
size 622329856
qwen3_asr_encoder_int8.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:882f5f24595e505a492d374351770adf89d26bf23e52bf3d2742f01d870004ca
size 268054
qwen3_asr_encoder_int8.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c0c05f49b0c09d9ebfb56385ee8cd41470ac3a0db4424910aae04bab3f7a5281
size 318315520
qwen3_asr_encoder_int8.mlpackage/Manifest.json ADDED
{
  "fileFormatVersion": "1.0.0",
  "itemInfoEntries": {
    "E1765CD1-7BCA-4A63-AB86-9A5DA4B938A2": {
      "author": "com.apple.CoreML",
      "description": "CoreML Model Specification",
      "name": "model.mlmodel",
      "path": "com.apple.CoreML/model.mlmodel"
    },
    "F5A8FC25-8D4B-4B69-8A99-ACEF5614579A": {
      "author": "com.apple.CoreML",
      "description": "CoreML Model Weights",
      "name": "weights",
      "path": "com.apple.CoreML/weights"
    }
  },
  "rootModelIdentifier": "E1765CD1-7BCA-4A63-AB86-9A5DA4B938A2"
}