---
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
tags:
- coreml
- asr
- speech-to-text
- qwen3
- on-device
- apple
- macos
language:
- zh
- en
- ja
- ko
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: coremltools
---

# Qwen3-ASR-1.7B CoreML

CoreML conversion of [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) for on-device speech-to-text on macOS/iOS.

Optimized with **FP32 compute + ANEMLL RMSNorm + mixed-precision INT8** to prevent decoder overflow while keeping the model size small.

## Files

| File | Description | Size |
|------|-------------|------|
| `qwen3_asr_encoder_int8.mlpackage/` | Audio encoder (Conv2D + 24-layer Transformer + projection). FP16 compute, INT8 per-channel weights. | 304 MB |
| `qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/` | Decoder (28-layer causal LLM + LM head). FP32 compute, ANEMLL RMSNorm, mixed-precision INT8. | 2.8 GB |
| `qwen3_asr_embeddings.bin` | Token embedding table (FP16, shape 151936 x 2048). Loaded separately to avoid quantizing embeddings. | 594 MB |
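
Since the embedding table ships as a raw FP16 binary rather than part of an `.mlpackage`, it can be memory-mapped directly with NumPy. A minimal sketch, assuming the row-major `[151936, 2048]` layout listed above (the helper names are ours, not from the repository):

```python
import numpy as np

def load_embeddings(path, vocab_size=151936, hidden_size=2048):
    """Memory-map the raw FP16 embedding table without reading it all into RAM."""
    return np.memmap(path, dtype=np.float16, mode="r",
                     shape=(vocab_size, hidden_size))

def embed_tokens(table, token_ids):
    """Gather embedding rows and upcast to FP32 for the decoder input."""
    return np.asarray(table[np.asarray(token_ids)], dtype=np.float32)
```

Memory-mapping keeps the 594 MB table out of process memory until rows are actually touched.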

## Architecture

```
Audio (16kHz) → Mel Spectrogram (128 bins)
  → Encoder (100-frame chunks → 13 tokens each)
  → Embedding lookup + prompt construction
  → Decoder (autoregressive, RoPE as inputs, KV cache)
  → Token IDs → Text
```

### Encoder

- 3x Conv2D (stride 2) + 24-layer Transformer + 2-layer projection
- Input: `[1, 1, 128, 100]` mel chunk → Output: `[1, 13, 2048]` audio features
- Bidirectional attention, sinusoidal positional embeddings (reset per chunk)
- INT8 per-channel quantization (first conv layer kept at FP16)
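
The 100-frame chunking can be sketched as follows. The chunk size and `[1, 1, 128, 100]` input shape come from the spec above; the helper name and zero-padding of the final chunk are our assumptions:

```python
import numpy as np

CHUNK_FRAMES = 100  # mel frames per encoder call

def chunk_mel(mel):
    """Split a [128, T] mel spectrogram into [1, 1, 128, 100] chunks,
    zero-padding the tail chunk to a full 100 frames."""
    n_mels, total = mel.shape
    chunks = []
    for start in range(0, total, CHUNK_FRAMES):
        piece = mel[:, start:start + CHUNK_FRAMES]
        pad = CHUNK_FRAMES - piece.shape[1]
        if pad:
            piece = np.pad(piece, ((0, 0), (0, pad)))
        chunks.append(piece[None, None].astype(np.float32))
    return chunks

# Each chunk yields [1, 13, 2048] features; concatenate along the token axis:
# features = np.concatenate([encoder(c) for c in chunk_mel(mel)], axis=1)
```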

### Decoder

- 28-layer causal Transformer with GQA (16 Q heads, 8 KV heads, head_dim=128)
- SwiGLU MLP (intermediate_size=11008)
- RoPE theta=1,000,000 (NeoX/split-half style, passed as cos/sin inputs)
- ANEMLL-style RMSNorm: `LayerNorm([x, -x])` trick for numerical stability
- Mixed-precision INT8: first/last layers, norms, and LM head kept at FP16
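
The `LayerNorm([x, -x])` identity is easy to verify numerically: concatenating `x` with `-x` yields a zero-mean vector whose variance equals `mean(x^2)`, so a standard layer norm over the concatenation reproduces RMSNorm on the first half while using the hardware-native `layer_norm` op. A NumPy sketch (no learned scale):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Plain RMSNorm, for reference."""
    return x / np.sqrt(np.mean(x * x) + eps)

def anemll_rmsnorm(x, eps=1e-6):
    """RMSNorm expressed as LayerNorm over [x, -x]: the concatenation has
    zero mean and variance mean(x^2), so layer-norming it and keeping the
    first half gives the same result."""
    y = np.concatenate([x, -x])
    y = (y - y.mean()) / np.sqrt(y.var() + eps)
    return y[: x.shape[0]]
```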

### Key Quality Improvements

1. **FP32 compute for the decoder.** Hidden-state magnitudes reach ~10,876 at layer 25; squaring them in RMSNorm gives ~118M, far beyond the FP16 maximum of 65,504, so FP32 is mandatory.
2. **ANEMLL RMSNorm.** Uses the native `layer_norm` op via `[x, -x]` concatenation for GPU/ANE precision.
3. **Mixed-precision INT8.** Skips 72 sensitive ops (embeddings, first/last layers, all norms, LM head).
4. **RoPE as inputs.** Precomputed sin/cos tables are passed in to avoid in-model precision loss.
5. **Embeddings extracted.** The 151,936-token embedding table is stored as a raw FP16 binary, not quantized.
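
"RoPE as inputs" means the host precomputes the cos/sin tables and feeds them to the model. A sketch of that precomputation, assuming the NeoX/split-half layout and theta from the decoder spec above (the exact tensor shapes the converted model expects are an assumption):

```python
import numpy as np

def rope_tables(positions, head_dim=128, theta=1_000_000.0):
    """Precompute [T, head_dim] cos/sin tables in split-half (NeoX) layout."""
    inv_freq = theta ** (-np.arange(0, head_dim, 2) / head_dim)  # [head_dim/2]
    angles = np.outer(positions, inv_freq)                       # [T, head_dim/2]
    cos = np.tile(np.cos(angles), (1, 2)).astype(np.float32)     # [T, head_dim]
    sin = np.tile(np.sin(angles), (1, 2)).astype(np.float32)
    return cos, sin

def apply_rope(x, cos, sin):
    """Split-half rotation: rotate_half(x) = concat(-x2, x1)."""
    half = x.shape[-1] // 2
    rotated = np.concatenate([-x[..., half:], x[..., :half]], axis=-1)
    return x * cos + rotated * sin
```

Doing this in FP64 on the host and handing the model FP32 tables sidesteps any in-graph trig precision loss.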

## Quality Benchmarks

### English (JFK speech, 11s)

| Pipeline | Transcription | Match |
|----------|---------------|-------|
| PyTorch FP32 | "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." | Reference |
| CoreML INT8 | Same text (minor punctuation difference: "." vs ";") | **MATCH** |

### Taiwan Mandarin (Taiwan-Tongues test set, 20 samples, 60s total)

| Metric | Value |
|--------|-------|
| PyTorch vs CoreML match rate | **19/20 (95%)** |
| Average CER (PyTorch) | 0.2009 |
| Average CER (CoreML INT8) | 0.2064 |
| CER difference | **+0.0056** |

## Usage Notes

- **macOS 14+** required (minimum deployment target)
- The decoder runs on **GPU only** (FP32 compute prevents ANE execution)
- The KV cache must be initialized with **1 dummy slot** (CoreML crashes on size-0 dynamic dimensions); mask the dummy position with `-inf`
- The encoder processes mel in **100-frame chunks**: split the mel, run the encoder per chunk, and concatenate the outputs
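
The dummy-slot masking can be illustrated with a toy single-head attention; the function and variable names are ours and not the model's actual I/O. Invalid slots receive an additive `-inf` before the softmax, so the dummy entry contributes zero weight regardless of its contents:

```python
import numpy as np

def attend(q, k, v, valid):
    """Single-head attention over a KV cache whose slot 0 is a dummy entry.

    `valid` is a boolean mask over cache slots; masked slots get -inf so
    their softmax weight is exactly zero.
    """
    scores = (q @ k.T) / np.sqrt(q.shape[-1])          # [T_q, T_kv]
    scores = np.where(valid[None, :], scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```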

## Conversion Details

Converted using `coremltools` 9.0 with PyTorch 2.10.0. Conversion scripts are available at the source repository.

```
compute_precision: FLOAT32 (decoder), FLOAT16 (encoder)
minimum_deployment_target: macOS14
quantization: linear_symmetric INT8 per-channel (mixed-precision for decoder)
```

## License

Apache 2.0 (same as the base model)