aoiandroid committed on
Commit 3fa6650 · verified · 1 parent: 597fc62

Clone from weiren119/Qwen3-ASR-1.7B-CoreML
README.md ADDED
---
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
tags:
- coreml
- asr
- speech-to-text
- qwen3
- on-device
- apple
- macos
language:
- zh
- en
- ja
- ko
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: coremltools
---

# Qwen3-ASR-1.7B CoreML

CoreML conversion of [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) for on-device speech-to-text on macOS/iOS.

Optimized with **FP32 compute + ANEMLL RMSNorm + mixed-precision INT8** to prevent decoder overflow while keeping the model size small.

## Files

| File | Description | Size |
|------|-------------|------|
| `qwen3_asr_encoder_int8.mlpackage/` | Audio encoder (Conv2D + 24-layer Transformer + projection). FP16 compute, INT8 per-channel weights. | 304 MB |
| `qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/` | Decoder (28-layer causal LLM + LM head). FP32 compute, ANEMLL RMSNorm, mixed-precision INT8. | 2.8 GB |
| `qwen3_asr_embeddings.bin` | Token embedding table (FP16, shape 151936 x 2048). Loaded separately to avoid quantizing the embeddings. | 594 MB |
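Because `qwen3_asr_embeddings.bin` is a raw FP16 dump (151,936 × 2,048 × 2 bytes = 622,329,856 bytes, matching the LFS pointer size), it can be memory-mapped and indexed directly instead of being loaded through CoreML. A minimal sketch with NumPy — the tiny synthetic table here stands in for the real file, and the loading convention (row-major, no header) is an assumption:

```python
import numpy as np

VOCAB, DIM = 151936, 2048  # real table: 151936 * 2048 * 2 bytes = 622,329,856

def load_embeddings(path, vocab=VOCAB, dim=DIM):
    """Memory-map the raw FP16 embedding table without copying it into RAM."""
    return np.memmap(path, dtype=np.float16, mode="r", shape=(vocab, dim))

def embed(table, token_ids):
    """Gather rows for a list of token ids -> [len(ids), dim] FP16 matrix."""
    return np.asarray(table[token_ids])

# Demo with a small synthetic table (same layout, smaller shape):
demo = np.arange(8 * 4, dtype=np.float16).reshape(8, 4)
demo.tofile("demo_embeddings.bin")
table = load_embeddings("demo_embeddings.bin", vocab=8, dim=4)
vecs = embed(table, [0, 3, 7])
```

`np.memmap` keeps the 594 MB table out of resident memory until rows are actually touched, which is why extracting the embeddings from the CoreML package is cheap at inference time.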

## Architecture

```
Audio (16kHz) → Mel Spectrogram (128 bins)
→ Encoder (100-frame chunks → 13 tokens each)
→ Embedding lookup + prompt construction
→ Decoder (autoregressive, RoPE as inputs, KV cache)
→ Token IDs → Text
```

### Encoder

- 3x Conv2D (stride 2) + 24-layer Transformer + 2-layer projection
- Input: `[1, 1, 128, 100]` mel chunk → Output: `[1, 13, 2048]` audio features
- Bidirectional attention, sinusoidal positional embeddings (reset per chunk)
- INT8 per-channel quantization (first conv layer kept at FP16)
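The chunked encoder pass above can be sketched as a simple loop: split the mel spectrogram along time, pad the final partial chunk, run the encoder once per chunk, and concatenate the 13-token outputs. The `encode_chunk` stub below stands in for the CoreML encoder call, and zero-padding the tail is an assumption about how partial chunks are handled:

```python
import numpy as np

CHUNK = 100            # mel frames per encoder call
TOKENS_PER_CHUNK = 13  # audio tokens produced per chunk
DIM = 2048

def encode_chunk(mel_chunk):
    """Stand-in for the CoreML encoder: [1, 1, 128, 100] -> [1, 13, 2048]."""
    assert mel_chunk.shape == (1, 1, 128, CHUNK)
    return np.zeros((1, TOKENS_PER_CHUNK, DIM), dtype=np.float32)

def encode(mel):
    """Split mel [1, 1, 128, T] into 100-frame chunks, zero-pad the tail,
    run the encoder per chunk, and concatenate along the token axis."""
    T = mel.shape[-1]
    outs = []
    for start in range(0, T, CHUNK):
        chunk = mel[..., start:start + CHUNK]
        if chunk.shape[-1] < CHUNK:  # pad the final partial chunk
            pad = CHUNK - chunk.shape[-1]
            chunk = np.pad(chunk, ((0, 0), (0, 0), (0, 0), (0, pad)))
        outs.append(encode_chunk(chunk))
    return np.concatenate(outs, axis=1)  # [1, 13 * n_chunks, 2048]

features = encode(np.random.randn(1, 1, 128, 250).astype(np.float32))
```

With 250 mel frames this runs three encoder calls and yields a `[1, 39, 2048]` feature tensor, consistent with the fixed `[1, 1, 128, 100]` → `[1, 13, 2048]` encoder signature.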

### Decoder

- 28-layer causal Transformer with GQA (16 Q heads, 8 KV heads, head_dim=128)
- SwiGLU MLP (intermediate_size=11008)
- RoPE theta=1,000,000 (NeoX/split-half style, passed as cos/sin inputs)
- ANEMLL-style RMSNorm: `LayerNorm([x, -x])` trick for numerical stability
- Mixed-precision INT8: first/last layer, norms, and LM head kept at FP16
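The `LayerNorm([x, -x])` identity behind the ANEMLL-style RMSNorm is easy to verify: concatenating `x` with `-x` forces the mean to zero, so LayerNorm's variance reduces to `mean(x²)` and the first half of its output equals RMSNorm. A NumPy check (learnable weights omitted for clarity; inside the model this maps onto CoreML's native `layer_norm` op):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Reference RMSNorm (no learnable weight)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def layer_norm(x, eps=1e-6):
    """Plain LayerNorm (no affine), the op CoreML provides natively."""
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.mean((x - mu) ** 2, axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def anemll_rms_norm(x, eps=1e-6):
    """RMSNorm via LayerNorm over [x, -x]: the concatenation has zero mean,
    so its variance is exactly mean(x**2)."""
    doubled = np.concatenate([x, -x], axis=-1)
    return layer_norm(doubled, eps)[..., : x.shape[-1]]

x = np.random.randn(2, 8)
```

Routing RMSNorm through the hardware `layer_norm` kernel is what buys the GPU/ANE precision mentioned under Key Quality Improvements, at the cost of doubling the normalized width.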

### Key Quality Improvements

1. **FP32 compute for decoder** — Hidden-state values reach ~10,876 at layer 25, and squaring them (≈1.2 × 10⁸) far exceeds the FP16 maximum of 65,504, so FP32 is mandatory.
2. **ANEMLL RMSNorm** — Uses the native `layer_norm` op via `[x, -x]` concatenation for GPU/ANE precision.
3. **Mixed-precision INT8** — Skips 72 sensitive ops (embeddings, first/last layer, all norms, LM head).
4. **RoPE as inputs** — Precomputed sin/cos are passed in to avoid in-model precision loss.
5. **Embeddings extracted** — The 151,936-token embedding table is stored as a raw FP16 binary, not quantized.
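Precomputing RoPE on the host is straightforward: with theta = 1,000,000 and the NeoX/split-half layout, the cos/sin tables depend only on position and head_dim, so they can be built once per step and fed to the decoder as inputs. A sketch of the tables and the split-half rotation, following the usual Qwen/NeoX convention (the exact input names expected by the CoreML decoder are not shown here):

```python
import numpy as np

def rope_tables(positions, head_dim=128, theta=1_000_000.0):
    """cos/sin tables of shape [len(positions), head_dim], NeoX split-half
    layout: head_dim/2 frequencies, tiled across both halves."""
    inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)
    freqs = np.outer(positions, inv_freq)          # [T, head_dim/2]
    emb = np.concatenate([freqs, freqs], axis=-1)  # [T, head_dim]
    return np.cos(emb), np.sin(emb)

def rotate_half(x):
    """NeoX split-half rotation: (x1, x2) -> (-x2, x1)."""
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def apply_rope(x, cos, sin):
    """Rotate query/key vectors by the precomputed angles."""
    return x * cos + rotate_half(x) * sin

cos, sin = rope_tables(np.arange(4), head_dim=8)
q = np.random.randn(4, 8)
q_rot = apply_rope(q, cos, sin)
```

Since the rotation is a pure function of position, computing it in FP64/FP32 on the host and passing the results in sidesteps any precision loss that an in-graph `pow`/`sin`/`cos` chain would incur.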

## Quality Benchmarks

### English (JFK speech, 11s)

| Pipeline | Transcription | Match |
|----------|---------------|-------|
| PyTorch FP32 | "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." | Reference |
| CoreML INT8 | Same text (minor punctuation difference: "." vs ";") | **MATCH** |

### Taiwan Mandarin (Taiwan-Tongues test set, 20 samples, 60s total)

| Metric | Value |
|--------|-------|
| PyTorch vs CoreML match rate | **19/20 (95%)** |
| Average CER (PyTorch) | 0.2009 |
| Average CER (CoreML INT8) | 0.2064 |
| CER difference | **+0.0056** |
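For reference, CER as used above is the character-level edit distance divided by the reference length. A minimal Levenshtein-based implementation — illustrative only, not necessarily the exact scorer behind the table:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit_distance(ref, hyp) / len(ref)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # DP row: distance from ref[:0] to hyp[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)

score = cer("你好世界", "你好实界")  # one substitution over four characters
```

A +0.0056 CER gap on a ~0.20 baseline corresponds to roughly one extra character error per 180 reference characters, which is consistent with the 19/20 exact-match rate.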

## Usage Notes

- **macOS 14+** required (minimum deployment target)
- Decoder runs on **GPU only** (FP32 compute prevents ANE execution)
- KV cache must be initialized with **1 dummy slot** (CoreML crashes on size-0 dynamic dims); mask the dummy position with `-inf`
- Encoder processes mel in **100-frame chunks** — split the mel, run the encoder per chunk, and concatenate the outputs
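The dummy-slot workaround amounts to allocating the cache with one extra KV position and writing `-inf` into the attention mask at that position, so softmax assigns it exactly zero weight. A NumPy illustration of the masking step (shapes simplified; the real cache and mask live inside the CoreML decoder's inputs):

```python
import numpy as np

def masked_attention_weights(scores, dummy_slots=1):
    """scores: [q_len, kv_len] attention logits, where the first
    `dummy_slots` KV positions are placeholder cache entries.
    Masking them with -inf makes softmax give them zero weight."""
    masked = scores.copy()
    masked[:, :dummy_slots] = -np.inf
    masked -= masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)                            # exp(-inf) -> 0 at dummy slots
    return w / w.sum(axis=-1, keepdims=True)

scores = np.random.randn(1, 5)  # 1 query, 5 KV slots (slot 0 is the dummy)
w = masked_attention_weights(scores)
```

Because the dummy position contributes exactly zero probability mass, the extra slot never leaks into the output; it exists only to keep the dynamic cache dimension nonzero.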

## Conversion Details

Converted with `coremltools 9.0` and PyTorch 2.10.0. Conversion scripts are available in the source repository.

```
compute_precision: FLOAT32 (decoder), FLOAT16 (encoder)
minimum_deployment_target: macOS14
quantization: linear_symmetric INT8 per-channel (mixed-precision for decoder)
```

## License

Apache 2.0 (same as the base model)
qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:17daab141527f827aaa4f70723f6cd7505a57d0a48489538f9e6b37e74440488
size 652295
qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7c7d30e1d6dbfb198aef2b5538b639367e2e82bfaba491235b27c124725b4b9f
size 2959247104
qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/Manifest.json ADDED
{
  "fileFormatVersion": "1.0.0",
  "itemInfoEntries": {
    "83EA2015-A461-4646-84E7-8CE8482C4DBD": {
      "author": "com.apple.CoreML",
      "description": "CoreML Model Weights",
      "name": "weights",
      "path": "com.apple.CoreML/weights"
    },
    "CBD34A99-4072-43F9-845E-39103D8364CC": {
      "author": "com.apple.CoreML",
      "description": "CoreML Model Specification",
      "name": "model.mlmodel",
      "path": "com.apple.CoreML/model.mlmodel"
    }
  },
  "rootModelIdentifier": "CBD34A99-4072-43F9-845E-39103D8364CC"
}
qwen3_asr_embeddings.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1489075dbd08fcd6b87bc69c5feca278014d29fa5693bd4aa61e47c1a3a160d4
size 622329856
qwen3_asr_encoder_int8.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:882f5f24595e505a492d374351770adf89d26bf23e52bf3d2742f01d870004ca
size 268054
qwen3_asr_encoder_int8.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c0c05f49b0c09d9ebfb56385ee8cd41470ac3a0db4424910aae04bab3f7a5281
size 318315520
qwen3_asr_encoder_int8.mlpackage/Manifest.json ADDED
{
  "fileFormatVersion": "1.0.0",
  "itemInfoEntries": {
    "E1765CD1-7BCA-4A63-AB86-9A5DA4B938A2": {
      "author": "com.apple.CoreML",
      "description": "CoreML Model Specification",
      "name": "model.mlmodel",
      "path": "com.apple.CoreML/model.mlmodel"
    },
    "F5A8FC25-8D4B-4B69-8A99-ACEF5614579A": {
      "author": "com.apple.CoreML",
      "description": "CoreML Model Weights",
      "name": "weights",
      "path": "com.apple.CoreML/weights"
    }
  },
  "rootModelIdentifier": "E1765CD1-7BCA-4A63-AB86-9A5DA4B938A2"
}