---
license: apache-2.0
base_model: Qwen/Qwen3-ASR-1.7B
tags:
- coreml
- asr
- speech-to-text
- qwen3
- on-device
- apple
- macos
language:
- zh
- en
- ja
- ko
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: coremltools
---

# Qwen3-ASR-1.7B CoreML

CoreML conversion of [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) for on-device speech-to-text on macOS/iOS.

Optimized with **FP32 compute + ANEMLL RMSNorm + mixed-precision INT8** to prevent decoder overflow while keeping model size small.

## Files

| File | Description | Size |
|------|-------------|------|
| `qwen3_asr_encoder_int8.mlpackage/` | Audio encoder (Conv2D + 24-layer Transformer + projection). FP16 compute, INT8 per-channel weights. | 304 MB |
| `qwen3_asr_decoder_f32_anemll_int8-mixed.mlpackage/` | Decoder (28-layer causal LLM + LM head). FP32 compute, ANEMLL RMSNorm, mixed-precision INT8. | 2.8 GB |
| `qwen3_asr_embeddings.bin` | Token embedding table (FP16, shape 151936 x 2048). Loaded separately to avoid quantizing embeddings. | 594 MB |

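Because the embedding table ships as a raw FP16 binary, it can be memory-mapped rather than loaded whole. A minimal sketch, assuming numpy; the helper names are illustrative, not part of this repo's API:

```python
import numpy as np

def load_embeddings(path, vocab_size=151936, hidden_size=2048):
    """Memory-map the raw FP16 embedding table (qwen3_asr_embeddings.bin)
    without pulling all 594 MB into RAM."""
    return np.memmap(path, dtype=np.float16, mode="r",
                     shape=(vocab_size, hidden_size))

def embed_tokens(table, token_ids):
    """Look up embeddings for a token-ID sequence -> [1, seq_len, hidden]."""
    return np.asarray(table[token_ids], dtype=np.float16)[None, ...]
```

The lookup result can then be concatenated with the encoder's audio features when constructing the decoder prompt.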
## Architecture

```
Audio (16kHz) → Mel Spectrogram (128 bins)
  → Encoder (100-frame chunks → 13 tokens each)
  → Embedding lookup + prompt construction
  → Decoder (autoregressive, RoPE as inputs, KV cache)
  → Token IDs → Text
```

### Encoder

- 3x Conv2D (stride 2) + 24-layer Transformer + 2-layer projection
- Input: `[1, 1, 128, 100]` mel chunk → Output: `[1, 13, 2048]` audio features
- Bidirectional attention, sinusoidal positional embeddings (reset per chunk)
- INT8 per-channel quantization (first conv layer kept at FP16)

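The fixed 100-frame input means the caller owns the chunking. A minimal sketch of the array plumbing, assuming numpy; `run_encoder_chunk` stands in for the actual CoreML prediction call and is hypothetical:

```python
import numpy as np

CHUNK_FRAMES = 100  # fixed encoder input width; each chunk yields 13 audio tokens

def split_mel(mel):
    """Split a [128, T] mel spectrogram into [1, 1, 128, 100] chunks,
    zero-padding the last chunk to a full 100 frames."""
    chunks = []
    for start in range(0, mel.shape[1], CHUNK_FRAMES):
        chunk = mel[:, start:start + CHUNK_FRAMES]
        if chunk.shape[1] < CHUNK_FRAMES:
            chunk = np.pad(chunk, ((0, 0), (0, CHUNK_FRAMES - chunk.shape[1])))
        chunks.append(chunk[None, None, :, :])  # -> [1, 1, 128, 100]
    return chunks

def encode(mel, run_encoder_chunk):
    """Run the encoder per chunk and concatenate [1, 13, 2048] outputs
    along the token axis."""
    outs = [run_encoder_chunk(c) for c in split_mel(mel)]
    return np.concatenate(outs, axis=1)  # -> [1, 13 * n_chunks, 2048]
```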
### Decoder

- 28-layer causal Transformer with GQA (16 Q heads, 8 KV heads, head_dim=128)
- SwiGLU MLP (intermediate_size=11008)
- RoPE theta=1,000,000 (NeoX/split-half style, passed as cos/sin inputs)
- ANEMLL-style RMSNorm: `LayerNorm([x, -x])` trick for numerical stability
- Mixed-precision INT8: first/last layer, norms, LM head kept at FP16

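The `LayerNorm([x, -x])` trick works because concatenating `x` with `-x` forces the mean to exactly zero, making the LayerNorm variance equal the mean square of `x`; the first half of the output is then exactly RMSNorm(x). A minimal numpy check (function names are illustrative):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def anemll_rmsnorm(x, eps=1e-6):
    """RMSNorm expressed as LayerNorm over [x, -x]: the concatenation has
    mean 0, and its variance equals mean(x^2), so LayerNorm reduces to
    dividing by sqrt(mean(x^2) + eps)."""
    cat = np.concatenate([x, -x], axis=-1)
    mu = cat.mean(axis=-1, keepdims=True)                  # exactly 0
    var = ((cat - mu) ** 2).mean(axis=-1, keepdims=True)   # = mean(x^2)
    return ((cat - mu) / np.sqrt(var + eps))[..., : x.shape[-1]]

x = np.random.randn(2, 8)
assert np.allclose(rmsnorm(x), anemll_rmsnorm(x))
```

On CoreML this matters because `layer_norm` has a native, numerically hardened kernel on GPU/ANE, whereas a hand-rolled RMSNorm decomposes into separate ops.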
### Key Quality Improvements

1. **FP32 compute for decoder** → Hidden states reach 10,876 at layer 25; x^2 = 118M >> FP16 max 65,504. FP32 is mandatory.
2. **ANEMLL RMSNorm** → Uses native `layer_norm` op via `[x, -x]` concatenation for GPU/ANE precision.
3. **Mixed-precision INT8** → Skips 72 sensitive ops (embeddings, first/last layer, all norms, LM head).
4. **RoPE as inputs** → Precomputed sin/cos passed in to avoid in-model precision loss.
5. **Embeddings extracted** → 151,936-token embedding table stored as raw FP16 binary, not quantized.

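The overflow claim in point 1 is easy to reproduce: squaring a hidden-state magnitude of ~10,876 lands far beyond the FP16 maximum of 65,504, so any mean-square (as computed inside RMSNorm) goes to infinity in FP16. A quick numpy illustration:

```python
import numpy as np

h = np.float16(10876.0)     # hidden-state magnitude observed at layer 25
sq16 = h * h                # FP16 square: ~1.18e8 >> 65504, overflows to inf
sq32 = np.float32(h) ** 2   # FP32 square: finite, no overflow

assert np.isinf(sq16)
assert np.isfinite(sq32)
```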
## Quality Benchmarks

### English (JFK speech, 11s)

| Pipeline | Transcription | Match |
|----------|---------------|-------|
| PyTorch FP32 | "And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." | Reference |
| CoreML INT8 | Same text (minor punctuation difference: "." vs ";") | **MATCH** |

### Taiwan Mandarin (Taiwan-Tongues test set, 20 samples, 60s total)

| Metric | Value |
|--------|-------|
| PyTorch vs CoreML match rate | **19/20 (95%)** |
| Average CER (PyTorch) | 0.2009 |
| Average CER (CoreML INT8) | 0.2064 |
| CER difference | **+0.0056** |

## Usage Notes

- **macOS 14+** required (minimum deployment target)
- Decoder runs on **GPU only** (FP32 compute prevents ANE execution)
- KV cache must be initialized with **1 dummy slot** (CoreML crashes on size-0 dynamic dims); mask the dummy position with `-inf`
- Encoder processes mel in **100-frame chunks** → split mel, run encoder per chunk, concatenate outputs

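The dummy-slot workaround can be sketched as follows: allocate the cache with one extra leading position and pin that position to `-inf` in the attention mask so softmax assigns it zero weight. Shapes and names below are illustrative, derived from the decoder specs above, not the model's actual I/O signature:

```python
import numpy as np

N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 8, 128  # from the Decoder section

def init_kv_cache_and_mask(max_len):
    """Allocate a KV cache with 1 dummy slot at position 0 (CoreML rejects
    size-0 dynamic dims) plus an additive mask hiding that slot with -inf."""
    cache_len = max_len + 1  # slot 0 is the dummy
    k = np.zeros((N_LAYERS, 1, N_KV_HEADS, cache_len, HEAD_DIM), np.float32)
    v = np.zeros_like(k)
    mask = np.zeros((1, 1, 1, cache_len), np.float32)
    mask[..., 0] = -np.inf   # dummy position is never attended to
    return k, v, mask

# Adding the mask to attention scores before softmax zeroes out the dummy slot:
k, v, mask = init_kv_cache_and_mask(max_len=512)
scores = np.zeros((1, 1, 1, 513)) + mask
probs = np.exp(scores - scores.max())
probs /= probs.sum()
assert probs[0, 0, 0, 0] == 0.0
```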
## Conversion Details

Converted using `coremltools 9.0` with PyTorch 2.10.0. Conversion scripts are available at the source repository.

```
compute_precision: FLOAT32 (decoder), FLOAT16 (encoder)
minimum_deployment_target: macOS14
quantization: linear_symmetric INT8 per-channel (mixed-precision for decoder)
```

## License

Apache 2.0 (same as base model)