# Canary-Qwen 2.5B – Static CoreML Decoder (FP32)
Static-shape prefill/decode CoreML models for the decoder component of nvidia/canary-qwen-2.5b, targeting Apple Silicon (ANE/GPU) inference in dictation apps.
## How this differs from the dynamic-prompt versions
| | coreml-int8 (dynamic, shipped) | coreml-static (this repo) |
|---|---|---|
| Decoder graph | Flexible query length via RangeDim | Fixed-shape prefill (164 tokens) + single-token decode |
| Prompt handling | Dynamic prompt length, CoreML specializes at runtime | Baked-in for Muesli/Swift prompt shape (33 text + 126 audio + 5 text = 164) |
| Cache update | Slice assignment with dynamic end-step | One-hot mask scatter (CoreML-friendly, no dynamic indexing) |
| Quantization | INT8 (W8A8) | FP32 (not yet quantized) |
| Cold start | Slower – CoreML must specialize the flexible graph on first run | Faster – shapes are fixed, no specialization needed |
| Use case | General-purpose, any prompt shape | Dictation-optimized for a known prompt/audio regime |
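For reference, the one-hot mask scatter used for the static cache update can be sketched in NumPy. The `scatter_update` helper and the small shapes below are illustrative, not part of the shipped graphs; the idea is that a fixed-size elementwise blend replaces dynamic slice indexing, which static CoreML graphs handle poorly:

```python
import numpy as np

def scatter_update(cache, new_kv, pos, max_len=256):
    """Write new_kv into cache slot `pos` via a one-hot mask.

    cache:  [1, n_kv_heads, max_len, head_dim]  existing KV cache
    new_kv: [1, n_kv_heads, 1, head_dim]        this step's key or value
    No dynamic slice indices: only fixed-shape multiplies and adds.
    """
    mask = (np.arange(max_len) == pos).astype(cache.dtype)  # [max_len]
    mask = mask[None, None, :, None]                        # broadcastable
    return cache * (1.0 - mask) + new_kv * mask
```

Every tensor in the computation has a shape known at conversion time, which is what makes this pattern CoreML-friendly.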
## Why static graphs?
The flexible RangeDim decoder caused long CoreML cold-start specialization
delays and earlier lowering failures. Static prefill+decode graphs are much more
CoreML-friendly for a fixed dictation prompt/audio regime. The trade-off is that
if the Swift prompt structure changes, the models need to be re-traced.
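A host app drives the two graphs as one prefill call followed by a per-token decode loop. The sketch below shows the bookkeeping only; `prefill_fn` and `decode_fn` are hypothetical stand-ins for the two CoreML model predictions, and the mask layout follows the decode-input spec later in this card:

```python
import numpy as np

MAX_LEN, PREFILL_LEN, HIDDEN = 256, 164, 2048

def generate(prefill_fn, decode_fn, prompt_hidden, num_new_tokens):
    # prefill_fn / decode_fn: stand-ins for the CoreML predictions.
    # prompt_hidden: the fixed 164-token embedding block.
    assert prompt_hidden.shape == (1, PREFILL_LEN, HIDDEN)
    hidden = prefill_fn(prompt_hidden)              # fills cache slots 0..163
    positions = []
    for step in range(num_new_tokens):
        pos = PREFILL_LEN + step                    # next cache slot to write
        update = np.zeros((1, MAX_LEN), np.float32)
        update[0, pos] = 1.0                        # one-hot write position
        valid = np.zeros((1, MAX_LEN), np.float32)
        valid[0, : pos + 1] = 1.0                   # attendable positions
        hidden = decode_fn(hidden, update, valid)   # [1, 1, HIDDEN] in/out
        positions.append(pos)
    return hidden, positions
```

Because `PREFILL_LEN` is baked into the graph, changing the Swift prompt structure means re-tracing rather than just changing this loop.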
## Contents

- `canary_prefill_static.mlpackage/` – Fixed-shape prefill (164 tokens)
- `canary_decode_static.mlpackage/` – Fixed-shape single-token decode with KV cache
## Architecture
- Base model: Qwen/Qwen3-1.7B (decoder only – encoder is separate)
- LoRA: nvidia/canary-qwen-2.5b LoRA merged at scale 256/128
- Layers: 28 transformer layers, hidden=2048, GQA with 8 KV heads, head_dim=128
- Max sequence length: 256
- Static prefill length: 164 (Muesli/Swift prompt shape)
- KV cache state dtype: float16
- LM head: separate artifact (not fused – fused path was slower in benchmarks)
- RoPE theta: 1,000,000
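The dimensions above pin down the total KV cache footprint; a quick back-of-the-envelope check:

```python
# 28 layers, 8 KV heads (GQA), head_dim 128, 256 positions,
# float16 cache entries, one K and one V tensor per layer.
layers, kv_heads, head_dim, max_len = 28, 8, 128, 256
bytes_per_elem = 2  # float16
kv_cache_bytes = layers * 2 * kv_heads * head_dim * max_len * bytes_per_elem
print(kv_cache_bytes / 2**20)  # 28.0 MiB for the full fixed-size cache
```

With static shapes this full 28 MiB is allocated regardless of how many positions are actually valid; the decode graph's valid mask handles the rest.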
## Prefill inputs

| Name | Shape | Type |
|---|---|---|
| `hidden_states` | [1, 164, 2048] | float32 |
## Decode inputs

| Name | Shape | Type |
|---|---|---|
| `hidden_states` | [1, 1, 2048] | float32 |
| `cache_update_mask` | [1, 256] | float32 (one-hot at current position) |
| `cache_valid_mask` | [1, 256] | float32 (1.0 for valid cached positions) |
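Building the two masks for a decode step at cache position `pos` is straightforward; the `decode_masks` helper name is illustrative, but the layouts match the table above:

```python
import numpy as np

MAX_LEN = 256

def decode_masks(pos):
    """Masks for one decode step writing cache position `pos` (0-based)."""
    update = np.zeros((1, MAX_LEN), dtype=np.float32)
    update[0, pos] = 1.0              # one-hot: where this step's KV is written
    valid = np.zeros((1, MAX_LEN), dtype=np.float32)
    valid[0, : pos + 1] = 1.0         # attend to all positions written so far
    return update, valid
```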
## Provenance

- Traced on Modal (CPU-only, torch 2.6.0) from `nvidia/canary-qwen-2.5b` safetensors
- Converted to CoreML on macOS with coremltools 9.0
- Trace run: `20260327-020004-static-prefill-164-sliceassign`
- Cache update uses slice-assignment pattern (earlier `.copy_()` pattern failed CoreML conversion)
## Related repos

- `phequals/canary-qwen-2.5b-coreml-int8` – shipped dynamic-prompt INT8 version
- `phequals/canary-qwen-2.5b-coreml-fp16` – earlier dynamic-prompt FP16 version