Canary-Qwen 2.5B – Static CoreML Decoder (FP32)

Static-shape prefill/decode CoreML models for the decoder component of nvidia/canary-qwen-2.5b, targeting Apple Silicon (ANE/GPU) inference in dictation apps.

How this differs from the dynamic-prompt versions

| | coreml-int8 (dynamic, shipped) | coreml-static (this repo) |
|---|---|---|
| Decoder graph | Flexible query length via RangeDim | Fixed-shape prefill (164 tokens) + single-token decode |
| Prompt handling | Dynamic prompt length; CoreML specializes at runtime | Baked-in Muesli/Swift prompt shape (33 text + 126 audio + 5 text = 164) |
| Cache update | Slice assignment with dynamic end-step | One-hot mask scatter (CoreML-friendly, no dynamic indexing) |
| Quantization | INT8 (W8A8) | FP32 (not yet quantized) |
| Cold start | Slower: CoreML must specialize the flexible graph on first run | Faster: shapes are fixed, no specialization needed |
| Use case | General-purpose, any prompt shape | Dictation-optimized for a known prompt/audio regime |
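The one-hot mask scatter named above can be sketched in NumPy (an illustration of the pattern, not the traced graph's exact ops; shapes follow the 8-KV-head, head_dim-128, max-seq-256 configuration described in this card):

```python
import numpy as np

MAX_SEQ, KV_HEADS, HEAD_DIM = 256, 8, 128

def scatter_update(cache, new_kv, update_mask):
    """Write new_kv into cache at the one-hot position in update_mask.

    cache:       [1, KV_HEADS, MAX_SEQ, HEAD_DIM]
    new_kv:      [1, KV_HEADS, 1, HEAD_DIM]
    update_mask: [1, MAX_SEQ], 1.0 at the current position, 0.0 elsewhere
    """
    m = update_mask.reshape(1, 1, MAX_SEQ, 1)   # broadcast over heads and dims
    return cache * (1.0 - m) + new_kv * m       # pure elementwise ops, no dynamic indexing

cache = np.zeros((1, KV_HEADS, MAX_SEQ, HEAD_DIM), dtype=np.float32)
new_kv = np.full((1, KV_HEADS, 1, HEAD_DIM), 7.0, dtype=np.float32)
mask = np.zeros((1, MAX_SEQ), dtype=np.float32)
mask[0, 164] = 1.0                              # first decode step after a 164-token prefill
cache = scatter_update(cache, new_kv, mask)
```

Because every shape here is fixed, the whole update lowers to static elementwise kernels instead of dynamic-index writes.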

Why static graphs?

The flexible RangeDim decoder incurred long CoreML cold-start specialization times and, in earlier revisions, lowering failures. Static prefill + decode graphs are far more CoreML-friendly for a fixed dictation prompt/audio regime. The trade-off: if the Swift prompt structure changes, the models must be re-traced.

Contents

canary_prefill_static.mlpackage/   – Fixed-shape prefill (164 tokens)
canary_decode_static.mlpackage/    – Fixed-shape single-token decode with KV cache

Architecture

  • Base model: Qwen/Qwen3-1.7B (decoder only – encoder is separate)
  • LoRA: nvidia/canary-qwen-2.5b LoRA merged at scale 256/128
  • Layers: 28 transformer layers, hidden=2048, GQA with 8 KV heads, head_dim=128
  • Max sequence length: 256
  • Static prefill length: 164 (Muesli/Swift prompt shape)
  • KV cache state dtype: float16
  • LM head: separate artifact (not fused – fused path was slower in benchmarks)
  • RoPE theta: 1,000,000
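From these numbers, the fixed-size float16 KV cache footprint works out as follows (my arithmetic from the figures above, not a number reported by the trace):

```python
# K and V tensors per layer, float16 (2 bytes), at the static max sequence length
layers, kv_heads, head_dim, max_seq = 28, 8, 128, 256
bytes_per_fp16 = 2
kv_bytes = layers * 2 * kv_heads * max_seq * head_dim * bytes_per_fp16
print(kv_bytes / 2**20, "MiB")  # 28.0 MiB
```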

Prefill inputs

| Name | Shape | Type |
|---|---|---|
| hidden_states | [1, 164, 2048] | float32 |
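A minimal prefill invocation sketch with coremltools. The model path and the exact output names are assumptions from this card's contents; the predict call is guarded so the snippet runs even where the .mlpackage is absent:

```python
import numpy as np

# 164 embedded prompt tokens (33 text + 126 audio + 5 text) from the upstream embedder
hidden_states = np.zeros((1, 164, 2048), dtype=np.float32)

try:
    import coremltools as ct
    prefill = ct.models.MLModel("canary_prefill_static.mlpackage")
    outputs = prefill.predict({"hidden_states": hidden_states})
except Exception:
    outputs = None  # coremltools or the model package not available
```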

Decode inputs

| Name | Shape | Type |
|---|---|---|
| hidden_states | [1, 1, 2048] | float32 |
| cache_update_mask | [1, 256] | float32 (one-hot at current position) |
| cache_valid_mask | [1, 256] | float32 (1.0 for valid cached positions) |
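Building the two mask inputs for a decode step at absolute position `pos` might look like this (a sketch; whether `cache_valid_mask` should also cover the current position is an assumption to verify against the traced graph):

```python
import numpy as np

MAX_SEQ = 256

def decode_masks(pos):
    """Masks for decoding the token at absolute position `pos`."""
    cache_update_mask = np.zeros((1, MAX_SEQ), dtype=np.float32)
    cache_update_mask[0, pos] = 1.0           # one-hot at the current position
    cache_valid_mask = np.zeros((1, MAX_SEQ), dtype=np.float32)
    cache_valid_mask[0, :pos] = 1.0           # positions already written to the cache
    return cache_update_mask, cache_valid_mask

update, valid = decode_masks(164)             # first decode step after the 164-token prefill
```

These go into the decode model's predict call alongside the [1, 1, 2048] hidden state for the current token.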

Provenance

  • Traced on Modal (CPU-only, torch 2.6.0) from nvidia/canary-qwen-2.5b safetensors
  • Converted to CoreML on macOS with coremltools 9.0
  • Trace run: 20260327-020004-static-prefill-164-sliceassign
  • Cache update uses slice-assignment pattern (earlier .copy_() pattern failed CoreML conversion)
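The slice-assignment pattern mentioned above, sketched in NumPy (the actual trace uses the torch equivalent; the functional copy here stands in for CoreML's out-of-place state update):

```python
import numpy as np

def cache_slice_assign(cache, new_kv, start):
    """Write new_kv into cache[:, :, start:start+L, :] via slice assignment,
    the pattern that converted cleanly where tensor.copy_() did not."""
    out = cache.copy()
    out[:, :, start:start + new_kv.shape[2], :] = new_kv
    return out

cache = np.zeros((1, 8, 256, 128), dtype=np.float16)
new_kv = np.ones((1, 8, 164, 128), dtype=np.float16)
cache = cache_slice_assign(cache, new_kv, 0)  # prefill writes positions 0..163
```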
