Canary-Qwen 2.5B – Static CoreML Decoder (FP32)

Static-shape prefill/decode CoreML models for the decoder component of nvidia/canary-qwen-2.5b, targeting Apple Silicon (ANE/GPU) inference in dictation apps.

How this differs from the dynamic-prompt versions

| | coreml-int8 (dynamic, shipped) | coreml-static (this repo) |
|---|---|---|
| Decoder graph | Flexible query length via RangeDim | Fixed-shape prefill (164 tokens) + single-token decode |
| Prompt handling | Dynamic prompt length; CoreML specializes at runtime | Baked-in Muesli/Swift prompt shape (33 text + 126 audio + 5 text = 164) |
| Cache update | Slice assignment with dynamic end-step | One-hot mask scatter (CoreML-friendly, no dynamic indexing) |
| Quantization | INT8 (W8A8) | FP32 (not yet quantized) |
| Cold start | Slower: CoreML must specialize the flexible graph on first run | Faster: shapes are fixed, no specialization needed |
| Use case | General-purpose, any prompt shape | Dictation-optimized for a known prompt/audio regime |
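The one-hot mask scatter named above can be sketched in NumPy (an illustration of the pattern, not the traced graph's exact ops; shapes follow the 8-KV-head, head_dim-128, max-seq-256 configuration described in this card):

```python
import numpy as np

MAX_SEQ, KV_HEADS, HEAD_DIM = 256, 8, 128

def scatter_update(cache, new_kv, update_mask):
    """Write new_kv into cache at the one-hot position in update_mask.

    cache:       [1, KV_HEADS, MAX_SEQ, HEAD_DIM]
    new_kv:      [1, KV_HEADS, 1, HEAD_DIM]
    update_mask: [1, MAX_SEQ], 1.0 at the current position, 0.0 elsewhere
    """
    m = update_mask.reshape(1, 1, MAX_SEQ, 1)   # broadcast over heads and dims
    return cache * (1.0 - m) + new_kv * m       # pure elementwise ops, no dynamic indexing

cache = np.zeros((1, KV_HEADS, MAX_SEQ, HEAD_DIM), dtype=np.float32)
new_kv = np.full((1, KV_HEADS, 1, HEAD_DIM), 7.0, dtype=np.float32)
mask = np.zeros((1, MAX_SEQ), dtype=np.float32)
mask[0, 164] = 1.0                              # first decode step after a 164-token prefill
cache = scatter_update(cache, new_kv, mask)
```

Because every shape here is fixed, the whole update lowers to static elementwise kernels instead of dynamic-index writes.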

Why static graphs?

The flexible RangeDim decoder incurred long CoreML cold-start specialization times and, in earlier revisions, lowering failures. Static prefill + decode graphs are far more CoreML-friendly for a fixed dictation prompt/audio regime. The trade-off: if the Swift prompt structure changes, the models must be re-traced.

Contents

canary_prefill_static.mlpackage/   – Fixed-shape prefill (164 tokens)
canary_decode_static.mlpackage/    – Fixed-shape single-token decode with KV cache

Architecture

  • Base model: Qwen/Qwen3-1.7B (decoder only – encoder is separate)
  • LoRA: nvidia/canary-qwen-2.5b LoRA merged at scale 256/128
  • Layers: 28 transformer layers, hidden=2048, GQA with 8 KV heads, head_dim=128
  • Max sequence length: 256
  • Static prefill length: 164 (Muesli/Swift prompt shape)
  • KV cache state dtype: float16
  • LM head: separate artifact (not fused – fused path was slower in benchmarks)
  • RoPE theta: 1,000,000
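From these numbers, the fixed-size float16 KV cache footprint works out as follows (my arithmetic from the figures above, not a number reported by the trace):

```python
# K and V tensors per layer, float16 (2 bytes), at the static max sequence length
layers, kv_heads, head_dim, max_seq = 28, 8, 128, 256
bytes_per_fp16 = 2
kv_bytes = layers * 2 * kv_heads * max_seq * head_dim * bytes_per_fp16
print(kv_bytes / 2**20, "MiB")  # 28.0 MiB
```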

Prefill inputs

| Name | Shape | Type |
|---|---|---|
| hidden_states | [1, 164, 2048] | float32 |
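A minimal prefill invocation sketch with coremltools. The model path and the exact output names are assumptions from this card's contents; the predict call is guarded so the snippet runs even where the .mlpackage is absent:

```python
import numpy as np

# 164 embedded prompt tokens (33 text + 126 audio + 5 text) from the upstream embedder
hidden_states = np.zeros((1, 164, 2048), dtype=np.float32)

try:
    import coremltools as ct
    prefill = ct.models.MLModel("canary_prefill_static.mlpackage")
    outputs = prefill.predict({"hidden_states": hidden_states})
except Exception:
    outputs = None  # coremltools or the model package not available
```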

Decode inputs

| Name | Shape | Type |
|---|---|---|
| hidden_states | [1, 1, 2048] | float32 |
| cache_update_mask | [1, 256] | float32 (one-hot at current position) |
| cache_valid_mask | [1, 256] | float32 (1.0 for valid cached positions) |
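Building the two mask inputs for a decode step at absolute position `pos` might look like this (a sketch; whether `cache_valid_mask` should also cover the current position is an assumption to verify against the traced graph):

```python
import numpy as np

MAX_SEQ = 256

def decode_masks(pos):
    """Masks for decoding the token at absolute position `pos`."""
    cache_update_mask = np.zeros((1, MAX_SEQ), dtype=np.float32)
    cache_update_mask[0, pos] = 1.0           # one-hot at the current position
    cache_valid_mask = np.zeros((1, MAX_SEQ), dtype=np.float32)
    cache_valid_mask[0, :pos] = 1.0           # positions already written to the cache
    return cache_update_mask, cache_valid_mask

update, valid = decode_masks(164)             # first decode step after the 164-token prefill
```

These go into the decode model's predict call alongside the [1, 1, 2048] hidden state for the current token.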

Provenance

  • Traced on Modal (CPU-only, torch 2.6.0) from nvidia/canary-qwen-2.5b safetensors
  • Converted to CoreML on macOS with coremltools 9.0
  • Trace run: 20260327-020004-static-prefill-164-sliceassign
  • Cache update uses slice-assignment pattern (earlier .copy_() pattern failed CoreML conversion)
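The slice-assignment pattern mentioned above, sketched in NumPy (the actual trace uses the torch equivalent; the functional copy here stands in for CoreML's out-of-place state update):

```python
import numpy as np

def cache_slice_assign(cache, new_kv, start):
    """Write new_kv into cache[:, :, start:start+L, :] via slice assignment,
    the pattern that converted cleanly where tensor.copy_() did not."""
    out = cache.copy()
    out[:, :, start:start + new_kv.shape[2], :] = new_kv
    return out

cache = np.zeros((1, 8, 256, 128), dtype=np.float16)
new_kv = np.ones((1, 8, 164, 128), dtype=np.float16)
cache = cache_slice_assign(cache, new_kv, 0)  # prefill writes positions 0..163
```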
