## Use it from Swift

### Add the package
Package.swift:

```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```
Platforms: iOS 18+ / macOS 15+.
### Download + chat (one call)
```swift
import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-0.8B-CoreML")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
```
Multi-turn: keep a `[CoreMLLLM.Message]` array, append the user/assistant turns, and pass the whole history to `generate(_:)` again. Call `llm.reset()` to start a new conversation (clears the KV cache).
## Qwen3.5-0.8B → Core ML (ANE / GPU / CPU)
Core ML port of Qwen/Qwen3.5-0.8B (Gated DeltaNet hybrid SSM + attention) optimized for Apple Silicon.
Measured on iPhone 17 Pro (A18):
| Config | Prefill tok/s | Decode tok/s | Metal heap | vs LiteRT-LM |
|---|---|---|---|---|
| CPU+ANE (recommended) | 170.5 | 22.1 | 0 GB | prefill 3.0× |
| CPU+GPU (bit-exact) | 267.6 | 27.7 | ~3 GB | prefill 4.7× |
| CPU only | broken* | 20.3 | 0 GB | – |
*CPU prefill hits an iOS 26.1 Core ML runtime bug on this graph.
## Files
| File | Size | Role |
|---|---|---|
| `qwen3_5_0_8b_decode_int8_mseq128.mlpackage` | 754 MB | Default decode: INT8 palettized, top-3 parity vs fp32 oracle. Load this. |
| `qwen3_5_0_8b_decode_fp16_mseq128.mlpackage` | 1.5 GB | fp16 ground-truth decode. Use only for parity debugging. |
| `qwen3_5_0_8b_prefill_stateful_fp16_seq64.mlpackage` | 1.5 GB | Optional stateful prefill (seq=64). The end-to-end runner uses the decode package for both prefill and decode by default. |
| `qwen3_5_0_8b_fp16_seq64.mlpackage` | 1.5 GB | Mac-only monolithic prefill+head. Reference only; the iPhone ANE compile budget rejects it. |
| `qwen3_5_chunk_a.mlpackage` + `qwen3_5_chunk_b.mlpackage` | 750 MB total | Experimental 2-chunk prefill. Not used by the shipping runtime. |
Pick one mlpackage; they are independent runtimes for the same model. The default Swift app uses `qwen3_5_0_8b_decode_int8_mseq128.mlpackage` and runs both prefill and decode through it (single-token loop). All weights are inside the mlpackage; no external `.bin` sidecars.
## What this repo does NOT ship
- No `model_config.json`: Core ML serializes the input/output shapes and dtypes directly into `.mlpackage/Data/com.apple.CoreML/model.mlmodel`, so `coremltools` loads it without an external config. You only need the architecture summary below if you are writing your own state-management loop.
- No tokenizer: fetch it from the base model:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
```
The Swift app fetches it the same way via swift-transformers.
## Standalone usage (Python / Mac)
```python
import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download(
    "mlboydaisuke/qwen3.5-0.8B-CoreML",
    allow_patterns=["qwen3_5_0_8b_decode_int8_mseq128.mlpackage/*"],
)
mdl = ct.models.MLModel(f"{local}/qwen3_5_0_8b_decode_int8_mseq128.mlpackage")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")

# Inspect the input names + state shapes baked into the mlpackage:
for inp in mdl.get_spec().description.input:
    print(inp.name, inp.type)
```
The model expects per-step inputs `input_token` (1,1) int32, `position` (1,) float32, `cos`/`sin` (rotary tables), and 48 state tensors (`state_0_a` .. `state_23_b`). Layers with i % 4 == 3 use full-attention KV state [1, 2, 128, 256]; the rest use linear-attention state pairs [1, 6144, 4] + [1, 16, 128, 128]. Initialize all states to zero, then chain step by step, feeding the model's `new_state_*` outputs as the next call's `state_*` inputs.
For a complete reference loop in Python, see `conversion/qwen35_2b_chunks_parity.py` in the source repo (the chunked 2B parity harness; the same state plumbing applies to 0.8B).
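A minimal single-token loop, continuing from the snippet above, might look like the sketch below. It is not the shipping runtime: the input names (`input_token`, `position`, `cos`, `sin`, `state_*`) follow the description above, but the rotary `cos`/`sin` layout, the greedy argmax, and the assumption that the only output not named `new_state_*` is the logits tensor are unverified here. Treat `rotary()` as a hypothetical placeholder and mirror the parity harness if outputs look wrong.

```python
import numpy as np

spec = mdl.get_spec()
in_shapes = {i.name: tuple(i.type.multiArrayType.shape) for i in spec.description.input}
logits_name = next(o.name for o in spec.description.output
                   if not o.name.startswith("new_state_"))

# rotary_dim = 64, rope_theta = 1e7 (see Architecture below).
inv_freq = 1.0 / (1e7 ** (np.arange(0, 64, 2, dtype=np.float64) / 64))

def rotary(name, pos):
    # Tile per-position cos/sin values into whatever shape the mlpackage declares.
    # The exact layout is an assumption; copy the conversion harness if it differs.
    vals = np.cos(pos * inv_freq) if name == "cos" else np.sin(pos * inv_freq)
    return np.resize(vals, in_shapes[name]).astype(np.float32)

# All 48 recurrent / KV states start at zero.
states = {n: np.zeros(s, dtype=np.float32) for n, s in in_shapes.items()
          if n.startswith("state_")}

prompt = tok("Hello!", return_tensors="np")["input_ids"][0].tolist()
token, out_ids = prompt[0], []
for pos in range(64):                      # stay well under max_seq=128
    feed = dict(states)
    feed["input_token"] = np.array([[token]], dtype=np.int32)
    feed["position"] = np.array([pos], dtype=np.float32)
    feed["cos"], feed["sin"] = rotary("cos", pos), rotary("sin", pos)
    out = mdl.predict(feed)
    # Chain this step's new_state_* outputs into the next step's state_* inputs.
    states = {n[len("new_"):]: v for n, v in out.items() if n.startswith("new_state_")}
    if pos + 1 < len(prompt):              # single-token "prefill" over the prompt
        token = prompt[pos + 1]
    else:                                  # greedy decode
        token = int(np.argmax(out[logits_name]))
        out_ids.append(token)

print(tok.decode(out_ids))
```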
## iOS / Mac app
Drop-in Swift runtime: Qwen35Generator.swift in john-rocky/CoreML-LLM. The model picker auto-downloads this repo when you tap Qwen3.5 0.8B (ANE).
## Architecture
Hybrid (24 layers, interleaved [L L L F] × 6):

| | linear_attention (i % 4 != 3) | full_attention (i % 4 == 3) |
|---|---|---|
| count | 18 | 6 |
| state shape A | (1, 6144, 4) (conv buffer) | (1, 2, 128, 256) (KV) |
| state shape B | (1, 16, 128, 128) (rec) | (1, 2, 128, 256) (KV) |
Hidden=1024, vocab=248320, num_kv=2 (full attn), head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=128.
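As a worked example of the table above, the layer kinds and per-layer state shapes can be enumerated directly. This is a sketch; it assumes the `_a`/`_b` suffixes of the decode inputs map to state shapes A/B in that order.

```python
# [L L L F] x 6: layers 3, 7, 11, 15, 19, 23 are full attention; the other 18 are linear attention.
layers = ["full_attention" if i % 4 == 3 else "linear_attention" for i in range(24)]

state_shapes = {}
for i, kind in enumerate(layers):
    if kind == "full_attention":
        state_shapes[f"state_{i}_a"] = (1, 2, 128, 256)   # KV
        state_shapes[f"state_{i}_b"] = (1, 2, 128, 256)   # KV
    else:
        state_shapes[f"state_{i}_a"] = (1, 6144, 4)        # conv buffer
        state_shapes[f"state_{i}_b"] = (1, 16, 128, 128)   # recurrent state

assert layers.count("linear_attention") == 18 and layers.count("full_attention") == 6
assert len(state_shapes) == 48
```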
## Precision
ANE preserves the hidden state at cosine ≥ 0.999 vs fp32 over all 24 layers. Strict greedy top-1 match looks like "40%" because of argmax fragility on the 248K-vocab head: the fp32 oracle's top-1 token is in the ANE top-3 for 100% of tested positions on both Mac M4 and iPhone A18. Sampling-mode generation (temp > 0, top-K, top-P) is effectively indistinguishable from fp32. Full investigation: `docs/QWEN35_ROADMAP.md`.
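Concretely, the top-3 parity criterion above amounts to the following per-position check (a sketch; the function name and the idea of comparing two raw logits vectors are illustrative, not code from the repo):

```python
import numpy as np

def top3_parity(logits_fp32: np.ndarray, logits_ane: np.ndarray) -> bool:
    """True if the fp32 oracle's greedy (top-1) token is among the ANE model's top-3 logits."""
    ref_top1 = int(np.argmax(logits_fp32))
    ane_top3 = set(np.argsort(logits_ane)[-3:].tolist())
    return ref_top1 in ane_top3
```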
## License
Apache 2.0 (inherits from the base model).