## Use it from Swift

### Add the package
Package.swift:

```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```
Platforms: iOS 18+ / macOS 15+.
### Download + chat (one call)
```swift
import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-0.8B-CoreML")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
```
Multi-turn: keep a `[CoreMLLLM.Message]` array, append the user/assistant turns, and pass the whole history to `generate(_:)` again. Call `llm.reset()` to start a new conversation (clears the KV cache).
## Qwen3.5-0.8B → Core ML (ANE / GPU / CPU)
Core ML port of Qwen/Qwen3.5-0.8B (Gated DeltaNet hybrid SSM + attention) optimized for Apple Silicon.
Measured on iPhone 17 Pro (A18):
| Config | Prefill tok/s | Decode tok/s | Metal heap | vs LiteRT-LM |
|---|---|---|---|---|
| CPU+ANE (recommended) | 170.5 | 22.1 | 0 GB | prefill 3.0× |
| CPU+GPU (bit-exact) | 267.6 | 27.7 | ~3 GB | prefill 4.7× |
| CPU only | broken* | 20.3 | 0 GB | – |
*CPU prefill hits an iOS 26.1 Core ML runtime bug on this graph.
## Files
| File | Size | Role |
|---|---|---|
| `qwen3_5_0_8b_decode_int8_mseq128.mlpackage` | 754 MB | Default decode: INT8 palettized, top-3 parity vs fp32 oracle. Load this. |
| `qwen3_5_0_8b_decode_fp16_mseq128.mlpackage` | 1.5 GB | fp16 ground-truth decode. Use only for parity debugging. |
| `qwen3_5_0_8b_prefill_stateful_fp16_seq64.mlpackage` | 1.5 GB | Optional stateful prefill (seq=64). The end-to-end runner uses the decode package for both prefill and decode by default. |
| `qwen3_5_0_8b_fp16_seq64.mlpackage` | 1.5 GB | Mac-only monolithic prefill+head. Reference only; the iPhone ANE compile budget rejects it. |
| `qwen3_5_chunk_a.mlpackage` + `qwen3_5_chunk_b.mlpackage` | 750 MB total | Experimental 2-chunk prefill. Not used by the shipping runtime. |
Pick one mlpackage; they are independent runtimes for the same model. The default Swift app uses `qwen3_5_0_8b_decode_int8_mseq128.mlpackage` and runs both prefill and decode through it (single-token loop). All weights are inside the mlpackage; no external `.bin` sidecars.
## What this repo does NOT ship
- No `model_config.json`: Core ML serializes the input/output shapes and dtypes directly into `.mlpackage/Data/com.apple.CoreML/model.mlmodel`, so `coremltools` loads it without an external config. You only need the architecture summary below if you are writing your own state-management loop.
- No tokenizer: fetch it from the base model:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
```
The Swift app fetches it the same way via swift-transformers.
## Standalone usage (Python / Mac)
```python
import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download(
    "mlboydaisuke/qwen3.5-0.8B-CoreML",
    allow_patterns=["qwen3_5_0_8b_decode_int8_mseq128.mlpackage/*"],
)
mdl = ct.models.MLModel(f"{local}/qwen3_5_0_8b_decode_int8_mseq128.mlpackage")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")

# Inspect the input names + state shapes baked into the mlpackage:
for inp in mdl.get_spec().description.input:
    print(inp.name, inp.type)
```
The model expects per-step inputs `input_token` (1,1) int32, `position` (1,) float32, `cos`/`sin` (rotary tables), and 48 state tensors (`state_0_a` .. `state_23_b`). Layers with i % 4 == 3 use full-attention KV state [1, 2, 128, 256]; the rest use linear-attention state pairs [1, 6144, 4] + [1, 16, 128, 128]. Initialize all states to zero, then chain step by step, feeding the model's `new_state_*` outputs as the next call's `state_*` inputs.
For a complete reference loop in Python, see `conversion/qwen35_2b_chunks_parity.py` in the source repo (the chunked 2B parity harness; the same state plumbing applies to 0.8B).
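A minimal single-token loop, continuing from the snippet above, might look like the sketch below. It is not the shipping runtime: the input names (`input_token`, `position`, `cos`, `sin`, `state_*`) follow the description above, but the rotary `cos`/`sin` layout, the greedy argmax, and the assumption that the only output not named `new_state_*` is the logits tensor are unverified here. Treat `rotary()` as a hypothetical placeholder and mirror the parity harness if outputs look wrong.

```python
import numpy as np

spec = mdl.get_spec()
in_shapes = {i.name: tuple(i.type.multiArrayType.shape) for i in spec.description.input}
logits_name = next(o.name for o in spec.description.output
                   if not o.name.startswith("new_state_"))

# rotary_dim = 64, rope_theta = 1e7 (see Architecture below).
inv_freq = 1.0 / (1e7 ** (np.arange(0, 64, 2, dtype=np.float64) / 64))

def rotary(name, pos):
    # Tile per-position cos/sin values into whatever shape the mlpackage declares.
    # The exact layout is an assumption; copy the conversion harness if it differs.
    vals = np.cos(pos * inv_freq) if name == "cos" else np.sin(pos * inv_freq)
    return np.resize(vals, in_shapes[name]).astype(np.float32)

# All 48 recurrent / KV states start at zero.
states = {n: np.zeros(s, dtype=np.float32) for n, s in in_shapes.items()
          if n.startswith("state_")}

prompt = tok("Hello!", return_tensors="np")["input_ids"][0].tolist()
token, out_ids = prompt[0], []
for pos in range(64):                      # stay well under max_seq=128
    feed = dict(states)
    feed["input_token"] = np.array([[token]], dtype=np.int32)
    feed["position"] = np.array([pos], dtype=np.float32)
    feed["cos"], feed["sin"] = rotary("cos", pos), rotary("sin", pos)
    out = mdl.predict(feed)
    # Chain this step's new_state_* outputs into the next step's state_* inputs.
    states = {n[len("new_"):]: v for n, v in out.items() if n.startswith("new_state_")}
    if pos + 1 < len(prompt):              # single-token "prefill" over the prompt
        token = prompt[pos + 1]
    else:                                  # greedy decode
        token = int(np.argmax(out[logits_name]))
        out_ids.append(token)

print(tok.decode(out_ids))
```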
## iOS / Mac app
Drop-in Swift runtime: Qwen35Generator.swift in john-rocky/CoreML-LLM. The model picker auto-downloads this repo when you tap Qwen3.5 0.8B (ANE).
## Architecture
Hybrid (24 layers, interleaved [L L L F] × 6):

| | linear_attention (i % 4 != 3) | full_attention (i % 4 == 3) |
|---|---|---|
| count | 18 | 6 |
| state shape A | (1, 6144, 4) (conv buffer) | (1, 2, 128, 256) (KV) |
| state shape B | (1, 16, 128, 128) (rec) | (1, 2, 128, 256) (KV) |
Hidden=1024, vocab=248320, num_kv=2 (full attn), head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=128.
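As a worked example of the table above, the layer kinds and per-layer state shapes can be enumerated directly. This is a sketch; it assumes the `_a`/`_b` suffixes of the decode inputs map to state shapes A/B in that order.

```python
# [L L L F] x 6: layers 3, 7, 11, 15, 19, 23 are full attention; the other 18 are linear attention.
layers = ["full_attention" if i % 4 == 3 else "linear_attention" for i in range(24)]

state_shapes = {}
for i, kind in enumerate(layers):
    if kind == "full_attention":
        state_shapes[f"state_{i}_a"] = (1, 2, 128, 256)   # KV
        state_shapes[f"state_{i}_b"] = (1, 2, 128, 256)   # KV
    else:
        state_shapes[f"state_{i}_a"] = (1, 6144, 4)        # conv buffer
        state_shapes[f"state_{i}_b"] = (1, 16, 128, 128)   # recurrent state

assert layers.count("linear_attention") == 18 and layers.count("full_attention") == 6
assert len(state_shapes) == 48
```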
## Precision
ANE preserves the hidden state at cosine ≥ 0.999 vs fp32 over all 24 layers. Strict greedy top-1 match looks like "40%" because of argmax fragility on the 248K-vocab head: the fp32 oracle's top-1 token is in the ANE top-3 for 100% of tested positions on both Mac M4 and iPhone A18. Sampling-mode generation (temp > 0, top-K, top-P) is effectively indistinguishable from fp32. Full investigation: `docs/QWEN35_ROADMAP.md`.
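Concretely, the top-3 parity criterion above amounts to the following per-position check (a sketch; the function name and the idea of comparing two raw logits vectors are illustrative, not code from the repo):

```python
import numpy as np

def top3_parity(logits_fp32: np.ndarray, logits_ane: np.ndarray) -> bool:
    """True if the fp32 oracle's greedy (top-1) token is among the ANE model's top-3 logits."""
    ref_top1 = int(np.argmax(logits_fp32))
    ane_top3 = set(np.argsort(logits_ane)[-3:].tolist())
    return ref_top1 in ane_top3
```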
## License
Apache 2.0 (inherits from the base model).