Use it from Swift

Add the package

Package.swift:

// In dependencies:
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),

Platforms: iOS 18+ / macOS 15+.

Download + chat (one call)

import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-0.8B-CoreML")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}

Multi-turn: keep a [CoreMLLLM.Message] array, append the user and assistant turns, and pass the whole history to generate again. Call llm.reset() to start a new conversation (it clears the KV cache).

Qwen3.5-0.8B – Core ML (ANE / GPU / CPU)

Core ML port of Qwen/Qwen3.5-0.8B (Gated DeltaNet hybrid SSM + attention) optimized for Apple Silicon.

Measured on iPhone 17 Pro (A18):

| config | prefill tok/s | decode tok/s | Metal heap | vs LiteRT-LM |
|---|---|---|---|---|
| CPU+ANE (recommended) | 170.5 | 22.1 | 0 GB | prefill 3.0× |
| CPU+GPU (bit-exact) | 267.6 | 27.7 | ~3 GB | prefill 4.7× |
| CPU only | broken* | 20.3 | 0 GB | – |

*CPU prefill hits an iOS 26.1 Core ML runtime bug on this graph.

Files

| File | Size | Role |
|---|---|---|
| qwen3_5_0_8b_decode_int8_mseq128.mlpackage | 754 MB | Default decode: INT8 palettized, top-3 parity vs fp32 oracle. Load this. |
| qwen3_5_0_8b_decode_fp16_mseq128.mlpackage | 1.5 GB | fp16 ground-truth decode. Use only for parity debugging. |
| qwen3_5_0_8b_prefill_stateful_fp16_seq64.mlpackage | 1.5 GB | Optional stateful prefill (seq=64). The end-to-end runner uses decode for both prefill and decode by default. |
| qwen3_5_0_8b_fp16_seq64.mlpackage | 1.5 GB | Mac-only monolithic prefill+head. Reference; the iPhone ANE compile budget rejects it. |
| qwen3_5_chunk_a.mlpackage + qwen3_5_chunk_b.mlpackage | 750 MB total | Experimental 2-chunk prefill. Not used by the shipping runtime. |

Pick one mlpackage: they are independent runtimes for the same model. The default Swift app uses qwen3_5_0_8b_decode_int8_mseq128.mlpackage and runs both prefill and decode through it in a single-token loop (the same pattern as the Python step loop under Standalone usage below). All weights are inside the mlpackage; there are no external .bin sidecars.

What this repo does NOT ship

  • No model_config.json β€” Core ML serializes the input/output shapes and dtypes directly into .mlpackage/Data/com.apple.CoreML/model.mlmodel, so coremltools loads it without an external config. You only need the architecture summary below if you are writing your own state-management loop.
  • No tokenizer β€” fetch from the base model:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")

The Swift app fetches it the same way via swift-transformers.
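
To build prompt token ids from chat messages in Python, the standard transformers chat-template API works (the message content here is only an example):

messages = [{"role": "user", "content": "Hello!"}]
ids = tok.apply_chat_template(messages, add_generation_prompt=True)  # list of prompt token ids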

Standalone usage (Python / Mac)

import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download(
    "mlboydaisuke/qwen3.5-0.8B-CoreML",
    allow_patterns=["qwen3_5_0_8b_decode_int8_mseq128.mlpackage/*"],
)
mdl = ct.models.MLModel(f"{local}/qwen3_5_0_8b_decode_int8_mseq128.mlpackage")

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")

# Inspect the input/output names + state shapes baked into the mlpackage:
spec = mdl.get_spec()
for inp in spec.description.input:
    print(inp.name, inp.type)

The model expects per-step inputs input_token (1,1) int32, position (1,) float32, cos/sin (rotary tables), and 48 state tensors (state_0_a..state_23_b). Layers with i % 4 == 3 use a full-attention KV state of shape [1, 2, 128, 256]; the rest use a linear-attention state pair of shapes [1, 6144, 4] and [1, 16, 128, 128]. Initialize all states to zero, then chain step by step, feeding the model's new_state_* outputs as the next call's state_* inputs.
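
A minimal greedy step loop along those lines, reusing mdl and tok from the snippet above. This is a sketch, not the shipping runner: the "logits" output name, the per-step cos/sin slice shapes, and the standard RoPE construction from the Architecture constants below are assumptions; verify them against the printed descriptions.

rotary_dim, theta, max_seq = 64, 1e7, 128            # from the Architecture section
inv_freq = 1.0 / theta ** (np.arange(0, rotary_dim, 2) / rotary_dim)

# Zero-init the 48 states with the shapes listed above.
states = {}
for i in range(24):
    if i % 4 == 3:   # full attention: KV pair
        shapes = ((1, 2, 128, 256), (1, 2, 128, 256))
    else:            # linear attention: conv buffer + recurrent state
        shapes = ((1, 6144, 4), (1, 16, 128, 128))
    states[f"state_{i}_a"] = np.zeros(shapes[0], dtype=np.float32)
    states[f"state_{i}_b"] = np.zeros(shapes[1], dtype=np.float32)

tokens = tok.apply_chat_template([{"role": "user", "content": "Hello!"}],
                                 add_generation_prompt=True)
prompt_len = len(tokens)
pos = 0
while pos < min(len(tokens), max_seq):
    ang = pos * inv_freq                             # RoPE angles for this position
    out = mdl.predict({
        "input_token": np.array([[tokens[pos]]], dtype=np.int32),
        "position": np.array([pos], dtype=np.float32),
        "cos": np.cos(ang)[None].astype(np.float32),  # slice shape is an assumption
        "sin": np.sin(ang)[None].astype(np.float32),
        **states,
    })
    states = {k: out["new_" + k] for k in states}    # chain new_state_* -> state_*
    if pos == len(tokens) - 1:                       # past the prompt: greedy decode
        nxt = int(np.asarray(out["logits"]).reshape(-1).argmax())
        if nxt == tok.eos_token_id:
            break
        tokens.append(nxt)
    pos += 1

print(tok.decode(tokens[prompt_len:]))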

For a complete reference loop in Python, see conversion/qwen35_2b_chunks_parity.py in the source repo (the chunked 2B parity harness; the same state plumbing applies to 0.8B).

iOS / Mac app

Drop-in Swift runtime: Qwen35Generator.swift in john-rocky/CoreML-LLM. The model picker auto-downloads this repo when you tap Qwen3.5 0.8B (ANE).

Architecture

Hybrid (24 layers, interleaved [L L L F] × 6):

| | linear_attention (i % 4 != 3) | full_attention (i % 4 == 3) |
|---|---|---|
| count | 18 | 6 |
| state shape A | (1, 6144, 4) (conv buffer) | (1, 2, 128, 256) (KV) |
| state shape B | (1, 16, 128, 128) (recurrent) | (1, 2, 128, 256) (KV) |

Hidden=1024, vocab=248320, num_kv=2 (full attn), head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=128.
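
The same constants as a Python dict, handy when writing your own loop against this spec (the dict name and layout are illustrative, not part of the repo):

QWEN35_08B = {
    "n_layers": 24,      # [L L L F] x 6; i % 4 == 3 -> full attention
    "hidden": 1024,
    "vocab": 248320,
    "num_kv": 2,         # full-attention layers
    "head_dim": 256,     # full-attention layers
    "rotary_dim": 64,    # rotary partial = 0.25
    "rope_theta": 1e7,
    "max_seq": 128,
}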

Precision

ANE preserves the hidden state at cosine similarity ≥ 0.999 vs fp32 across all 24 layers. Strict greedy top-1 agreement looks like "40%" only because of argmax fragility on the 248K-vocab head: the fp32 oracle's top-1 token is in the ANE top-3 at 100% of tested positions on both Mac M4 and iPhone A18. Sampling-mode generation (temp > 0, top-K, top-P) is effectively indistinguishable from fp32. Full investigation: docs/QWEN35_ROADMAP.md.
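
As a sketch of that metric (hypothetical logits arrays for a single position from the fp32 oracle and the ANE run):

import numpy as np

def oracle_top1_in_top3(fp32_logits, ane_logits):
    # True if the fp32 oracle's argmax token lands in the ANE top-3.
    oracle = int(np.argmax(fp32_logits))
    return oracle in np.argpartition(ane_logits, -3)[-3:]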

License

Apache 2.0 (inherits from the base model).
