---
license: apache-2.0
library_name: coreml
base_model: Qwen/Qwen3.5-2B
tags:
  - coreml
  - apple-silicon
  - ane
  - on-device
  - qwen3.5
  - text-generation
pipeline_tag: text-generation
---

## Use it from Swift

<!-- swift-usage-begin -->
### Add the package

`Package.swift`:

```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```

Platforms: iOS 18+ / macOS 15+.

### Download + chat (one call)

```swift
import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-2B-CoreML")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
```

Multi-turn: keep a `[CoreMLLLM.Message]` array, append each
user/assistant turn, and pass the whole history to
`generate(_:)` again. Call `llm.reset()` to start a new
conversation (clears the KV cache).
<!-- swift-usage-end -->


# Qwen3.5-2B — Core ML (ANE chunked)

Core ML port of [`Qwen/Qwen3.5-2B`](https://huggingface.co/Qwen/Qwen3.5-2B), split into 4 INT8 chunks + a raw fp16 embedding sidecar so every chunk fits the iPhone ANE single-mlprogram compile envelope.

**iPhone 17 Pro (A18) measured:** 17 tok/s decode, ~200 MB `phys_footprint`, 0 GB sustained Metal heap, ~91 % ANE op placement across all 4 body chunks. First-load ANE compile ≈ 15 min across chunks (cached after).

## Files

```
qwen3_5_2b_decode_chunks/
├── chunk_a.mlpackage      # 340 MB — embed + layers 0-5 + their states
├── chunk_b.mlpackage      # 340 MB — layers 6-11 + states
├── chunk_c.mlpackage      # 340 MB — layers 12-17 + states
├── chunk_d.mlpackage      # 850 MB — layers 18-23 + final_norm + lm_head
└── embed_weight.bin       # 1.02 GB — raw fp16 embed table (248320 × 2048)
```

**All 5 pieces are required.** For every token, the hidden state is chained from one chunk to the next, together with the 48 state tensors (24 layers × 2 states each) carried inside the mlpackages.

The embed is **not** an mlpackage on purpose: Swift `mmap`s the raw fp16 file so the 1 GB embed table stays in clean virtual pages and only the rows actually touched per prompt page in. Loading the embed as a Core ML weight would dequantize the entire table into the CPU heap and add ~1 GB to `phys_footprint`.
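
The same lookup pattern can be tried from Python with `np.memmap`. This is a scaled-down sketch, not the Swift implementation: it writes a small stand-in file to a temporary directory instead of using the real `embed_weight.bin`, and the `VOCAB`/`HIDDEN` sizes and the `lookup` helper are illustrative assumptions.

```python
import os
import tempfile

import numpy as np

# Size of the real table, from the shape stated in this README.
REAL_BYTES = 248320 * 2048 * np.dtype(np.float16).itemsize

# Scaled-down stand-in so the sketch runs anywhere (hypothetical sizes).
VOCAB, HIDDEN = 1000, 64
rng = np.random.default_rng(0)
table = rng.standard_normal((VOCAB, HIDDEN)).astype(np.float16)

path = os.path.join(tempfile.mkdtemp(), "embed_weight.bin")
table.tofile(path)  # raw row-major fp16, no header

# mode="r" maps the file read-only; rows are read only when indexed.
embed = np.memmap(path, dtype=np.float16, mode="r", shape=(VOCAB, HIDDEN))


def lookup(token_id: int) -> np.ndarray:
    """Copy one row out of the mapping as a (1, 1, HIDDEN) fp16 tensor."""
    return np.array(embed[token_id], dtype=np.float16).reshape(1, 1, HIDDEN)


hidden = lookup(42)
print(hidden.shape, hidden.dtype, f"real table: {REAL_BYTES / 1e9:.2f} GB")
```

Indexing a single row touches only the file pages that hold that row, which is the property the raw sidecar is meant to preserve.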

## What this repo does NOT ship

- **No `model_config.json`** — Core ML serializes input/output shapes into each `.mlpackage` directly. `coremltools` loads it without external config.
- **No tokenizer** — fetch from the base model:

```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
```

## Standalone usage (Python / Mac)

```python
import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download("mlboydaisuke/qwen3.5-2B-CoreML")
root = f"{local}/qwen3_5_2b_decode_chunks"

chunks = [
    ct.models.MLModel(f"{root}/chunk_{x}.mlpackage")
    for x in ("a", "b", "c", "d")
]
embed = np.memmap(f"{root}/embed_weight.bin",
                  dtype=np.float16, mode="r",
                  shape=(248320, 2048))
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
```

Per decode step:
1. Look up `embed[token_id]` → `hidden (1, 1, 2048)` fp16.
2. Pass `hidden` + the per-position inputs (position, cos, sin) + the chunk's state slice to `chunk_a.predict(...)`, then take its `hidden_out` and updated states.
3. Repeat for `chunk_b`, `chunk_c`, `chunk_d`, feeding each chunk the previous chunk's `hidden_out`.
4. `chunk_d` emits `logits (1, 1, 248320)` fp16; take the argmax (or sample) and feed the result back as `input_token` for the next step.
5. Map `new_state_*` outputs to the next call's `state_*` inputs.
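
The control flow of those steps can be sketched with stand-in chunks. This is an assumption-heavy illustration, not the reference loop: `StubChunk` is a hypothetical class that only mimics the `predict(dict) -> dict` shape of `ct.models.MLModel`, the `position` input name and the per-chunk state name lists are made up, and the cos/sin inputs are left out for brevity. The real input and output names should be read from each mlpackage.

```python
import numpy as np

HIDDEN, VOCAB = 8, 32  # tiny stand-ins for 2048 and 248320


class StubChunk:
    """Hypothetical stand-in with the predict(dict) -> dict shape of MLModel."""

    def __init__(self, state_names, last=False, seed=0):
        self.state_names = state_names
        self.last = last
        rng = np.random.default_rng(seed)
        self.head = rng.standard_normal((HIDDEN, VOCAB)).astype(np.float32)

    def predict(self, inputs):
        hidden = inputs["hidden"].astype(np.float32)
        # Toy state update: each state counts the steps it has seen.
        out = {f"new_{n}": inputs[n] + 1 for n in self.state_names}
        if self.last:
            out["logits"] = hidden @ self.head  # (1, 1, VOCAB)
        else:
            out["hidden_out"] = hidden + 1.0
        return out


def decode_step(chunks, embed, token_id, position, states):
    """Run one token through every chunk; return the greedy next token id."""
    hidden = embed[token_id].reshape(1, 1, -1)          # step 1
    out = {}
    for model, names in chunks:                         # steps 2-3
        feed = {"hidden": hidden,
                "position": np.array([position], dtype=np.int32)}
        feed.update({n: states[n] for n in names})
        out = model.predict(feed)
        for n in names:                                 # step 5
            states[n] = out[f"new_{n}"]
        hidden = out.get("hidden_out", hidden)
    return int(np.argmax(out["logits"]))                # step 4


names = [[f"state_{i}"] for i in range(4)]
chunks = [(StubChunk(n, last=(i == 3), seed=i), n) for i, n in enumerate(names)]
embed = np.random.default_rng(1).standard_normal((VOCAB, HIDDEN)).astype(np.float16)
states = {n: np.zeros((1,), dtype=np.float16) for group in names for n in group}

token, generated = 3, []
for position in range(5):
    token = decode_step(chunks, embed, token, position, states)
    generated.append(token)
print(generated)
```

Keeping the states in one dictionary keyed by input name makes step 5 a plain rename from `new_state_*` to `state_*`.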

Full reference Python loop: [`conversion/qwen35_2b_chunks_parity.py`](https://github.com/john-rocky/CoreML-LLM/blob/main/conversion/qwen35_2b_chunks_parity.py).
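
Where sampling is preferred over argmax in step 4, a temperature plus top-k sampler is one common choice. The sketch below is a generic NumPy version and is not taken from the reference loop; the `sample_token` name and its default `temperature` and `top_k` values are assumptions.

```python
import numpy as np


def sample_token(logits, temperature=0.8, top_k=40, rng=None):
    """Pick a token id from logits of shape (1, 1, vocab) or (vocab,)."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(logits, dtype=np.float64).reshape(-1)
    if temperature <= 0:
        return int(np.argmax(scores))  # greedy fallback
    k = min(top_k, scores.size)
    top = np.argpartition(scores, -k)[-k:]   # indices of the k largest logits
    scaled = scores[top] / temperature
    scaled -= scaled.max()                   # stabilise the softmax
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))


logits = np.zeros((1, 1, 16), dtype=np.float16)
logits[0, 0, 7] = 10.0
print(sample_token(logits, temperature=0.0))
print(sample_token(logits, rng=np.random.default_rng(0)))
```

The logits are cast up from fp16 before the softmax so that the exponentials and the normalisation do not lose precision.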

## iOS / Mac app

[`Qwen35Generator.swift`](https://github.com/john-rocky/CoreML-LLM/blob/main/Examples/CoreMLLLMChat/CoreMLLLMChat/Qwen35Generator.swift) handles the chunk chaining and the embed mmap. In the example app, tap **Qwen3.5 2B (ANE)** in the model picker.

## Architecture

Hybrid Gated DeltaNet + GQA, 24 layers, interleaved `[L L L F] × 6` (L = linear attention, F = full attention).

| | linear_attention | full_attention |
|---|---|---|
| count | 18 | 6 |
| state A | `(1, 6144, 4)` | `(1, 2, 2048, 256)` |
| state B | `(1, 16, 128, 128)` | `(1, 2, 2048, 256)` |
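
As a quick consistency check, the layer pattern and the state shapes in the table above can be expanded into layer counts, a state tensor count, and a rough state size. This sketch assumes every state is stored as fp16, which the table does not state.

```python
# Layer pattern from this section: [L L L F] repeated 6 times.
PATTERN = ["linear", "linear", "linear", "full"]
layers = PATTERN * 6

# Per-layer state shapes copied from the table above.
STATE_SHAPES = {
    "linear": [(1, 6144, 4), (1, 16, 128, 128)],
    "full": [(1, 2, 2048, 256), (1, 2, 2048, 256)],
}
BYTES_PER_ELEMENT = 2  # assumption: states are fp16


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


state_count = sum(len(STATE_SHAPES[kind]) for kind in layers)
state_bytes = sum(
    numel(shape) * BYTES_PER_ELEMENT
    for kind in layers
    for shape in STATE_SHAPES[kind]
)

print(layers.count("linear"), layers.count("full"), state_count)
print(f"{state_bytes / 1e6:.1f} MB of state at max_seq=2048")
```

Under that fp16 assumption, the six full-attention layers account for most of the state size even though they are the minority of layers, because their KV states scale with `max_seq`.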

Hidden=2048, vocab=248320, head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=2048.
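
The rotary numbers can be turned into per-position cos/sin vectors with the standard RoPE frequency formula. This is a sketch under assumptions: the half-duplicated `[angles, angles]` layout and the fp16 output follow a common convention and may not match the exact tensor layout the chunks expect, and `rope_cos_sin` is a hypothetical helper name.

```python
import numpy as np

HEAD_DIM = 256
PARTIAL_ROTARY = 0.25
ROPE_THETA = 1e7
ROTARY_DIM = int(HEAD_DIM * PARTIAL_ROTARY)  # 256 * 0.25 = 64


def rope_cos_sin(position: int):
    """Return (cos, sin) vectors of length ROTARY_DIM for one position."""
    idx = np.arange(0, ROTARY_DIM, 2, dtype=np.float64)
    inv_freq = ROPE_THETA ** (-idx / ROTARY_DIM)   # ROTARY_DIM / 2 frequencies
    angles = position * inv_freq
    # Assumed layout: the frequency half is repeated to fill ROTARY_DIM.
    angles = np.concatenate([angles, angles])
    return np.cos(angles).astype(np.float16), np.sin(angles).astype(np.float16)


cos, sin = rope_cos_sin(5)
print(cos.shape, sin.shape, ROTARY_DIM)
```

Only the first `ROTARY_DIM` of the 256 head dimensions are rotated; the remaining 192 pass through unchanged.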

## Conversion

```bash
python conversion/build_qwen35_2b_decode_chunks.py \
  --out-dir ./output \
  --max-seq 2048 --nbits 8
```

## License

Apache 2.0 (inherited from the base model).