---
license: apache-2.0
library_name: coreml
base_model: Qwen/Qwen3.5-2B
tags:
- coreml
- apple-silicon
- ane
- on-device
- qwen3.5
- text-generation
pipeline_tag: text-generation
---
## Use it from Swift
<!-- swift-usage-begin -->
### Add the package
`Package.swift`:
```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```
Platforms: iOS 18+ / macOS 15+.
### Download + chat (one call)
```swift
import CoreMLLLM
// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-2B-CoreML")
let stream = try await llm.generate(
[CoreMLLLM.Message(role: .user, content: "Hello!")],
maxTokens: 256
)
for await chunk in stream {
print(chunk, terminator: "")
}
```
Multi-turn: keep an `[CoreMLLLM.Message]` array, append the
user/assistant turns, and pass the whole history to
`generate(_:)` again. Call `llm.reset()` to start a new
conversation (clears the KV cache).
<!-- swift-usage-end -->
# Qwen3.5-2B: Core ML (ANE chunked)
Core ML port of [`Qwen/Qwen3.5-2B`](https://huggingface.co/Qwen/Qwen3.5-2B), split into 4 INT8 chunks + a raw fp16 embedding sidecar so every chunk fits the iPhone ANE single-mlprogram compile envelope.
**iPhone 17 Pro (A18) measured:** 17 tok/s decode, ~200 MB `phys_footprint`, 0 GB sustained Metal heap, ~91 % ANE op placement across all 4 body chunks. First-load ANE compile ≈ 15 min across chunks (cached after).
## Files
```
qwen3_5_2b_decode_chunks/
├── chunk_a.mlpackage   # 340 MB - embed + layers 0-5 + their states
├── chunk_b.mlpackage   # 340 MB - layers 6-11 + states
├── chunk_c.mlpackage   # 340 MB - layers 12-17 + states
├── chunk_d.mlpackage   # 850 MB - layers 18-23 + final_norm + lm_head
└── embed_weight.bin    # 1.02 GB - raw fp16 embed table (248320 × 2048)
```
**All 5 pieces are required.** They chain hidden→hidden across chunks per token, plus 48 state tensors (24 layers × 2 states each) carried inside the mlpackages.
The embed is **not** an mlpackage on purpose: Swift `mmap`s the raw fp16 file so the 1 GB embed table stays in clean virtual pages and only the rows actually touched per prompt page in. Loading the embed as a Core ML weight would dequantize the entire table into the CPU heap and add ~1 GB to `phys_footprint`.
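The size and lookup pattern described above can be checked with a small sketch. It uses a tiny stand-in file (`embed_demo.bin`, a hypothetical name) with the same row-major fp16 layout, so it runs without downloading the real 1 GB table. Only the shape `(248320, 2048)` comes from this README.

```python
import os
import tempfile

import numpy as np

VOCAB, HIDDEN = 248320, 2048   # shape stated in the Files section
BYTES_PER_FP16 = 2

# On-disk size of the real table: rows * columns * 2 bytes.
full_table_bytes = VOCAB * HIDDEN * BYTES_PER_FP16
print(f"{full_table_bytes / 1e9:.2f} GB")   # prints "1.02 GB"

# Stand-in table (assumption: same row-major fp16 layout, far fewer rows).
demo_vocab = 16
rng = np.random.default_rng(0)
table = rng.standard_normal((demo_vocab, HIDDEN)).astype(np.float16)
path = os.path.join(tempfile.mkdtemp(), "embed_demo.bin")
table.tofile(path)

# Memory-map the file read-only; no row is read until it is indexed.
embed = np.memmap(path, dtype=np.float16, mode="r", shape=(demo_vocab, HIDDEN))

# Looking up one token copies a single 2048-element row (4 KB) into memory.
token_id = 7
hidden = np.asarray(embed[token_id]).reshape(1, 1, HIDDEN)
```

Only the pages backing the indexed rows are read from disk, which is why the mapped table adds little to resident memory.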
## What this repo does NOT ship
- **No `model_config.json`.** Core ML serializes input/output shapes into each `.mlpackage` directly, so `coremltools` loads it without an external config.
- **No tokenizer.** Fetch it from the base model:
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
```
## Standalone usage (Python / Mac)
```python
import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
local = snapshot_download("mlboydaisuke/qwen3.5-2B-CoreML")
root = f"{local}/qwen3_5_2b_decode_chunks"
chunks = [
ct.models.MLModel(f"{root}/chunk_{x}.mlpackage")
for x in ("a", "b", "c", "d")
]
embed = np.memmap(f"{root}/embed_weight.bin",
dtype=np.float16, mode="r",
shape=(248320, 2048))
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
```
Per decode step:
1. Look up `embed[token_id]` → `hidden (1, 1, 2048)` fp16
2. Pass `hidden` + scalar inputs (position, cos, sin) + state slice to `chunk_a.predict(...)`, take its `hidden_out` and updated states.
3. Repeat for `chunk_b`, `chunk_c`, `chunk_d`.
4. `chunk_d` emits `logits (1, 1, 248320)` fp16; argmax (or sample) it and feed back as `input_token` for the next step.
5. Map `new_state_*` outputs to the next call's `state_*` inputs.
Full reference Python loop: [`conversion/qwen35_2b_chunks_parity.py`](https://github.com/john-rocky/CoreML-LLM/blob/main/conversion/qwen35_2b_chunks_parity.py).
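As a rough illustration of the steps above, here is a minimal sketch of the chaining logic. It is not the reference loop. The feature names (`hidden`, `hidden_out`, `logits`, `state_*`, `new_state_*`, `position`) follow the wording of the steps, but the exact names in the shipped mlpackages are an assumption. `make_fake_chunk` is a hypothetical stand-in so the sketch runs without Core ML. With real models, each callable would wrap `chunk.predict`.

```python
import numpy as np

def decode_step(chunks, embed, token_id, states, extras):
    """Run one token through every chunk; return (next_token, next_states).

    chunks : list of callables mapping an input dict to an output dict
    states : list of dicts, one per chunk, holding that chunk's state_* arrays
    extras : per-step inputs shared by all chunks (position, cos, sin, ...)
    """
    # Step 1: embedding lookup -> hidden (1, 1, hidden_size) fp16.
    hidden = np.asarray(embed[token_id], dtype=np.float16).reshape(1, 1, -1)
    next_states, out = [], {}
    for predict, chunk_states in zip(chunks, states):
        # Steps 2-3: feed hidden, shared inputs, and this chunk's states.
        out = predict({"hidden": hidden, **extras, **chunk_states})
        # Step 5: each new_state_* output becomes the matching state_* input.
        next_states.append({
            name[len("new_"):]: value
            for name, value in out.items()
            if name.startswith("new_state_")
        })
        if "hidden_out" in out:
            hidden = out["hidden_out"]
    # Step 4: greedy argmax over the last chunk's logits.
    next_token = int(np.argmax(out["logits"][0, -1]))
    return next_token, next_states

def make_fake_chunk(last=False, vocab=32):
    """Hypothetical stand-in for a chunk: counts calls in state_0."""
    def predict(inputs):
        out = {"new_state_0": inputs["state_0"] + 1}
        if last:
            logits = np.zeros((1, 1, vocab), dtype=np.float16)
            logits[0, 0, int(inputs["state_0"][0]) % vocab] = 1.0
            out["logits"] = logits
        else:
            out["hidden_out"] = inputs["hidden"] + np.float16(1.0)
        return out
    return predict

embed = np.zeros((32, 8), dtype=np.float16)          # toy embedding table
chunks = [make_fake_chunk() for _ in range(3)] + [make_fake_chunk(last=True)]
states = [{"state_0": np.zeros(1, dtype=np.int32)} for _ in chunks]

token, generated = 5, []
for pos in range(3):
    extras = {"position": np.array([pos], dtype=np.int32)}
    token, states = decode_step(chunks, embed, token, states, extras)
    generated.append(token)
```

Replacing `np.argmax` with a sampler changes only the last two lines of `decode_step`. The state hand-off stays the same.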
## iOS / Mac app
[`Qwen35Generator.swift`](https://github.com/john-rocky/CoreML-LLM/blob/main/Examples/CoreMLLLMChat/CoreMLLLMChat/Qwen35Generator.swift) handles the chunk chaining + embed mmap. Tap **Qwen3.5 2B (ANE)** in the model picker.
## Architecture
Hybrid Gated DeltaNet + GQA, 24 layers, interleaved `[L L L F] × 6`.
| | linear_attention | full_attention |
|---|---|---|
| count | 18 | 6 |
| state A | `(1, 6144, 4)` | `(1, 2, 2048, 256)` |
| state B | `(1, 16, 128, 128)` | `(1, 2, 2048, 256)` |
Hidden=2048, vocab=248320, head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=2048.
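The numbers above can be cross-checked with a short calculation. This sketch reads the pattern as three linear-attention layers followed by one full-attention layer, repeated six times, and takes the state shapes from the table. The fp16 state dtype used for the byte count is an assumption, not something this README states.

```python
import math

# [L L L F] repeated 6 times gives 24 layers.
PATTERN = ["linear_attention"] * 3 + ["full_attention"]
layers = PATTERN * 6

# State A and state B shapes per layer type, copied from the table.
STATE_SHAPES = {
    "linear_attention": [(1, 6144, 4), (1, 16, 128, 128)],
    "full_attention": [(1, 2, 2048, 256), (1, 2, 2048, 256)],
}

counts = {kind: layers.count(kind) for kind in STATE_SHAPES}
num_states = sum(len(STATE_SHAPES[kind]) for kind in layers)
full_layers = [i for i, kind in enumerate(layers) if kind == "full_attention"]

# Total state size, assuming 2 bytes per element (fp16 is an assumption).
state_elems = sum(math.prod(shape) for kind in layers for shape in STATE_SHAPES[kind])
state_bytes = 2 * state_elems

# Partial rotary: 25 % of the 256-wide full-attention head.
rotary_dim = int(0.25 * 256)

# Six consecutive layers per chunk, as in the Files listing.
chunk_of = {i: "abcd"[i // 6] for i in range(len(layers))}

print(counts, num_states, rotary_dim)
print(f"{state_bytes / 2**20:.1f} MiB of state")
```

Under that dtype assumption, the six full-attention layers account for most of the state memory because their state size scales with `max_seq`.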
## Conversion
```bash
python conversion/build_qwen35_2b_decode_chunks.py \
--out-dir ./output \
--max-seq 2048 --nbits 8
```
## License
Apache 2.0 (inherited from the base model).