| --- |
| license: apache-2.0 |
| library_name: coreml |
| base_model: Qwen/Qwen3.5-2B |
| tags: |
| - coreml |
| - apple-silicon |
| - ane |
| - on-device |
| - qwen3.5 |
| - text-generation |
| pipeline_tag: text-generation |
| --- |
| |
| ## Use it from Swift |
|
|
| <!-- swift-usage-begin --> |
| ### Add the package |
|
|
| `Package.swift`: |
|
|
| ```swift |
| .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"), |
| |
| // In your target: |
| .product(name: "CoreMLLLM", package: "CoreML-LLM"), |
| ``` |
|
|
| Platforms: iOS 18+ / macOS 15+. |
|
|
| ### Download + chat (one call) |
|
|
| ```swift |
| import CoreMLLLM |
| |
| // First call pulls the bundle from this repo to Documents/Models/. |
| // Subsequent calls reuse the on-disk copy. |
| let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-2B-CoreML") |
| |
| let stream = try await llm.generate( |
|     [CoreMLLLM.Message(role: .user, content: "Hello!")], |
|     maxTokens: 256 |
| ) |
| for await chunk in stream { |
|     print(chunk, terminator: "") |
| } |
| ``` |
|
|
| Multi-turn: keep a `[CoreMLLLM.Message]` array, append the |
| user/assistant turns, and pass the whole history to |
| `generate(_:)` again. Call `llm.reset()` to start a new |
| conversation (clears the KV cache). |
| <!-- swift-usage-end --> |
|
|
|
|
|
|
| # Qwen3.5-2B: Core ML (ANE chunked) |
|
|
| Core ML port of [`Qwen/Qwen3.5-2B`](https://huggingface.co/Qwen/Qwen3.5-2B), split into 4 INT8 chunks + a raw fp16 embedding sidecar so every chunk fits the iPhone ANE single-mlprogram compile envelope. |
|
|
| **iPhone 17 Pro (A18) measured:** 17 tok/s decode, ~200 MB `phys_footprint`, 0 GB sustained Metal heap, ~91 % ANE op placement across all 4 body chunks. First-load ANE compile ≈ 15 min across chunks (cached after). |
|
|
| ## Files |
|
|
| ``` |
| qwen3_5_2b_decode_chunks/ |
| ├── chunk_a.mlpackage # 340 MB - embed + layers 0-5 + their states |
| ├── chunk_b.mlpackage # 340 MB - layers 6-11 + states |
| ├── chunk_c.mlpackage # 340 MB - layers 12-17 + states |
| ├── chunk_d.mlpackage # 850 MB - layers 18-23 + final_norm + lm_head |
| └── embed_weight.bin # 1.02 GB - raw fp16 embed table (248320 × 2048) |
| ``` |
|
|
| **All 5 pieces are required.** They chain hidden→hidden across chunks per token, plus 48 state tensors (24 layers × 2 states each) carried inside the mlpackages. |
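
Because a missing piece only surfaces at load time, it can help to check the download first. The following is a minimal sketch; the file names come from the listing above, while `missing_pieces` is a hypothetical helper and not part of this repo or of `coremltools`.

```python
from pathlib import Path

# File names taken from the listing above; the helper itself is hypothetical.
REQUIRED = [
    "chunk_a.mlpackage",
    "chunk_b.mlpackage",
    "chunk_c.mlpackage",
    "chunk_d.mlpackage",
    "embed_weight.bin",
]


def missing_pieces(root):
    """Return the required entries that are absent from `root`."""
    root = Path(root)
    return [name for name in REQUIRED if not (root / name).exists()]
```

Calling `missing_pieces(f"{local}/qwen3_5_2b_decode_chunks")` after `snapshot_download` should return an empty list when the bundle is complete.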
|
|
| The embed is **not** an mlpackage on purpose: Swift `mmap`s the raw fp16 file so the 1 GB embed table stays in clean virtual pages and only the rows actually touched per prompt page in. Loading the embed as a Core ML weight would dequantize the entire table into the CPU heap and add ~1 GB to `phys_footprint`. |
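
The same idea can be tried from Python with `numpy.memmap`. The sketch below uses a scaled-down stand-in table (the sizes `demo_vocab` and `demo_hidden` and the helper `lookup` are illustrative assumptions) so it runs without the 1 GB file; it also spells out where the 1.02 GB figure comes from.

```python
import os
import tempfile

import numpy as np

VOCAB, HIDDEN = 248320, 2048       # shape of the real table, from the listing above
full_bytes = VOCAB * HIDDEN * 2    # fp16 is 2 bytes per element, about 1.02 GB

# Scaled-down stand-in so the demo does not need the real file.
demo_vocab, demo_hidden = 1000, 8
path = os.path.join(tempfile.mkdtemp(), "embed_demo.bin")
table = (np.arange(demo_vocab * demo_hidden) % 97).astype(np.float16)
table = table.reshape(demo_vocab, demo_hidden)
table.tofile(path)

# Mapping the file reads nothing up front; pages load when rows are touched.
embed = np.memmap(path, dtype=np.float16, mode="r",
                  shape=(demo_vocab, demo_hidden))


def lookup(token_id):
    """Copy one row out of the mapping, shaped (1, 1, hidden) like the decode input."""
    return np.array(embed[token_id], dtype=np.float16).reshape(1, 1, -1)


hidden = lookup(42)
```

Only the row for token 42 is copied into regular memory here; the rest of the file stays on disk until it is indexed.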
|
|
| ## What this repo does NOT ship |
|
|
| - **No `model_config.json`.** Core ML serializes input/output shapes into each `.mlpackage` directly, so `coremltools` loads it without external config. |
| - **No tokenizer.** Fetch it from the base model: |
| |
| ```python |
| from transformers import AutoTokenizer |
| tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B") |
| ``` |
| |
| ## Standalone usage (Python / Mac) |
| |
| ```python |
| import coremltools as ct |
| import numpy as np |
| from huggingface_hub import snapshot_download |
| from transformers import AutoTokenizer |
| |
| local = snapshot_download("mlboydaisuke/qwen3.5-2B-CoreML") |
| root = f"{local}/qwen3_5_2b_decode_chunks" |
| |
| chunks = [ |
|     ct.models.MLModel(f"{root}/chunk_{x}.mlpackage") |
|     for x in ("a", "b", "c", "d") |
| ] |
| embed = np.memmap(f"{root}/embed_weight.bin", |
|                   dtype=np.float16, mode="r", |
|                   shape=(248320, 2048)) |
| tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B") |
| ``` |
| |
| Per decode step: |
| 1. Look up `embed[token_id]` → `hidden (1, 1, 2048)` fp16 |
| 2. Pass `hidden` + positional inputs (position, cos, sin) + state slice to `chunk_a.predict(...)`, take its `hidden_out` and updated states. |
| 3. Repeat for `chunk_b`, `chunk_c`, `chunk_d`. |
| 4. `chunk_d` emits `logits (1, 1, 248320)` fp16; argmax (or sample) it and feed back as `input_token` for the next step. |
| 5. Map `new_state_*` outputs to the next call's `state_*` inputs. |
| |
| Full reference Python loop: [`conversion/qwen35_2b_chunks_parity.py`](https://github.com/john-rocky/CoreML-LLM/blob/main/conversion/qwen35_2b_chunks_parity.py). |
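
For orientation before reading the reference script, the steps above can be sketched with toy stand-ins. Everything here is an assumption made for illustration: `FakeChunk` only mimics the `predict(dict) -> dict` shape of `MLModel.predict`, the feature names `hidden`, `position` and `state_0` are hypothetical, the cos/sin inputs are omitted, and the sizes are tiny. The real input and output names live inside each `.mlpackage`.

```python
import numpy as np

HIDDEN, VOCAB = 8, 32   # toy sizes; the real model uses 2048 and 248320


class FakeChunk:
    """Stand-in exposing predict(dict) -> dict, like coremltools' MLModel."""

    def __init__(self, seed, last=False):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((HIDDEN, HIDDEN)).astype(np.float32) * 0.1
        self.head = rng.standard_normal((HIDDEN, VOCAB)).astype(np.float32) if last else None

    def predict(self, inputs):
        h = inputs["hidden"].astype(np.float32)
        state = inputs["state_0"] + h            # toy recurrent state update
        h = np.tanh(h @ self.w + state)
        out = {"new_state_0": state}
        if self.head is None:
            out["hidden_out"] = h.astype(np.float16)
        else:
            out["logits"] = (h @ self.head).astype(np.float16)
        return out


def decode_step(chunks, embed, states, token_id, position):
    hidden = embed[token_id].reshape(1, 1, -1)                 # step 1
    out = {}
    for i, chunk in enumerate(chunks):                         # steps 2-3
        feed = {"hidden": hidden,
                "position": np.array([position], dtype=np.int32)}
        feed.update(states[i])
        out = chunk.predict(feed)
        # step 5: new_state_* outputs become the next call's state_* inputs
        states[i] = {k[len("new_"):]: v for k, v in out.items()
                     if k.startswith("new_state_")}
        hidden = out.get("hidden_out", hidden)
    return int(np.argmax(out["logits"]))                       # step 4 (greedy)


embed = np.random.default_rng(0).standard_normal((VOCAB, HIDDEN)).astype(np.float16)
chunks = [FakeChunk(s) for s in (1, 2, 3)] + [FakeChunk(4, last=True)]
states = [{"state_0": np.zeros((1, 1, HIDDEN), dtype=np.float32)} for _ in chunks]

token, generated = 5, []
for position in range(4):
    token = decode_step(chunks, embed, states, token, position)
    generated.append(token)
```

Swapping `FakeChunk` for the loaded `MLModel` objects would additionally require the actual feature names, the cos/sin tables, and correctly shaped initial states.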
| |
| ## iOS / Mac app |
| |
| [`Qwen35Generator.swift`](https://github.com/john-rocky/CoreML-LLM/blob/main/Examples/CoreMLLLMChat/CoreMLLLMChat/Qwen35Generator.swift) handles the chunk chaining + embed mmap. Tap **Qwen3.5 2B (ANE)** in the model picker. |
|
|
| ## Architecture |
|
|
| Hybrid Gated DeltaNet + GQA, 24 layers, interleaved `[L L L F] × 6`. |
|
|
| | | linear_attention | full_attention | |
| |---|---|---| |
| | count | 18 | 6 | |
| | state A | `(1, 6144, 4)` | `(1, 2, 2048, 256)` | |
| | state B | `(1, 16, 128, 128)` | `(1, 2, 2048, 256)` | |
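
The layer pattern and state shapes in the table can be cross-checked with a few lines of arithmetic. This sketch assumes the states are stored as fp16 (2 bytes per element), which the table does not state; under that assumption the combined state comes to roughly 34 MiB.

```python
from math import prod

# "L" = linear_attention (Gated DeltaNet), "F" = full_attention (GQA)
layers = ["L", "L", "L", "F"] * 6

# Per-layer state shapes copied from the table above.
STATE_SHAPES = {
    "L": [(1, 6144, 4), (1, 16, 128, 128)],
    "F": [(1, 2, 2048, 256), (1, 2, 2048, 256)],
}

num_states = sum(len(STATE_SHAPES[kind]) for kind in layers)
num_elements = sum(prod(shape) for kind in layers for shape in STATE_SHAPES[kind])

# Assumption: states are stored as fp16 (2 bytes per element).
state_mib = num_elements * 2 / 2**20
```

`num_states` matches the 48 state tensors mentioned in the Files section, and the six full-attention layers account for most of the elements.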
|
|
| Hidden=2048, vocab=248320, head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=2048. |
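
To make the partial-rotary numbers concrete, here is a minimal NumPy sketch of rotating only the first `rotary_dim` channels of a 256-wide head. The rotate-half channel layout is an assumption borrowed from common RoPE implementations; the conversion script may build its cos/sin inputs differently.

```python
import numpy as np

HEAD_DIM, PARTIAL, THETA = 256, 0.25, 1e7
ROTARY_DIM = int(HEAD_DIM * PARTIAL)   # 64


def rope_tables(position):
    """Return cos/sin tables of shape (ROTARY_DIM,) for one position."""
    inv_freq = 1.0 / (THETA ** (np.arange(0, ROTARY_DIM, 2) / ROTARY_DIM))
    angles = position * inv_freq
    angles = np.concatenate([angles, angles])   # rotate-half layout (assumed)
    return np.cos(angles), np.sin(angles)


def apply_partial_rope(x, position):
    """Rotate the first ROTARY_DIM channels of x (..., HEAD_DIM); pass the rest through."""
    cos, sin = rope_tables(position)
    rot, rest = x[..., :ROTARY_DIM], x[..., ROTARY_DIM:]
    half = ROTARY_DIM // 2
    rotated_half = np.concatenate([-rot[..., half:], rot[..., :half]], axis=-1)
    return np.concatenate([rot * cos + rotated_half * sin, rest], axis=-1)
```

The remaining 192 channels of each head carry no positional rotation, which is what `rotary partial=0.25` expresses.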
|
|
| ## Conversion |
|
|
| ```bash |
| python conversion/build_qwen35_2b_decode_chunks.py \ |
| --out-dir ./output \ |
| --max-seq 2048 --nbits 8 |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 (inherited from the base model). |
|
|