---
license: apache-2.0
library_name: coreml
base_model: Qwen/Qwen3.5-2B
tags:
- coreml
- apple-silicon
- ane
- on-device
- qwen3.5
- text-generation
pipeline_tag: text-generation
---

## Use it from Swift

### Add the package

`Package.swift`:

```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```

Platforms: iOS 18+ / macOS 15+.

### Download + chat (one call)

```swift
import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-2B-CoreML")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }
```

Multi-turn: keep an `[CoreMLLLM.Message]` array, append the user/assistant turns, and pass the whole history to `generate(_:)` again. Call `llm.reset()` to start a new conversation (clears the KV cache).

# Qwen3.5-2B: Core ML (ANE chunked)

Core ML port of [`Qwen/Qwen3.5-2B`](https://huggingface.co/Qwen/Qwen3.5-2B), split into 4 INT8 chunks + a raw fp16 embedding sidecar so every chunk fits the iPhone ANE single-mlprogram compile envelope.

**iPhone 17 Pro (A18) measured:** 17 tok/s decode, ~200 MB `phys_footprint`, 0 GB sustained Metal heap, ~91 % ANE op placement across all 4 body chunks. First-load ANE compile ≈ 15 min across chunks (cached after).

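
To make the "raw fp16 embedding sidecar" idea concrete, here is a minimal stdlib-only sketch of reading one row out of a memory-mapped fp16 table. It uses a tiny temporary file as a stand-in for `embed_weight.bin`. The helper `read_row`, the toy dimensions, and the little-endian row-major layout are assumptions made for illustration; they are not taken from the package's actual loader. The table dimensions (248320 × 2048) are the ones this README lists for the embedding file.

```python
import mmap
import os
import struct
import tempfile

VOCAB, HIDDEN = 248320, 2048  # embed table shape as listed in this README
BYTES_PER_FP16 = 2

# Full table size in bytes: 248320 * 2048 * 2 = 1_017_118_720 (~1.02 GB).
FULL_TABLE_BYTES = VOCAB * HIDDEN * BYTES_PER_FP16


def read_row(buf, token_id, hidden):
    """Return one row of a row-major, little-endian fp16 table (assumed layout)."""
    row_bytes = hidden * BYTES_PER_FP16
    start = token_id * row_bytes
    # Slicing an mmap only touches the pages that back this byte range.
    return list(struct.unpack(f"<{hidden}e", buf[start:start + row_bytes]))


# Toy stand-in for the real file: 8 rows x 4 columns, row r filled with float(r).
toy_vocab, toy_hidden = 8, 4
with tempfile.NamedTemporaryFile(delete=False) as f:
    for r in range(toy_vocab):
        f.write(struct.pack(f"<{toy_hidden}e", *([float(r)] * toy_hidden)))
    toy_path = f.name

with open(toy_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as table:
        toy_row = read_row(table, token_id=5, hidden=toy_hidden)

os.remove(toy_path)
print(FULL_TABLE_BYTES, toy_row)
```

Because only the requested byte range is read, the resident cost of a lookup scales with the rows actually touched rather than with the full table.
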
## Files

```
qwen3_5_2b_decode_chunks/
├── chunk_a.mlpackage    # 340 MB: embed + layers 0-5 + their states
├── chunk_b.mlpackage    # 340 MB: layers 6-11 + states
├── chunk_c.mlpackage    # 340 MB: layers 12-17 + states
├── chunk_d.mlpackage    # 850 MB: layers 18-23 + final_norm + lm_head
└── embed_weight.bin     # 1.02 GB: raw fp16 embed table (248320 × 2048)
```

**All 5 pieces are required.** They chain hidden→hidden across chunks per token, plus 48 state tensors (24 layers × 2 states each) carried inside the mlpackages.

The embed is **not** an mlpackage on purpose: Swift `mmap`s the raw fp16 file so the 1 GB embed table stays in clean virtual pages and only the rows actually touched per prompt page in. Loading the embed as a Core ML weight would dequantize the entire table into the CPU heap and add ~1 GB to `phys_footprint`.

## What this repo does NOT ship

- **No `model_config.json`**: Core ML serializes input/output shapes into each `.mlpackage` directly. `coremltools` loads it without external config.
- **No tokenizer**: fetch from the base model:

  ```python
  from transformers import AutoTokenizer
  tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
  ```

## Standalone usage (Python / Mac)

```python
import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download("mlboydaisuke/qwen3.5-2B-CoreML")
root = f"{local}/qwen3_5_2b_decode_chunks"

chunks = [
    ct.models.MLModel(f"{root}/chunk_{x}.mlpackage")
    for x in ("a", "b", "c", "d")
]
embed = np.memmap(f"{root}/embed_weight.bin", dtype=np.float16,
                  mode="r", shape=(248320, 2048))
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
```

Per decode step:

1. Look up `embed[token_id]` → `hidden (1, 1, 2048)` fp16
2. Pass `hidden` + scalar inputs (position, cos, sin) + state slice to `chunk_a.predict(...)`, take its `hidden_out` and updated states.
3. Repeat for `chunk_b`, `chunk_c`, `chunk_d`.
4. `chunk_d` emits `logits (1, 1, 248320)` fp16; argmax (or sample) it and feed back as `input_token` for the next step.
5. Map `new_state_*` outputs to the next call's `state_*` inputs.

Full reference Python loop: [`conversion/qwen35_2b_chunks_parity.py`](https://github.com/john-rocky/CoreML-LLM/blob/main/conversion/qwen35_2b_chunks_parity.py).

## iOS / Mac app

[`Qwen35Generator.swift`](https://github.com/john-rocky/CoreML-LLM/blob/main/Examples/CoreMLLLMChat/CoreMLLLMChat/Qwen35Generator.swift) handles the chunk chaining + embed mmap. Tap **Qwen3.5 2B (ANE)** in the model picker.

## Architecture

Hybrid Gated DeltaNet + GQA, 24 layers, interleaved `[L L L F] × 6`.

| | linear_attention | full_attention |
|---|---|---|
| count | 18 | 6 |
| state A | `(1, 6144, 4)` | `(1, 2, 2048, 256)` |
| state B | `(1, 16, 128, 128)` | `(1, 2, 2048, 256)` |

Hidden=2048, vocab=248320, head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=2048.

## Conversion

```bash
python conversion/build_qwen35_2b_decode_chunks.py \
  --out-dir ./output \
  --max-seq 2048 --nbits 8
```

## License

Apache 2.0 (inherits from the base model).
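
## Appendix: state threading sketch

The per-decode-step procedure under "Standalone usage" hinges on two handoffs: `hidden_out` of one chunk feeds the next chunk, and each `new_state_*` output becomes the matching `state_*` input on the following step. The sketch below shows that bookkeeping with plain Python stubs in place of the Core ML chunks. The stub `predict` functions, the `hidden` input key, and the `state_{layer}_{a|b}` naming are hypothetical and chosen only for illustration; the real input and output names are defined by the mlpackages and the reference parity script.

```python
# Layer layout from the Architecture section: [L L L F] repeated 6 times.
LAYERS = ["linear", "linear", "linear", "full"] * 6
CHUNK_LAYERS = 6  # chunk_a..chunk_d each cover 6 consecutive layers


def make_stub_chunk(first_layer):
    """Build a stand-in for one chunk's predict(); not the real Core ML model."""
    layer_ids = range(first_layer, first_layer + CHUNK_LAYERS)

    def predict(inputs):
        out = {"hidden_out": inputs["hidden"] + 1}  # pretend to transform hidden
        for i in layer_ids:
            for s in ("a", "b"):
                # Each stub state just counts the decode steps it has seen.
                out[f"new_state_{i}_{s}"] = inputs[f"state_{i}_{s}"] + 1
        return out

    return predict


def decode_step(chunks, hidden, states):
    """Chain hidden through every chunk; return (hidden, states for next step)."""
    next_states = dict(states)
    for predict in chunks:
        out = predict({"hidden": hidden, **states})
        hidden = out["hidden_out"]
        for name, value in out.items():
            if name.startswith("new_state_"):
                # new_state_X output becomes the state_X input of the next step.
                next_states["state_" + name[len("new_state_"):]] = value
    return hidden, next_states


chunks = [make_stub_chunk(k * CHUNK_LAYERS) for k in range(4)]
states = {f"state_{i}_{s}": 0 for i in range(len(LAYERS)) for s in ("a", "b")}

hidden = 0
for _ in range(3):  # three decode steps, each starting from a fresh "embedding"
    hidden, states = decode_step(chunks, 0, states)

print(len(states), LAYERS.count("linear"), LAYERS.count("full"), hidden)
```

With the real models, `predict` would be each chunk's `MLModel.predict`, and the initial `hidden` would be the embedding row for the current token rather than `0`.
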