	---
	license: apache-2.0
	library_name: coreml
	base_model: Qwen/Qwen3.5-2B
	tags:
	- coreml
	- apple-silicon
	- ane
	- on-device
	- qwen3.5
	- text-generation
	pipeline_tag: text-generation
	---

	## Use it from Swift

	<!-- swift-usage-begin -->
	### Add the package

	`Package.swift`:

	```swift
	.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

	// In your target:
	.product(name: "CoreMLLLM", package: "CoreML-LLM"),
	```

	Platforms: iOS 18+ / macOS 15+.

	### Download + chat (one call)

	```swift
	import CoreMLLLM

	// First call pulls the bundle from this repo to Documents/Models/.
	// Subsequent calls reuse the on-disk copy.
	let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-2B-CoreML")

	let stream = try await llm.generate(
	    [CoreMLLLM.Message(role: .user, content: "Hello!")],
	    maxTokens: 256
	)
	for await chunk in stream {
	    print(chunk, terminator: "")
	}
	```

	Multi-turn: keep a `[CoreMLLLM.Message]` array, append the
	user/assistant turns, and pass the whole history to
	`generate(_:)` again. Call `llm.reset()` to start a new
	conversation (clears the KV cache).
	<!-- swift-usage-end -->



	# Qwen3.5-2B — Core ML (ANE chunked)

	Core ML port of [`Qwen/Qwen3.5-2B`](https://huggingface.co/Qwen/Qwen3.5-2B), split into 4 INT8 chunks + a raw fp16 embedding sidecar so every chunk fits the iPhone ANE single-mlprogram compile envelope.

	Measured on iPhone 17 Pro (A18): 17 tok/s decode, ~200 MB `phys_footprint`, 0 GB sustained Metal heap, and ~91% ANE op placement across all 4 body chunks. The first load spends about 15 min compiling the chunks for the ANE; later loads reuse the cached result.

	## Files

	```
	qwen3_5_2b_decode_chunks/
	├── chunk_a.mlpackage # 340 MB — embed + layers 0-5 + their states
	├── chunk_b.mlpackage # 340 MB — layers 6-11 + states
	├── chunk_c.mlpackage # 340 MB — layers 12-17 + states
	├── chunk_d.mlpackage # 850 MB — layers 18-23 + final_norm + lm_head
	└── embed_weight.bin # 1.02 GB — raw fp16 embed table (248320 × 2048)
	```

	All 5 pieces are required. For every token, the hidden state is chained from one chunk to the next, along with 48 state tensors (24 layers × 2 states each) carried inside the mlpackages.
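
Because a missing piece only shows up as a failure later in the pipeline, it can help to check the download first. The following is a minimal sketch, not part of the repo: the helper name `missing_pieces` is hypothetical, the file names come from the listing above, and the expected embed size assumes the file is a headerless fp16 table of shape 248320 × 2048.

```python
from pathlib import Path

# File names are taken from the listing above; the helper itself is hypothetical.
REQUIRED = [f"chunk_{x}.mlpackage" for x in "abcd"] + ["embed_weight.bin"]
EMBED_BYTES = 248320 * 2048 * 2  # rows x hidden x 2 bytes per fp16 value


def missing_pieces(root):
    """Return the required entries that are absent (or mis-sized) under root."""
    root = Path(root)
    problems = [name for name in REQUIRED if not (root / name).exists()]
    embed = root / "embed_weight.bin"
    # Assumption: the sidecar is a raw table with no header, so its size is exact.
    if embed.exists() and embed.stat().st_size != EMBED_BYTES:
        problems.append("embed_weight.bin (unexpected size)")
    return problems
```

Calling `missing_pieces(root)` on the `qwen3_5_2b_decode_chunks` directory should return an empty list when the snapshot is complete.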

	The embed is not an mlpackage on purpose: Swift `mmap`s the raw fp16 file so the 1 GB embed table stays in clean virtual pages and only the rows actually touched per prompt page in. Loading the embed as a Core ML weight would dequantize the entire table into the CPU heap and add ~1 GB to `phys_footprint`.
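
The same idea can be tried in Python with `numpy.memmap`. This is a small sketch on a toy table written to a temporary file, not the Swift implementation: the `lookup` helper and the toy sizes are assumptions made for illustration. Only the rows that are indexed get read from disk; the rest of the file stays unmapped into process memory.

```python
import os
import tempfile

import numpy as np

# Size of the real table described above: 248320 rows x 2048 fp16 values.
real_bytes = 248320 * 2048 * 2

# Toy table so the sketch runs anywhere (sizes are illustrative only).
rows, dim = 1000, 16
table = (np.arange(rows * dim) % 7).astype(np.float16).reshape(rows, dim)
path = os.path.join(tempfile.mkdtemp(), "toy_embed.bin")
table.tofile(path)  # raw values, no header, like embed_weight.bin

# Map the file read-only; no row is loaded until it is indexed.
embed = np.memmap(path, dtype=np.float16, mode="r", shape=(rows, dim))


def lookup(embed, token_id):
    """Copy one row out of the mapped table and shape it as (1, 1, hidden)."""
    return np.array(embed[token_id], dtype=np.float16)[None, None, :]


hidden = lookup(embed, 42)
print(hidden.shape, real_bytes)
```

With the real file, the `shape` argument would be `(248320, 2048)`, which matches the 1.02 GB size in the listing.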

	## What this repo does NOT ship

	- No `model_config.json` — Core ML serializes input/output shapes into each `.mlpackage` directly. `coremltools` loads it without external config.
	- No tokenizer — fetch from the base model:

	```python
	from transformers import AutoTokenizer
	tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
	```

	## Standalone usage (Python / Mac)

	```python
	import coremltools as ct
	import numpy as np
	from huggingface_hub import snapshot_download
	from transformers import AutoTokenizer

	local = snapshot_download("mlboydaisuke/qwen3.5-2B-CoreML")
	root = f"{local}/qwen3_5_2b_decode_chunks"

	chunks = [
	    ct.models.MLModel(f"{root}/chunk_{x}.mlpackage")
	    for x in ("a", "b", "c", "d")
	]
	embed = np.memmap(f"{root}/embed_weight.bin",
	                  dtype=np.float16, mode="r",
	                  shape=(248320, 2048))
	tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
	```

	Per decode step:
	1. Look up `embed[token_id]` → `hidden (1, 1, 2048)` fp16.
	2. Pass `hidden` + scalar inputs (position, cos, sin) + state slice to `chunk_a.predict(...)`, then take its `hidden_out` and updated states.
	3. Repeat for `chunk_b`, `chunk_c`, `chunk_d`.
	4. `chunk_d` emits `logits (1, 1, 248320)` fp16; take the argmax (or sample from it) and feed the result back as `input_token` for the next step.
	5. Map each `new_state_` output to the matching `state_` input of the next call.

	Full reference Python loop: [`conversion/qwen35_2b_chunks_parity.py`](https://github.com/john-rocky/CoreML-LLM/blob/main/conversion/qwen35_2b_chunks_parity.py).
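
The control flow of the steps above can be sketched with stand-in chunks, so it runs without the model files. Everything here is an assumption made for illustration: `StubChunk` replaces the loaded `MLModel` objects, the dictionary keys (`hidden`, `position`, `state_0`, `new_state_0`) are hypothetical, the rotary `cos`/`sin` inputs are left out, and the sizes are toy values. The linked parity script remains the reference for the real input and output names.

```python
import numpy as np

HIDDEN, VOCAB = 8, 32  # toy sizes; the real model uses 2048 and 248320


class StubChunk:
    """Stand-in for a loaded chunk. All dictionary keys are hypothetical."""

    def __init__(self, seed, last=False):
        rng = np.random.default_rng(seed)
        self.w = (0.1 * rng.standard_normal((HIDDEN, HIDDEN))).astype(np.float32)
        self.head = rng.standard_normal((HIDDEN, VOCAB)).astype(np.float32) if last else None

    def predict(self, inputs):
        hidden = inputs["hidden"].astype(np.float32)
        new_state = inputs["state_0"] + hidden        # toy recurrent update
        out = np.tanh(hidden @ self.w + new_state)
        result = {"new_state_0": new_state}
        if self.head is None:
            result["hidden_out"] = out
        else:
            result["logits"] = out @ self.head        # only the last chunk emits logits
        return result


def decode_step(token_id, position, embed, chunks, states):
    """Run one token through every chunk and return (next_token, states)."""
    hidden = np.asarray(embed[token_id])[None, None, :]            # step 1
    out = {}
    for i, chunk in enumerate(chunks):                             # steps 2-3
        out = chunk.predict({"hidden": hidden,
                             "position": np.array([position], dtype=np.int32),
                             **states[i]})
        # Step 5: rename new_state_* outputs into state_* inputs.
        states[i] = {k.replace("new_state_", "state_", 1): v
                     for k, v in out.items() if k.startswith("new_state_")}
        hidden = out.get("hidden_out", hidden)
    return int(np.argmax(out["logits"])), states                   # step 4


def greedy_decode(first_token, steps, embed, chunks):
    """Greedy loop: each predicted token becomes the next input token."""
    states = [{"state_0": np.zeros((1, 1, HIDDEN), np.float32)} for _ in chunks]
    tokens, token = [], first_token
    for position in range(steps):
        token, states = decode_step(token, position, embed, chunks, states)
        tokens.append(token)
    return tokens


embed = np.random.default_rng(0).standard_normal((VOCAB, HIDDEN)).astype(np.float16)
chunks = [StubChunk(seed, last=(seed == 3)) for seed in range(4)]
tokens = greedy_decode(first_token=1, steps=5, embed=embed, chunks=chunks)
print(tokens)
```

Swapping `np.argmax` for a sampling function changes only the last line of `decode_step`; the state hand-off stays the same.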

	## iOS / Mac app

	[`Qwen35Generator.swift`](https://github.com/john-rocky/CoreML-LLM/blob/main/Examples/CoreMLLLMChat/CoreMLLLMChat/Qwen35Generator.swift) handles the chunk chaining + embed mmap. Tap Qwen3.5 2B (ANE) in the model picker.

	## Architecture

	Hybrid Gated DeltaNet + GQA, 24 layers, interleaved `[L L L F] × 6`.

	| | linear_attention | full_attention |
	|---|---|---|
	| count | 18 | 6 |
	| state A | `(1, 6144, 4)` | `(1, 2, 2048, 256)` |
	| state B | `(1, 16, 128, 128)` | `(1, 2, 2048, 256)` |

	Hidden=2048, vocab=248320, head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=2048.
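
The figures above can be cross-checked with a few lines of arithmetic. This sketch derives the layer counts, the total number of state tensors, the rotary dimension, and a rough state memory estimate from the numbers on this card. The fp16 state dtype is an assumption, since the card does not state it, so the byte total is indicative only.

```python
from math import prod

pattern = ["linear", "linear", "linear", "full"] * 6    # [L L L F] x 6
state_shapes = {
    "linear": [(1, 6144, 4), (1, 16, 128, 128)],       # state A, state B
    "full": [(1, 2, 2048, 256), (1, 2, 2048, 256)],    # state A, state B
}
BYTES_PER_VALUE = 2  # assumption: states are stored as fp16

counts = {kind: pattern.count(kind) for kind in state_shapes}
num_states = sum(len(state_shapes[kind]) for kind in pattern)
state_bytes = sum(prod(shape) * BYTES_PER_VALUE
                  for kind in pattern for shape in state_shapes[kind])
rotary_dim = int(256 * 0.25)                            # head_dim x rotary partial

# Six consecutive layers per chunk, matching the file listing (0-5, 6-11, ...).
chunk_layers = [list(range(start, start + 6)) for start in range(0, 24, 6)]

print(counts, num_states, rotary_dim, round(state_bytes / 1e6, 1), "MB")
```

Under the fp16 assumption the recurrent and KV states add up to roughly 35.5 MB, most of it from the six full-attention layers at `max_seq=2048`.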

	## Conversion

	```bash
	python conversion/build_qwen35_2b_decode_chunks.py \
	--out-dir ./output \
	--max-seq 2048 --nbits 8
	```

	## License

	Apache 2.0 (inherited from the base model).