Use it from Swift
Add the package
Package.swift:
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
Platforms: iOS 18+ / macOS 15+.
Download + chat (one call)
import CoreMLLLM
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3-vl-2b-stateful-coreml")
let stream = try await llm.generate(
[CoreMLLLM.Message(role: .user, content: "Hello!")],
maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }
With an image
import CoreGraphics
let cgImage: CGImage = ... // your CGImage
let stream = try await llm.generate(
[CoreMLLLM.Message(role: .user,
content: "What's in this image?")],
image: cgImage,
maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }
Qwen3-VL 2B: Core ML stateful (Phase 1)
Core ML port of Qwen/Qwen3-VL-2B-Instruct for the Apple Neural Engine (ANE) on iPhone, iPad, and Mac. Text + vision, INT8, 2.3 GB on disk.
iPhone 17 Pro (A18 ANE):
| | Phase 1 (this repo) | v1.4.0 recurrent |
|---|---|---|
| decode tok/s | 22–24 | ~10 |
| prefill tok/s (text) | ~260 effective | batched T=8 |
| phys_footprint | 256–264 MB | ~1.7 GB |
| vision TTFT (first turn, ~200 prompt tokens) | ~2.7 s | ~5.5 s |
The 6× memory drop vs the recurrent path is the headline: the KV cache lives inside the ANE via MLState + slice_update, so there is no silent GPU spill.
Files
qwen3_vl_2b_stateful_chunks/
├── chunk_0.mlpackage          multifunction: infer (T=1) + prefill_b8 (T=8)
├── chunk_1.mlpackage          same
├── chunk_2.mlpackage          same
├── chunk_3.mlpackage          same
├── chunk_0_vision.mlpackage   chunk_0 + DeepStack injection (multifunction)
├── chunk_head.mlpackage       final_norm + lm_head + in-graph argmax
└── embed_weight.bin           raw fp16 embed table (151936 × 2048), Swift mmaps it
qwen3_vl_2b_vision/
└── vision.mlpackage           448×448 → 196 tokens + 3 DeepStack taps
Each body chunk carries two Core ML functions sharing one ct.StateType named kv_cache_0 (shape (14, 8, 2048, 128) fp16):
- infer: T=1 decode
- prefill_b8: T=8 batched prefill
Swift creates MLState once from the prefill model instance and re-uses it across both functions. Core ML binds state by name+shape, not by MLModel instance (per the ANEMLL Qwen3-1.7B recipe).
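As a rough sanity check on the state size, the kv_cache_0 shape can be multiplied out. This is a minimal sketch, assuming the leading 14 is 7 layers × 2 (keys and values) and that 2048 is the maximum context length; neither reading is stated explicitly here, and kv_state_bytes is a hypothetical helper, not part of this repo.

```python
from math import prod

# Shape of the kv_cache_0 state as listed above; fp16 = 2 bytes per element.
KV_SHAPE = (14, 8, 2048, 128)
BYTES_PER_FP16 = 2
NUM_BODY_CHUNKS = 4  # chunk_0 .. chunk_3


def kv_state_bytes(shape=KV_SHAPE, bytes_per_elem=BYTES_PER_FP16):
    """Bytes occupied by one chunk's KV state (hypothetical helper)."""
    return prod(shape) * bytes_per_elem


per_chunk_mib = kv_state_bytes() / 2**20
total_mib = per_chunk_mib * NUM_BODY_CHUNKS
print(f"per chunk: {per_chunk_mib:.0f} MiB, all chunks: {total_mib:.0f} MiB")
# prints "per chunk: 56 MiB, all chunks: 224 MiB"
```

Under these assumptions the four states add up to roughly 224 MiB, which is the same order of magnitude as the phys_footprint reported for Phase 1. What makes up the rest of the footprint is not broken down here.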
What this repo does NOT ship
- No model_config.json: Core ML packs shapes into each .mlpackage, and coremltools reads them directly.
- No tokenizer / processor. Pull them from the upstream model:
from transformers import AutoTokenizer, AutoProcessor
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
proc = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
Vision preprocessing: mean=std=0.5 (not CLIP defaults); pixel_values shape is (3, 2, 448, 448) (pre-patchified). See conversion/build_qwen3_vl_2b_vision.py.
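The normalization and target shape can be sketched in NumPy. This is illustrative only: it assumes the image is already resized to 448×448 RGB and that the size-2 axis holds two copies of the same still frame. The conversion script named above remains the reference for the exact layout, and preprocess is a hypothetical helper.

```python
import numpy as np


def preprocess(rgb_u8):
    """Hypothetical helper: (448, 448, 3) uint8 RGB -> (3, 2, 448, 448) fp16."""
    if rgb_u8.shape != (448, 448, 3):
        raise ValueError("expected a 448x448 RGB image; resize beforehand")
    x = rgb_u8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - 0.5) / 0.5                    # mean = std = 0.5 -> [-1, 1]
    x = np.transpose(x, (2, 0, 1))         # HWC -> CHW
    # Assumption: the size-2 axis is the same frame repeated twice.
    x = np.repeat(x[:, np.newaxis, :, :], 2, axis=1)
    return x.astype(np.float16)


pixel_values = preprocess(np.zeros((448, 448, 3), dtype=np.uint8))
print(pixel_values.shape, pixel_values.dtype)
```

With mean=std=0.5, a black pixel maps to -1 and a white pixel to +1, unlike the CLIP defaults, which use per-channel statistics.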
Standalone usage (Python / Mac)
import coremltools as ct, numpy as np
from huggingface_hub import snapshot_download
local = snapshot_download("mlboydaisuke/qwen3-vl-2b-stateful-coreml")
root = f"{local}/qwen3_vl_2b_stateful_chunks"
# Multifunction load: pick the function you need per step.
prefill_chunks = [ct.models.MLModel(
f"{root}/chunk_{i}.mlpackage", function_name="prefill_b8"
) for i in range(4)]
decode_chunks = [ct.models.MLModel(
f"{root}/chunk_{i}.mlpackage", function_name="infer"
) for i in range(4)]
head = ct.models.MLModel(f"{root}/chunk_head.mlpackage")
embed = np.memmap(f"{root}/embed_weight.bin",
dtype=np.float16, mode="r",
shape=(151936, 2048))
# State is created once and shared across infer + prefill_b8.
state = prefill_chunks[0].make_state()
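Since prefill_b8 consumes 8 tokens per call and infer consumes 1, a prompt has to be split between the two functions. The sketch below shows one possible schedule, assuming leftover tokens that do not fill a batch of 8 are fed through infer one at a time. The actual runtime may pad the last batch instead, and plan_steps is a hypothetical helper that only plans the calls without running any model.

```python
PREFILL_BATCH = 8  # T of the prefill_b8 function


def plan_steps(num_prompt_tokens, batch=PREFILL_BATCH):
    """Hypothetical scheduler: returns (function_name, start, length) tuples."""
    steps = []
    pos = 0
    # Full batches of 8 go through the batched prefill function.
    while num_prompt_tokens - pos >= batch:
        steps.append(("prefill_b8", pos, batch))
        pos += batch
    # Assumption: the remainder is processed token by token with infer.
    while pos < num_prompt_tokens:
        steps.append(("infer", pos, 1))
        pos += 1
    return steps


for step in plan_steps(19):
    print(step)
```

For a 19-token prompt this yields two prefill_b8 calls (positions 0 and 8) followed by three infer calls (positions 16 to 18), all writing into the same shared state.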
Vision (image prompt only):
vision = ct.models.MLModel(
f"{local}/qwen3_vl_2b_vision/vision.mlpackage")
out = vision.predict({"pixel_values": img_3x2x448x448_fp16})
# out: {"hidden": (1, 196, 2048), "deepstack_5/11/17": (1, 196, 2048)}
iOS / Mac app
Swift runtime: Qwen3VL2BGenerator.swift. Tap Qwen3-VL 2B (stateful, Phase 1) in the picker.
Architecture
Standard 28-layer GQA text backbone: hidden=2048, num_heads=16, num_kv_heads=8, head_dim=128, vocab=151936, tie_word_embeddings=True, rope_theta=5e6, mRoPE section=[24,20,20] interleaved. 4 body chunks of 7 layers each. For text-only the mRoPE collapses to standard 1D RoPE.
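The numbers above are mutually consistent, which a few lines of arithmetic can confirm. This sketch assumes each chunk holds a contiguous run of layers in order (chunk_0 gets layers 0 to 6, and so on); the constant names are illustrative.

```python
NUM_LAYERS, NUM_CHUNKS = 28, 4
HIDDEN, NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 2048, 16, 8, 128
MROPE_SECTION = [24, 20, 20]

layers_per_chunk = NUM_LAYERS // NUM_CHUNKS  # 7
# Assumption: chunks hold contiguous layer ranges in order.
chunk_layers = {
    f"chunk_{i}": list(range(i * layers_per_chunk, (i + 1) * layers_per_chunk))
    for i in range(NUM_CHUNKS)
}

queries_per_kv_head = NUM_HEADS // NUM_KV_HEADS  # GQA group size
attn_width = NUM_HEADS * HEAD_DIM                # equals hidden size here
rotary_pairs = sum(MROPE_SECTION)                # equals HEAD_DIM // 2

print(chunk_layers["chunk_1"])
print(queries_per_kv_head, attn_width == HIDDEN, rotary_pairs == HEAD_DIM // 2)
```

Each KV head is shared by two query heads, and the three mRoPE sections together cover all 64 rotary frequency pairs of a 128-wide head.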
Vision: fixed 448×448, 196 tokens after spatial_merge=2, DeepStack taps at vision layers 5/11/17 injected into text layers 0/1/2.
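The 196-token figure can be reproduced from the image size and merge factor. The patch size is not stated above; the sketch assumes 16, which is the value that makes 448×448 with spatial_merge=2 come out to 196. vision_tokens and DEEPSTACK_MAP are hypothetical names for illustration.

```python
IMAGE_SIDE = 448
SPATIAL_MERGE = 2
PATCH_SIZE = 16  # assumption: inferred from 448x448 -> 196 tokens


def vision_tokens(side=IMAGE_SIDE, patch=PATCH_SIZE, merge=SPATIAL_MERGE):
    """Tokens emitted for a square image after spatial merging."""
    patches_per_side = side // patch               # 28 patches per side
    merged_per_side = patches_per_side // merge    # 14 after 2x2 merging
    return merged_per_side ** 2


# Vision layer index -> text layer index receiving the DeepStack features.
DEEPSTACK_MAP = {5: 0, 11: 1, 17: 2}

print(vision_tokens())
for vision_layer, text_layer in DEEPSTACK_MAP.items():
    print(f"vision layer {vision_layer} -> text layer {text_layer}")
```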
Not included
- KV state reuse across chat turns (Phase 2: replies after the first turn should be near-instant)
- App-launch prewarm
License
Apache 2.0 (inherits from the base model).