Use it from Swift

Add the package

Package.swift:

.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),

Platforms: iOS 18+ / macOS 15+.

Download + chat (one call)

import CoreMLLLM

let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3-vl-2b-stateful-coreml")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }

With an image

import CoreGraphics

let cgImage: CGImage = ...   // your CGImage

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user,
                       content: "What's in this image?")],
    image: cgImage,
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }

Qwen3-VL 2B: Core ML stateful (Phase 1)

Core ML port of Qwen/Qwen3-VL-2B-Instruct for the Apple Neural Engine (ANE) on iPhone, iPad, and Mac. Text + vision, INT8, 2.3 GB on disk.

iPhone 17 Pro (A18 ANE):

                                                Phase 1 (this repo)   v1.4.0 recurrent
decode tok/s                                    22–24                 ~10
prefill tok/s (text)                            ~260 effective        batched T=8
phys_footprint                                  256–264 MB            ~1.7 GB
vision TTFT (first turn, ~200 prompt tokens)    ~2.7 s                ~5.5 s

The 6× memory drop vs the recurrent path is the headline: the KV cache lives inside the ANE via MLState + slice_update, so there is no silent GPU spill.

Files

qwen3_vl_2b_stateful_chunks/
├── chunk_0.mlpackage           ← multifunction: infer (T=1) + prefill_b8 (T=8)
├── chunk_1.mlpackage             same
├── chunk_2.mlpackage             same
├── chunk_3.mlpackage             same
├── chunk_0_vision.mlpackage    ← chunk_0 + DeepStack injection (multifunction)
├── chunk_head.mlpackage        ← final_norm + lm_head + in-graph argmax
└── embed_weight.bin            ← raw fp16 embed table (151936 × 2048), Swift mmaps it

qwen3_vl_2b_vision/
└── vision.mlpackage            ← 448×448 → 196 tokens + 3 DeepStack taps

Each body chunk carries two Core ML functions sharing one ct.StateType named kv_cache_0 (shape (14, 8, 2048, 128) fp16):

  • infer: T=1 decode
  • prefill_b8: T=8 batched prefill

Swift creates MLState once from the prefill model instance and reuses it across both functions: Core ML binds state by name+shape, not by MLModel instance (per the ANEMLL Qwen3-1.7B recipe).
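
As a rough sanity check on the memory figures, the size of the KV state follows directly from the shape given above. The sketch below assumes that each of the 4 body chunks holds its own kv_cache_0 state, and that the axes read as (7 layers × K/V, KV heads, max positions, head_dim). Neither reading is spelled out in this card, so treat the result as an estimate.

```python
import numpy as np

# Shape and dtype of the kv_cache_0 state, as stated in this card.
# ASSUMPTION: axes are (7 layers x K/V, kv heads, max positions, head_dim).
KV_SHAPE = (14, 8, 2048, 128)
BYTES_PER_ELEM = np.dtype(np.float16).itemsize  # 2 bytes for fp16

# ASSUMPTION: every body chunk owns one state of this shape.
NUM_BODY_CHUNKS = 4

per_chunk_bytes = int(np.prod(KV_SHAPE, dtype=np.int64)) * BYTES_PER_ELEM
total_bytes = per_chunk_bytes * NUM_BODY_CHUNKS

print(f"per chunk:  {per_chunk_bytes / 2**20:.0f} MiB")   # per chunk:  56 MiB
print(f"all chunks: {total_bytes / 2**20:.0f} MiB")       # all chunks: 224 MiB
```

Under those assumptions the KV states come to about 224 MiB, which is in the same ballpark as the 256–264 MB phys_footprint reported above.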

What this repo does NOT ship

  • No model_config.json. Core ML packs shapes into each .mlpackage, and coremltools reads them directly.
  • No tokenizer / processor. Pull them from the upstream model:
from transformers import AutoTokenizer, AutoProcessor
tok  = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
proc = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

Vision preprocessing: mean=std=0.5 (not CLIP defaults); pixel_values shape is (3, 2, 448, 448) (pre-patchified). See conversion/build_qwen3_vl_2b_vision.py.
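
The normalization described above can be sketched in NumPy. This is a minimal illustration, not the code in conversion/build_qwen3_vl_2b_vision.py: the function name preprocess_image is hypothetical, the caller is assumed to have already resized the image to 448×448 RGB, and the size-2 axis is assumed to be a temporal pair filled by duplicating the still image.

```python
import numpy as np

def preprocess_image(rgb_uint8: np.ndarray) -> np.ndarray:
    """Map a (448, 448, 3) uint8 RGB array to a (3, 2, 448, 448) float16 tensor."""
    if rgb_uint8.shape != (448, 448, 3) or rgb_uint8.dtype != np.uint8:
        raise ValueError("expected a (448, 448, 3) uint8 RGB array")

    x = rgb_uint8.astype(np.float32) / 255.0   # scale to [0, 1]
    x = (x - 0.5) / 0.5                        # mean = std = 0.5 -> range [-1, 1]
    x = np.transpose(x, (2, 0, 1))             # HWC -> CHW, shape (3, 448, 448)

    # ASSUMPTION: axis 1 is a temporal pair and a still image is repeated twice.
    x = np.stack([x, x], axis=1)               # shape (3, 2, 448, 448)
    return x.astype(np.float16)
```

With CLIP-style per-channel means and standard deviations the value range would differ, which is why the card calls out mean=std=0.5 explicitly.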

Standalone usage (Python / Mac)

import coremltools as ct, numpy as np
from huggingface_hub import snapshot_download

local = snapshot_download("mlboydaisuke/qwen3-vl-2b-stateful-coreml")
root  = f"{local}/qwen3_vl_2b_stateful_chunks"

# Multifunction load: pick the function you need per step.
prefill_chunks = [ct.models.MLModel(
    f"{root}/chunk_{i}.mlpackage", function_name="prefill_b8"
) for i in range(4)]
decode_chunks  = [ct.models.MLModel(
    f"{root}/chunk_{i}.mlpackage", function_name="infer"
) for i in range(4)]
head           = ct.models.MLModel(f"{root}/chunk_head.mlpackage")
embed          = np.memmap(f"{root}/embed_weight.bin",
                            dtype=np.float16, mode="r",
                            shape=(151936, 2048))

# State is created once and shared across infer + prefill_b8.
state = prefill_chunks[0].make_state()
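
The remaining host-side work is the embedding lookup and deciding which function each prompt token goes through. A minimal sketch follows; embed_tokens and plan_prefill are hypothetical helper names. It assumes that full windows of 8 tokens go through prefill_b8 and any leftover tokens go through infer one at a time (the actual runtime may pad the last window instead). The Core ML input names are not listed in this card, so the predict calls themselves are left out.

```python
import numpy as np

PREFILL_T = 8  # window size of the prefill_b8 function

def embed_tokens(embed: np.ndarray, token_ids) -> np.ndarray:
    """Look up rows of the (vocab, hidden) fp16 table; works on an np.memmap too."""
    ids = np.asarray(token_ids, dtype=np.int64)
    return np.asarray(embed[ids], dtype=np.float16)   # shape (len(ids), hidden)

def plan_prefill(num_tokens: int, t: int = PREFILL_T):
    """Split prompt positions into full T-token windows plus a tail.

    ASSUMPTION: the tail is decoded token by token with the T=1 function.
    Returns (windows, tail) where windows is a list of (start, end) pairs.
    """
    full = num_tokens // t
    windows = [(i * t, (i + 1) * t) for i in range(full)]
    tail = list(range(full * t, num_tokens))
    return windows, tail

# A 19-token prompt gives two prefill windows and three single-token steps.
windows, tail = plan_prefill(19)
print(windows, tail)   # [(0, 8), (8, 16)] [16, 17, 18]
```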

Vision (image prompt only):

vision = ct.models.MLModel(
    f"{local}/qwen3_vl_2b_vision/vision.mlpackage")
out    = vision.predict({"pixel_values": img_3x2x448x448_fp16})
# out: {"hidden": (1, 196, 2048), "deepstack_5/11/17": (1, 196, 2048)}

iOS / Mac app

Swift runtime: Qwen3VL2BGenerator.swift. Tap Qwen3-VL 2B (stateful, Phase 1) in the picker.

Architecture

Standard 28-layer GQA text backbone: hidden=2048, num_heads=16, num_kv_heads=8, head_dim=128, vocab=151936, tie_word_embeddings=True, rope_theta=5e6, mRoPE section=[24,20,20] interleaved. 4 body chunks of 7 layers each. For text-only the mRoPE collapses to standard 1D RoPE.
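
The claim that mRoPE collapses to 1D RoPE for text-only input can be checked numerically. The sketch below uses head_dim, rope_theta and the section sizes from the paragraph above. The exact interleaving of frequency slots across the t/h/w axes is an assumption (h and w take every third slot at the start, t takes the rest), and the function names are illustrative. The collapse itself holds for any slot assignment, because every axis carries the same position for text tokens.

```python
import numpy as np

HEAD_DIM = 128
ROPE_THETA = 5e6
MROPE_SECTION = (24, 20, 20)   # frequency slots for (t, h, w); sums to HEAD_DIM // 2

def inv_freq(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Standard RoPE inverse frequencies, one per pair of channels."""
    return 1.0 / theta ** (np.arange(0, head_dim, 2, dtype=np.float64) / head_dim)

def axis_of_slot(section=MROPE_SECTION):
    """Axis (0=t, 1=h, 2=w) that drives each frequency slot.

    ASSUMPTION: interleaved layout where h and w take every third slot.
    """
    axis = np.zeros(sum(section), dtype=np.int64)      # default: temporal axis
    for dim, offset in ((1, 1), (2, 2)):
        axis[offset : section[dim] * 3 : 3] = dim
    return axis

def mrope_angles(pos_thw):
    """pos_thw: (3, seq) positions -> (seq, HEAD_DIM // 2) rotation angles."""
    pos = np.asarray(pos_thw, dtype=np.float64)
    # pos[axis] has shape (slots, seq): row i is the position stream for slot i.
    return (pos[axis_of_slot()] * inv_freq()[:, None]).T

def rope_angles(pos):
    """Plain 1D RoPE angles, shape (seq, HEAD_DIM // 2)."""
    return np.outer(np.asarray(pos, dtype=np.float64), inv_freq())

# Text-only: t, h and w positions are identical, so both schemes agree.
positions = np.arange(6)
text_only = np.stack([positions] * 3)
print(np.allclose(mrope_angles(text_only), rope_angles(positions)))  # True
```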

Vision: fixed 448×448, 196 tokens after spatial_merge=2, DeepStack taps at vision layers 5/11/17 injected into text layers 0/1/2.
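
The token count and the DeepStack placement can be worked through with a few lines of arithmetic. The patch size is not stated in this card, so the sketch derives it from the stated numbers, assuming square patches and a 2×2 merge. It also maps each DeepStack target layer to its body chunk, assuming layers are assigned to chunks in order, 7 per chunk.

```python
import math

IMAGE_SIDE = 448
SPATIAL_MERGE = 2
NUM_VISION_TOKENS = 196
LAYERS_PER_CHUNK = 7
DEEPSTACK_TAPS = {5: 0, 11: 1, 17: 2}   # vision layer -> text layer

# Work backwards from 196 tokens to the patch grid.
tokens_per_side = math.isqrt(NUM_VISION_TOKENS)        # 14
patches_per_side = tokens_per_side * SPATIAL_MERGE     # 28
patch_size = IMAGE_SIDE // patches_per_side            # 16 (derived, not stated)

# Map each tapped vision layer to the body chunk that owns its target text layer.
# ASSUMPTION: chunk i holds text layers 7*i .. 7*i + 6.
chunk_for_tap = {v: t // LAYERS_PER_CHUNK for v, t in DEEPSTACK_TAPS.items()}

print(patch_size, patches_per_side ** 2, chunk_for_tap)
# 16 784 {5: 0, 11: 0, 17: 0}
```

All three injection points land in text layers 0 to 2, which sit in the first body chunk. That is consistent with chunk_0 being the only chunk with a separate _vision variant in the file listing.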

Not included

  • KV state reuse across chat turns (Phase 2: replies after the first turn should be near-instant)
  • App-launch prewarm

License

Apache 2.0 (inherits from the base model).
