	---
	license: gemma
	base_model: google/gemma-4-E2B-it
	tags:
	- coreai
	- aimodel
	- apple-silicon
	- ane
	- on-device
	- gemma-4
	- custom-metal-kernels
	- gpu-pipelined
	pipeline_tag: text-generation
	---

	# Gemma 4 E2B (text) — Apple Core AI (`.aimodel`)

	Gemma 4 E2B's text decoder converted to Apple's Core AI (the Core ML successor announced at
	WWDC26), ready to run on iOS 27 / macOS 27 — **greedy 8/8 exact vs the Hugging Face reference on
	the iPhone GPU, the iPhone Neural Engine, and the Mac GPU.** The GPU bundles embed custom fused
	int8/int4 Metal kernels inside the `.aimodel` (a Core AI feature); the ANE bundles are
	kernel-free and numerically hardened for fp16 NPU execution.

	This repo publishes one set per platform × compute unit. Each set is the best verified
	configuration, and each file is the exact artifact behind the published numbers (nothing
	experimental). It also ships the **`gpu-pipelined/` fast path: ONE kernel-free graph that is
	the fastest decode on BOTH Mac and iPhone** (Apple's `coreai-pipelined` engine + the zoo's
	engine patch stack).

	> Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, Swift runner:
	> [coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo).

	<!-- gen-cards:use-it begin id=gemma-4-e2b (managed by scripts/gen-cards — edit cards.json / QuickStart.swift, not this block) -->
	## Use it

	▶️ Run it (source) — the [ChatDemo runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ChatDemo)
	(GUI + CLI, one app for every chat model in the catalog):

	```bash
	git clone https://github.com/john-rocky/coreai-kit
	open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
	# → Run, then pick "Gemma 4 E2B" in the model picker

	# agents / headless (macOS):
	cd coreai-kit/Examples/ChatDemo
	swift run chat-cli --model gemma-4-e2b --prompt "What can you do, offline?"
	```

	💻 Build with it — complete; the glue is kit API, copy-paste runs:

	```swift
	import CoreAIKit

	// Example prompt; replace with your own input.
	let prompt = "What can you do, offline?"
	let chat = try await ChatSession(catalog: "gemma-4-e2b")
	let reply = try await chat.respond(to: prompt)
	// reply: the answer, generated fully on-device
	```

	The take-home is [`Examples/ChatDemo/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ChatDemo/Sources/QuickStart.swift)
	— this exact code as one typed function, no UI; the CLI is an argument shell over it, and
	the GUI drives the same `ChatSession` across turns for its transcript.
	Multi-turn? Hold the `ChatSession` and call `respond(to:)` per turn — it keeps the
	conversation history; `streamResponse(to:)` yields tokens as they decode.

	Integration checklist

	- SPM: `https://github.com/john-rocky/coreai-kit` → product CoreAIKit
	- Info.plist: none needed
	- Entitlements: none needed (macOS)
	- First run downloads the model — 4.9 GB (Mac) — then it loads from the
	local cache (Application Support; progress via the `downloadProgress` callback)
	- Measure in Release — Debug is ~3× slower on per-token host work
	<!-- gen-cards:use-it end -->

	## Pick your platform (measured: iPhone 17 Pro / M4 Max, greedy, 8/8 exact vs HF)

	| Category | Files | Size | Decode |
	|---|---|---|---|
	| iOS GPU | `ios-frontend/gemma4_gather_raw/` + `ios-gpu/gemma4_e2b_metal_int4km_L35.aimodel` + `ios-gpu/gemma4_e2b_head_argmax_int4km.aimodel` | 2.6 + 1.3 + 0.2 GB | 22 tok/s |
	| iOS ANE | `ios-frontend/gemma4_gather_raw/` + `ios-ane/gemma4_e2b_hostcache_chunk{1..6}_int8.aimodel` + `ios-ane/gemma4_e2b_head_argmax_int8.aimodel` (+ `gemma4_chunks_plan.json`) | 2.6 + 1.8 + 0.4 GB | 6 tok/s |
	| macOS GPU | `macos/gemma4_e2b_frontend_int8.aimodel` + `macos/gemma4_e2b_metal_int8v3_L35.aimodel` + `macos/gemma4_e2b_head_argmax_kernel.aimodel` | 2.6 + 2.0 + 0.4 GB | 56.6–59.0 tok/s (release build) |
	| ★ GPU pipelined (Mac + iOS) | `gpu-pipelined/gemma4_e2b_decode_int4lin_tbl/` + `ios-frontend/gemma4_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32}` | 2.0 + 2.4 GB | 77.0 tok/s (M4 Max) · 30.3 tok/s (iPhone 17 Pro, AOT) |
	| ★ GPU pipelined, iPhone-ready AOT | `gpu-pipelined/gemma4_e2b_decode_int4lin_tbl_aotc_h18p/` (precompiled `.aimodelc`, h18p = iPhone 17 Pro class only) + the same two `gemma4_gather_raw` table files | 2.0 + 2.4 GB | same as above on iPhone; skip the AOT step |
	| ★★ GPU pipelined, official-QAT int4 | `gpu-pipelined/gemma4_e2b_qat_decode_int4lin_tbl/` (+ `…_tbl_aotc_h18p/` precompiled) + `ios-frontend/gemma4_qat_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32}` (QAT bundles need the QAT tables) | 2.0 + 2.4 GB | 78.9 tok/s (M4 Max) · 30.7 tok/s (iPhone); same speed, int4 ≈ bf16 by design (see below) |
	| ★★★ VISION (VL): image+text → text | `gpu-pipelined/gemma4_e2b_qat_vl_decode_int4linsym_tbl/` (Mac) or `…_vl_decode_int4linsym/` + `…_aotc_h18p/` (iPhone, provider+AOT) + `gpu-pipelined/gemma4_e2b_qat_vl_vision/` + the QAT tables | 2.0 + 0.3 + 2.4 GB | 82.4 tok/s (M4 Max) · 25.5 tok/s (iPhone); the text decoder + a 3-line image splice |

	(`ios-frontend/` is shared by both iPhone categories — download it once.)

	Architecture is a 3-stage flow (Gemma 4's giant embedding/PLE tables stay out of the graph):
	`frontend gather (mmap / int8 gather) → 35-layer decode core → 262k-vocab head(+argmax)`.

	- iOS GPU core = int4 k-means fused-kernel monolith (16-entry codebook staged in threadgroup
	memory, packed nibble loads); the head does the 262,144-vocab matvec and argmax in-kernel
	(returns (value,index) partials — no logits readback).
	- iOS ANE chunks = 6 fixed-shape chunks (the 35-layer monolith overflows the first-run ANE
	compile) with the two fp16 hardening fixes baked in: RMSNorm via the `LayerNorm([x,−x])`
	identity (fp32-accumulating LN kernel) and `Conv2d 1×1` projections (fp32 conv-engine MACs).
	- macOS core = int8 k-means fused-kernel monolith (uint32-packed index loads).
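
The `LayerNorm([x,−x])` identity named in the ANE bullet can be checked numerically. The sketch below is plain Python rather than the bundle's graph, and it assumes an unweighted RMSNorm and a shared epsilon (the learned scale is left out). Concatenating `x` with `-x` makes the mean zero and the variance equal to `mean(x²)`, so the first half of the LayerNorm output equals RMSNorm of `x`.

```python
import math

def rms_norm(x, eps=1e-6):
    """Reference RMSNorm without a learned weight: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def layer_norm(x, eps=1e-6):
    """Plain LayerNorm without affine parameters (biased variance)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm_via_layer_norm(x, eps=1e-6):
    """LayerNorm over [x, -x]: mean is 0 and variance is mean(x^2),
    so the first len(x) outputs are RMSNorm(x)."""
    doubled = list(x) + [-v for v in x]
    return layer_norm(doubled, eps)[: len(x)]

x = [0.5, -1.25, 3.0, 0.0, 2.5]
direct = rms_norm(x)
via_ln = rms_norm_via_layer_norm(x)
max_diff = max(abs(a - b) for a, b in zip(direct, via_ln))
```

The cost is a doubled channel count. The benefit claimed in the bullet is that the LayerNorm kernel accumulates in fp32, which this float64 sketch does not model.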


	## ★★★ Vision (Gemma 4 E2B VL) — image+text → text

	This is the same QAT checkpoint's vision path, riding the text decoder via the zoo's
	static-inputs patch. The image span is causal on E2B (verified vs the fp32 HF
	mask dump), so positions, masks, and KV need nothing new:

	- `gpu-pipelined/gemma4_e2b_qat_vl_vision/`: fixed-grid vision encoder, run once
	per image, mapping `patches [2304,768] f16 → image_embeds [256,1536]` (a square 768×768
	image gives 48×48 = 2304 patches, which become 256 soft tokens; ~100–170 ms).
	- Decoder: Mac = `gemma4_e2b_qat_vl_decode_int4linsym_tbl/` (tables in-graph,
	95.2 prefill / 82.4 decode tok/s on M4 Max). iPhone =
	`gemma4_e2b_qat_vl_decode_int4linsym{,_aotc_h18p}/` (provider mode — the tbl
	gather overflows an iOS per-encode scratch heap on this beta; **41.2 / 25.5
	tok/s** on iPhone 17 Pro, footprint 1.96 GB) + the
	`ios-frontend/gemma4_qat_gather_raw/` tables.
	- Host contract: rewrite the prompt's 256 `<image_soft_token>` ids to extension
	ids `V + slot`, bind `image_embeds [280,1536]` as a static buffer (a square image fills
	rows 0..255); provider mode maps extension ids → the PLE pad row. Quantization
	is plain absmax int4 (`--lin-sym`), i.e. the QAT-q4_0 grid; clipping compounds
	errors at long contexts.
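
The id rewrite in the host contract can be sketched as follows. This is a minimal illustration, not the zoo's code: `V` is taken to be the 262,144-entry vocabulary size mentioned in this card, `IMAGE_SOFT_TOKEN_ID` is a hypothetical placeholder (read the real id from the tokenizer), and `splice_image_ids` is an invented helper name.

```python
VOCAB_SIZE = 262_144      # V: the 262,144-entry vocabulary from this card
NUM_IMAGE_SLOTS = 256     # soft tokens produced for one square image
IMAGE_SOFT_TOKEN_ID = 99  # HYPOTHETICAL placeholder; take the real id from the tokenizer

def splice_image_ids(prompt_ids, soft_token_id=IMAGE_SOFT_TOKEN_ID,
                     vocab_size=VOCAB_SIZE, num_slots=NUM_IMAGE_SLOTS):
    """Rewrite each image soft-token id to the extension id V + slot.

    Slots count up from 0 in prompt order, so slot k selects row k of the
    bound image_embeds buffer. Returns the new ids and the slots used.
    """
    out, slot = [], 0
    for tok in prompt_ids:
        if tok == soft_token_id:
            if slot >= num_slots:
                raise ValueError("more image soft tokens than image_embeds rows")
            out.append(vocab_size + slot)
            slot += 1
        else:
            out.append(tok)
    return out, slot

# Toy prompt: two text ids, one image span, two more text ids.
prompt_ids = [2, 10] + [IMAGE_SOFT_TOKEN_ID] * 256 + [11, 12]
spliced, used = splice_image_ids(prompt_ids)
```

Text ids pass through unchanged, so the sequence length and positions are the same before and after the rewrite.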

	Numerics: the Mac engine matches the Python gate 24/24 token-for-token and is margin-ruled
	exact vs the fp32 HF oracle (a flip is allowed only where the oracle's top-2 gap < 0.1). Details +
	conversion script: [`zoo/gemma4-vl.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-vl.md).
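
One way to read the margin rule is sketched below. It is an illustration under assumptions, not the zoo's gate: the function names are invented, and the oracle logits are assumed to be computed step by step along the engine's own token sequence, so that a single allowed flip does not desynchronize later steps.

```python
def top2_gap(logits):
    """Gap between the largest and second-largest logit."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second

def margin_ruled_exact(engine_tokens, oracle_logits, margin=0.1):
    """True if each engine token equals the oracle argmax, or differs only
    at steps where the oracle's top-2 gap is below `margin`."""
    for tok, logits in zip(engine_tokens, oracle_logits):
        oracle_tok = max(range(len(logits)), key=logits.__getitem__)
        if tok != oracle_tok and top2_gap(logits) >= margin:
            return False
    return True

# Three steps over a toy 3-token vocabulary; step 2 is a near tie (gap 0.05).
oracle_logits = [[0.1, 2.0, 0.3], [1.00, 1.05, 0.2], [3.0, 0.0, 0.1]]
accepted = margin_ruled_exact([1, 0, 0], oracle_logits)  # flip at the near tie
rejected = margin_ruled_exact([1, 0, 1], oracle_logits)  # flip at a confident step
```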

	## Run it

	Python (macOS 27): load with `coreai.runtime.AIModel` on the GPU delegate
	(`SpecializationOptions.from_preferred_compute_unit_kind(ComputeUnitKind.gpu())`), drive
	`frontend → core → head` per token. Swift/device: push the set into your app sandbox
	(`xcrun devicectl device copy to --domain-type appDataContainer`). Walkthroughs + the burned-in
	gotchas: [knowledge base](https://github.com/john-rocky/coreai-model-zoo/tree/main/knowledge) ·
	[Swift runtime notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/swift-runtime.md).
	Tokenizer: use the original [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)
	tokenizer files.
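
The per-token `frontend → core → head` drive has roughly the shape below. This sketch deliberately calls no Core AI API: `frontend`, `core`, and `head` are hypothetical callables standing in for the three loaded models, `state` stands in for the KV cache, and the toy stages at the bottom exist only so the loop runs. A non-empty prompt is assumed.

```python
def greedy_decode(prompt_ids, frontend, core, head, max_new_tokens, eos_id=None):
    """Run frontend -> core -> head once per token, feeding each argmax back in."""
    if max_new_tokens <= 0:
        return []
    state = None      # stands in for the KV cache
    next_id = None
    for pos, tok in enumerate(prompt_ids):  # prefill, one token per step
        hidden, state = core(frontend(tok), pos, state)
        next_id = head(hidden)              # head returns the argmax token id
    generated = []
    pos = len(prompt_ids)
    while True:
        generated.append(next_id)
        if next_id == eos_id or len(generated) >= max_new_tokens:
            break
        hidden, state = core(frontend(next_id), pos, state)
        next_id = head(hidden)
        pos += 1
    return generated

# Toy stages (illustrative only): the "embedding" is the id itself, the core
# adds one and counts steps in its state, and the head reduces modulo 5.
toy_frontend = lambda tok: tok
toy_core = lambda emb, pos, state: (emb + 1, (state or 0) + 1)
toy_head = lambda hidden: hidden % 5

tokens = greedy_decode([1, 2], toy_frontend, toy_core, toy_head, max_new_tokens=4)
stopped = greedy_decode([1, 2], toy_frontend, toy_core, toy_head,
                        max_new_tokens=10, eos_id=0)
```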

	Two device gotchas (measured on the beta, 2026-06-10):
	1. Verify each multi-GB copy completed (`xcrun devicectl device info files …`) before the
	app's first load — loading a partially-copied `.aimodel` poisons the on-device specialization
	cache for that content hash (later loads fail `ENOENT` even after the copy finishes).
	2. Optional AOT: `xcrun coreai-build compile <m>.aimodel --platform iOS --preferred-compute gpu
	--architecture h18p` → a `.aimodelc` that skips the on-device compile (first load ~4× faster,
	decode tok/s identical to the plain `.aimodel`). The arch name follows the device identifier, not the
	marketing name: iPhone 17 Pro = `iPhone18,1` → `h18p` (an `h17p` build fails to load with
	`invalidCompiledModel`).

	⚠️ Known beta issue affecting all Core AI LLMs (these bundles use the host-cache form that dodges
	it): [the KV-write bug page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/coreai-beta-mpsgraph-kvwrite-bug.md)
	(FB23024751 / [apple/coreai-models#5](https://github.com/apple/coreai-models/issues/5)).

	## ★ GPU-pipelined fast path (zero custom kernels) — `gpu-pipelined/`

	One decode-only S=1 LanguageBundle (`input_ids [1,1]` static, dynamic position/KV, embed +
	soft-capped head in-graph, and the 2.3 GB per-layer-embedding table as a STATIC graph input
	gathered in-graph by token id) rides Apple's `coreai-pipelined` engine: async non-blocking
	encode, on-GPU argmax, on-device KV growth. Measured (greedy; oracle 8/8, iPhone 24/24
	token-identical to Mac-GPU): M4 Max 77.0 decode / 87.1 prefill · iPhone 17 Pro 30.3 / 38.9
	— vs this repo's kernel monoliths (Mac 56.6–59, iPhone 22) with no Metal kernels at all.
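
The in-graph gather by token id, and the size of the table it reads, can be sketched in plain Python. Two things here are assumptions rather than facts from this card: a per-layer embedding width of 256 (chosen because it reproduces the stated table size) and one f32 scale per token row (suggested by the file names `embed_per_layer.i8` and `embed_per_layer.scale.f32`, but not confirmed).

```python
VOCAB = 262_144   # from this card
LAYERS = 35       # from this card
PLE_WIDTH = 256   # ASSUMPTION: per-layer embedding width, not stated in this card

def table_bytes(vocab=VOCAB, layers=LAYERS, width=PLE_WIDTH):
    """Bytes in an int8 table holding layers * width values per token."""
    return vocab * layers * width

def gather_row(table, scales, token_id):
    """Gather one int8 row by token id and dequantize it with its row scale.
    ASSUMPTION: one scale per row; the real layout may differ."""
    return [q * scales[token_id] for q in table[token_id]]

# Tiny stand-in table: 3 tokens, 4 int8 values each, one scale per row.
toy_table = [[-128, 0, 64, 127], [1, 2, 3, 4], [-1, -2, -3, -4]]
toy_scales = [0.5, 0.25, 2.0]
row = gather_row(toy_table, toy_scales, 1)

size_gb = table_bytes() / 1e9  # decimal gigabytes
```

Under these assumptions the table comes to about 2.35 GB, in line with the 2.3 to 2.4 GB figures quoted in this card.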

	Run contract (each item is load-bearing — full story + traps in the zoo's
	[pipelined-engine page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)):

	1. Swift stack = `apple/coreai-models` + the zoo's 4-patch stack
	([`apps/*.patch`](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps), applied in
	order) — this bundle needs the `EngineOptions.staticInputBuffers` hook from
	`coreai-pipelined-static-inputs.patch`.
	2. Bind the two table files (download from `ios-frontend/gemma4_gather_raw/`) as static inputs:
	`ple_table` ← `embed_per_layer.i8`, `ple_scale` ← `embed_per_layer.scale.f32` — as **OWNED
	`storageModeShared` MTLBuffers** (read the file in once). A `PROT_READ`-only mmap under
	`makeBuffer(bytesNoCopy:)` silently costs ~65 ms/GB per encode on macOS; a writable COW
	mmap is fine on the Mac but pays a residency tax on iPhone.
	3. `COREAI_CHUNK_THRESHOLD=1` before engine creation (prefill = pipelined S=1 steps);
	never call `engine.warmup()` (it warms shape 256; the S=1 graph rejects it) — a 1-token
	generate after load is the warmup.
	4. iPhone: run AOT first with `xcrun coreai-build compile <bundle>.aimodel --platform iOS
	--preferred-compute gpu --architecture h18p --expect-frequent-reshapes`, then point
	`metadata.json`'s `assets.main` at the `.aimodelc` (on this beta the plain bundle passes
	on-device specialization but the spec'd artifact asserts at first execute). Alternatively,
	download the precompiled `gpu-pipelined/gemma4_e2b_decode_int4lin_tbl_aotc_h18p/` (iPhone 17 Pro
	class). Ship the `com.apple.developer.kernel.increased-memory-limit` entitlement (the owned
	2.35 GB table; measured peak footprint 4.4 GB vs a ~6.4 GB entitled limit) and bench a settled
	device (a just-unlocked iPhone under-reads ~35%).
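
A rough worked estimate shows why item 2 matters. It uses only numbers from this card (2.35 GB table, ~65 ms/GB per encode on macOS, 77.0 tok/s on M4 Max) and assumes, for illustration, that the penalty is paid once per decoded token and simply adds to the per-token time. Real behavior may differ.

```python
TABLE_GB = 2.35           # owned PLE table size, from this card
MMAP_COST_MS_PER_GB = 65  # read-only mmap penalty per encode on macOS, from this card
BASE_TOK_S = 77.0         # M4 Max decode rate, from this card

def decode_rate_with_penalty(base_tok_s, table_gb, cost_ms_per_gb):
    """Tokens/s if every token pays one extra per-encode penalty.
    ASSUMPTION: the penalty is additive and applies once per token."""
    base_ms = 1000.0 / base_tok_s
    penalty_ms = table_gb * cost_ms_per_gb
    return 1000.0 / (base_ms + penalty_ms)

penalty_ms = TABLE_GB * MMAP_COST_MS_PER_GB
slow_rate = decode_rate_with_penalty(BASE_TOK_S, TABLE_GB, MMAP_COST_MS_PER_GB)
```

Under that assumption the penalty is about 153 ms per token against a base step of about 13 ms, which would cut decode to roughly 6 tok/s.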

	In-app: the zoo's [CoreAIChat](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps/CoreAIChat)
	ships this config as the Gemma ⚡ engine mode (GPU/ANE/⚡ segment) — it downloads the
	`_aotc_h18p` bundle plus the two table files and binds them as owned static buffers.
	Chat-surface on a settled iPhone 17 Pro: decode 32.7 / prefill 44.2 tok/s on a 200-token
	turn (vs 22 for the kernel-monolith GPU mode). First in-container load pays a one-time ~2 GB
	spec-cache ingest (~11 s engine load, ~6 s warm) and can invalidate sibling models' cached
	specializations once — the app's `GEMMA_CLEAR_SPEC_CACHE=1` hook recovers.

	The per-token-provider variant (PLE rows filled per step by a host callback —
	iPhone 26.5 decode / 40.5 prefill, no entitlement, clean mmap) is the lighter alternative;
	reproduce it from the same conversion script
	([`conversion/export_gemma4_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_gemma4_decode_pipelined.py),
	drop `--tbl`).

	### ★★ Official QAT weights — int4 quality guaranteed by design

	`gpu-pipelined/gemma4_e2b_qat_decode_int4lin_tbl/` (+ the `_aotc_h18p/` precompile) is the
	same graph re-exported from Google's official QAT release
	[google/gemma-4-E2B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-unquantized):
	bf16 weights trained for q4_0 rounding, and q4_0 is this bundle's quantization
	(per-block-32 absmax-class linear int4). Google publishes these checkpoints as "preserving
	similar quality to bfloat16", explicitly for custom downstream compilation — so the int4
	claim here upgrades from "PTQ that gates 8/8" to int4 ≈ bf16 by design. Measured: same
	speed as the PTQ bundle (M4 Max 78.9 decode / 89.6 prefill; iPhone 17 Pro 30.7 / 36.7
	settled; oracle 8/8 on Python, engine, and device).
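
To make "per-block-32 absmax-class linear int4" concrete, here is a simplified symmetric sketch. It is not the exporter's code and not a bit-exact q4_0 implementation: the scale is assumed to be `absmax / 7` with codes clamped to `[-8, 7]`, and the function names are invented for illustration.

```python
def quantize_block_absmax_int4(block):
    """Symmetric absmax int4 for one block: scale = absmax / 7, codes in [-8, 7]."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return 0.0, [0] * len(block)
    scale = absmax / 7.0
    codes = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Map int4 codes back to floats with the block's scale."""
    return [scale * c for c in codes]

def quantize_row(row, block_size=32):
    """Quantize a row as independent blocks of `block_size` values."""
    return [quantize_block_absmax_int4(row[i:i + block_size])
            for i in range(0, len(row), block_size)]

row = [(i - 31.5) / 31.5 for i in range(64)]  # 64 values spanning [-1, 1]
blocks = quantize_row(row)
restored = [v for scale, codes in blocks for v in dequantize_block(scale, codes)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Each block carries its own scale, so one outlier only coarsens the 32 values that share its block. With this scheme the round-trip error is bounded by half a quantization step, `scale / 2`.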

	⚠️ Pair QAT bundles with the QAT tables: bind
	`ios-frontend/gemma4_qat_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32}` —
	the PLE table is checkpoint-derived, so the original `gemma4_gather_raw/` files do NOT
	match these weights. Everything else (patch stack, chunk threshold, entitlement, AOT)
	is identical to the PTQ run contract above. Gemma 4 E4B (the bigger sibling, also
	from official QAT weights) lives in its own repo:
	[gemma-4-E4B-CoreAI](https://huggingface.co/mlboydaisuke/gemma-4-E4B-CoreAI).

	## Parity

	The three base sets (iOS GPU, iOS ANE, macOS GPU) reproduce the HF eager greedy reference 8/8
	top-1 exact ("What is the capital of France?" → "The capital of France is Paris."), verified
	at conversion time on macOS and re-verified end-to-end on device per compute unit.

	## License

	Gemma is provided under and subject to the Gemma Terms of Use
	(https://ai.google.dev/gemma/terms). These `.aimodel` bundles are Model Derivatives of
	[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it); by downloading or using
	them you agree to those terms, including the
	[Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).

	Core ML (iOS 18+) variants: [gemma-4-E2B-coreml](https://huggingface.co/mlboydaisuke/gemma-4-E2B-coreml) ·
	[gemma-4-E2B-stateful-coreml](https://huggingface.co/mlboydaisuke/gemma-4-E2B-stateful-coreml).