	---
	license: gemma
	base_model: google/gemma-4-E4B-it-qat-q4_0-unquantized
	tags:
	- coreai
	- aimodel
	- apple-silicon
	- on-device
	- gemma-4
	- qat
	- gpu-pipelined
	pipeline_tag: text-generation
	---

	# Gemma 4 E4B (text) — Apple Core AI (`.aimodel`)

	Gemma 4 E4B's text decoder, converted to Apple's Core AI (the Core ML successor announced
	at WWDC26) and running on iOS 27 / macOS 27 via Apple's `coreai-pipelined` GPU engine. **Zero
	custom kernels; the greedy oracle is 8/8 exact vs the fp32 Hugging Face reference on both the
	Mac GPU and the iPhone GPU (the iPhone is 24/24 token-identical to the Mac on the determinism probe).**

	Converted directly from Google's official QAT release
	[google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized):
	bf16 weights trained for q4_0 rounding. q4_0 is also this bundle's quantization class
	(per-block-32 absmax linear int4). Google publishes these checkpoints as "preserving similar
	quality to bfloat16", so this int4 conversion inherits that quality claim by design rather
	than through post-hoc gating.
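
To make the quantization class concrete, here is a minimal pure-Python sketch of a per-block-32 absmax linear int4 round trip. It assumes a symmetric signed code range of [-8, 7] with `scale = absmax / 7`; the exact rounding and offset convention of the q4_0 format, and of this bundle's converter, is not stated in this card, so `quantize_block` and `dequantize_block` are hypothetical helpers rather than the converter's code.

```python
BLOCK = 32  # block size named in the card ("per-block-32")


def quantize_block(values):
    """Quantize one block of floats to signed int4 codes plus one scale.

    Assumption: symmetric absmax scaling into [-8, 7] with scale = absmax / 7.
    """
    assert len(values) == BLOCK
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0] * BLOCK, 0.0  # all-zero block: nothing to scale
    scale = absmax / 7.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Linear dequantization: value = code * scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    block = [(i - 16) / 16.0 for i in range(BLOCK)]  # toy weights in [-1, 1)
    codes, scale = quantize_block(block)
    restored = dequantize_block(codes, scale)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.4f} worst_abs_error={worst:.4f}")
```

Under this convention the round-trip error of any weight is at most half a quantization step (`scale / 2`), which is the rounding noise a QAT checkpoint is trained to tolerate.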

	> Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, engine patch stack:
	> [coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo) —
	> model card: [`zoo/gemma4-e4b.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-e4b.md).

	<!-- gen-cards:use-it begin id=gemma-4-e4b (managed by scripts/gen-cards — edit cards.json / QuickStart.swift, not this block) -->
	## Use it

	▶️ Run it (source) — the [ChatDemo runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ChatDemo)
	(GUI + CLI, one app for every chat model in the catalog):

	```bash
	git clone https://github.com/john-rocky/coreai-kit
	open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
	# → Run, then pick "Gemma 4 E4B" in the model picker

	# agents / headless (macOS):
	cd coreai-kit/Examples/ChatDemo
	swift run chat-cli --model gemma-4-e4b --prompt "What can you do, offline?"
	```

	💻 Build with it — complete; the glue is kit API, copy-paste runs:

	```swift
	import CoreAIKit

	let prompt = "What can you do, offline?"  // example prompt; substitute your own
	let chat = try await ChatSession(catalog: "gemma-4-e4b")
	let reply = try await chat.respond(to: prompt)
	// reply: the answer, generated fully on-device
	```

	The take-home is [`Examples/ChatDemo/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ChatDemo/Sources/QuickStart.swift)
	— this exact code as one typed function, no UI; the CLI is an argument shell over it, and
	the GUI drives the same `ChatSession` across turns for its transcript.
	Multi-turn? Hold the `ChatSession` and call `respond(to:)` per turn — it keeps the
	conversation history; `streamResponse(to:)` yields tokens as they decode.

	Integration checklist

	- SPM: `https://github.com/john-rocky/coreai-kit` → product CoreAIKit
	- Info.plist: none needed
	- Entitlements: none needed (macOS)
	- First run downloads the model — 7.6 GB (Mac) — then it loads from the
	local cache (Application Support; progress via the `downloadProgress` callback)
	- Measure in Release — Debug is ~3× slower on per-token host work
	<!-- gen-cards:use-it end -->

	## Measured (greedy; M4 Max / iPhone 17 Pro, settled device)

	| config | files | size | M4 Max decode / prefill | iPhone decode / prefill |
	|---|---|---|---|---|
	| ★ provider (runs BOTH platforms) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin/` + `ios-frontend/gemma4_e4b_qat_gather_raw/` | 3.7 + 3.4 GB | 53.2 / 62.6 | 15.1 / 21.3 |
	| ★ provider, iPhone-ready AOT | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/` (precompiled `.aimodelc`, h18p = iPhone 17 Pro class only) + the same tables | 3.7 + 3.4 GB | n/a | same as above; skip the AOT step |
	| tbl (Mac-fastest) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/` + the two `embed_per_layer.` table files | 3.7 + 2.7 GB | 55.8* / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |

	On iPhone the working set stays tiny — measured peak footprint 2.2 GB (4.2 GB headroom):
	the PLE table rides as a clean mmap and the AOT executable pages are evictable. Both phases
	land exactly on the bandwidth model (~2.1 GB int4/token).
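
The arithmetic behind these figures can be checked in a few lines of Python. This is a sketch under assumptions: it treats the table's decode figures as tokens per second and the ~2.1 GB as weight bytes read per generated token. `implied_bandwidth_gb_s` and `entitled_limit_gb` are illustrative helpers, not part of any tool in the zoo.

```python
GB_PER_TOKEN = 2.1  # from the card: ~2.1 GB of int4 weights read per token


def implied_bandwidth_gb_s(tokens_per_second, gb_per_token=GB_PER_TOKEN):
    """Effective memory bandwidth implied by a bandwidth-bound token rate."""
    return tokens_per_second * gb_per_token


def entitled_limit_gb(peak_footprint_gb, headroom_gb):
    """Memory limit implied by a measured peak footprint plus its headroom."""
    return peak_footprint_gb + headroom_gb


if __name__ == "__main__":
    # Assumption: the table's decode figures are tokens per second.
    for device, decode in (("M4 Max", 53.2), ("iPhone 17 Pro", 15.1)):
        print(f"{device}: ~{implied_bandwidth_gb_s(decode):.1f} GB/s effective")
    # 2.2 GB peak + 4.2 GB headroom gives the ~6.4 GB limit cited in the table.
    print(f"iPhone limit: ~{entitled_limit_gb(2.2, 4.2):.1f} GB")
```

The same limit explains the table's "not viable" row: a 3.7 GB graph plus 2.7 GB of owned (non-evictable) tables already consumes the whole budget before any KV cache or activations.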

	## What E4B is (config + checkpoint verified)

	Clean dense model — no MoE. 42 layers (full attention every 6th), hidden 2560,
	intermediate 10240 uniform, 8 query heads / 2 KV heads, dual head_dim 256/512, 18
	KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded
	KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in
	`ios-frontend/gemma4_e4b_qat_gather_raw/`), final-logit softcap 30. The QAT checkpoint prunes
	the never-used KV projections on the shared layers — the zoo's loader handles both layouts.
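
The figures above are internally consistent, which a short Python check makes visible. It is a sketch under assumptions: the `E4BTextConfig` dataclass and its field names are hypothetical (they are not the Hugging Face config keys), the per-layer width of 256 is taken from the `ple_tokens [1,1,42,256]` shape in the run contract, and placing full attention at every sixth layer counting from 1 is only one reading of "every 6th".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class E4BTextConfig:
    """Numbers quoted in the card; field names are illustrative only."""
    num_layers: int = 42
    full_attention_every: int = 6
    kv_shared_layers: int = 18
    vocab_size: int = 262144
    ple_dim_per_layer: int = 256
    ple_table_width: int = 10752

    def full_attention_layers(self):
        # Assumption: 1-based layers 6, 12, ..., 42 use full attention.
        return [i for i in range(1, self.num_layers + 1)
                if i % self.full_attention_every == 0]

    def non_shared_layers(self):
        # Layers that own their KV cache (stacked into one padded KV pair).
        return self.num_layers - self.kv_shared_layers

    def ple_table_bytes(self):
        # int8 table: one byte per element.
        return self.vocab_size * self.ple_table_width


if __name__ == "__main__":
    cfg = E4BTextConfig()
    # The table width is exactly one 256-wide slice per layer.
    assert cfg.num_layers * cfg.ple_dim_per_layer == cfg.ple_table_width
    print(len(cfg.full_attention_layers()), "full-attention layers")
    print(cfg.non_shared_layers(), "non-shared KV layers")
    print(f"{cfg.ple_table_bytes() / 2**30:.3f} GiB int8 PLE table")
```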

	## Run contract (each item is load-bearing)

	Full story + traps:
	[pipelined-engine page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).

	1. Swift stack = `apple/coreai-models` + the zoo's patch stack
	([`apps/*.patch`](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps), in
	order). The ★ provider bundle needs `EngineOptions.perTokenInputProvider`
	(`coreai-pipelined-per-token-inputs.patch`); the tbl bundle needs
	`EngineOptions.staticInputBuffers` (`coreai-pipelined-static-inputs.patch`).
	2. Provider mode: per token, fill `ple_tokens [1,1,42,256]` fp16 from the table dump —
	`row = i8[id] * scale[id] * sqrt(256)`, mmap-gathered (~0.1 ms). tbl mode: bind
	`ple_table` ← `embed_per_layer.i8` and `ple_scale` ← `embed_per_layer.scale.f32` as
	OWNED `storageModeShared` MTLBuffers (buffer-backing traps in the knowledge page).
	3. Set `COREAI_CHUNK_THRESHOLD=1` before engine creation; never call `engine.warmup()`
	(S=1 graph; a 1-token generate after load is the warmup).
	4. iPhone: AOT is mandatory (the 3.7 GB-constants graph crashes the on-device
	specializer) — use the precompiled `_aotc_h18p/` bundle, or
	`xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu
	--architecture h18p --expect-frequent-reshapes` and point `metadata.json`'s
	`assets.main` at the `.aimodelc`. Ship the
	`com.apple.developer.kernel.increased-memory-limit` entitlement as headroom insurance,
	and bench a settled device (a just-unlocked iPhone under-reads ~35%).
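
Step 2's provider-mode gather can be sketched in Python on a toy table. This is a minimal illustration, assuming a row-major signed int8 table of shape `[vocab, 42 * 256]` and one float scale per token id. The helper name `gather_ple_row` and the in-memory layout are assumptions; the real dump format may differ, and the engine-side provider lives in the Swift stack.

```python
import math

LAYERS, DIM = 42, 256  # ple_tokens shape is [1, 1, 42, 256]
ROW = LAYERS * DIM     # 10752 int8 values per token id


def gather_ple_row(table, scales, token_id):
    """Return the dequantized per-layer embedding for one token.

    table  : bytes-like, row-major signed int8, ROW values per token id
    scales : sequence of per-token float scales
    Implements row = i8[id] * scale[id] * sqrt(256) from the run contract.
    """
    raw = memoryview(table).cast("b")           # view the bytes as signed int8
    start = token_id * ROW
    factor = scales[token_id] * math.sqrt(DIM)  # sqrt(256) == 16
    flat = [raw[start + j] * factor for j in range(ROW)]
    # Reshape the flat row to [LAYERS][DIM].
    return [flat[l * DIM:(l + 1) * DIM] for l in range(LAYERS)]


if __name__ == "__main__":
    vocab = 3  # toy vocabulary; the real table has 262144 rows
    table = bytes((tok + j) % 256 for tok in range(vocab) for j in range(ROW))
    scales = [0.5, 0.25, 0.125]
    row = gather_ple_row(table, scales, token_id=1)
    print(len(row), len(row[0]), row[0][:3])
```

Because `memoryview` also accepts an `mmap.mmap` object, the same function would read only the pages of one row from a memory-mapped dump. The cast to fp16 that the engine input requires is omitted here.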

	Reproduce from scratch (oracle + tables are checkpoint-derived — regenerate for any new
	weights): [`conversion/export_gemma4_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_gemma4_decode_pipelined.py)
	with `--hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized`.

	## License

	Gemma is provided under and subject to the Gemma Terms of Use
	(https://ai.google.dev/gemma/terms). These `.aimodel` bundles are Model Derivatives of
	[google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized);
	by downloading or using them you agree to those terms, including the
	[Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).

	Sibling repo (E2B, incl. its own official-QAT bundles):
	[gemma-4-E2B-CoreAI](https://huggingface.co/mlboydaisuke/gemma-4-E2B-CoreAI).