---
license: gemma
base_model: google/gemma-4-E4B-it-qat-q4_0-unquantized
tags:
- coreai
- aimodel
- apple-silicon
- on-device
- gemma-4
- qat
- gpu-pipelined
pipeline_tag: text-generation
---
# Gemma 4 E4B (text): Apple Core AI (`.aimodel`)
**Gemma 4 E4B's text decoder converted to Apple's Core AI** (the Core ML successor announced
at WWDC26), running on iOS 27 / macOS 27 via Apple's `coreai-pipelined` GPU engine: **zero
custom kernels, greedy oracle 8/8 exact vs the fp32 Hugging Face reference on the Mac GPU and
the iPhone GPU (iPhone is 24/24 token-identical to the Mac on the determinism probe)**.
Converted **directly from Google's official QAT release**
[google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized):
bf16 weights **trained for q4_0 rounding**, and q4_0 *is* this bundle's quantization class
(per-block-32 absmax linear int4). Google publishes these checkpoints as "preserving similar
quality to bfloat16", so this int4 conversion carries that guarantee **by design**, not by
post-hoc gating.
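
To make "per-block-32 absmax linear int4" concrete, here is a minimal sketch of that quantization class in plain Python. It is an illustration, not the bundle's converter: the function names are hypothetical, and the symmetric code range of [-7, 7] with `scale = absmax / 7` is an assumption (real q4_0 implementations differ in the exact code range and rounding rule).

```python
import random

BLOCK = 32  # weights are grouped into blocks of 32 values, one scale per block


def quantize_block(values):
    """Quantize one block of 32 floats to signed 4-bit codes plus one scale."""
    assert len(values) == BLOCK
    absmax = max(abs(v) for v in values)
    # Assumption: symmetric codes in [-7, 7]; the real format may differ.
    scale = absmax / 7.0 if absmax > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Linear dequantization: each code is multiplied by the block scale."""
    return [c * scale for c in codes]


random.seed(0)
block = [random.uniform(-1.0, 1.0) for _ in range(BLOCK)]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
# Round-to-nearest keeps every element within half a quantization step.
print(f"scale={scale:.4f} max_err={max_err:.4f} bound={scale / 2:.4f}")
```

A QAT checkpoint is trained so that the model tolerates this rounding error, which is why converting it with the same quantization class is expected to cost little quality.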
> Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, engine patch stack:
> **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**;
> model card: [`zoo/gemma4-e4b.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-e4b.md).
<!-- gen-cards:use-it begin id=gemma-4-e4b (managed by scripts/gen-cards; edit cards.json / QuickStart.swift, not this block) -->
## Use it
▶️ **Run it (source)**: the [ChatDemo runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ChatDemo)
(GUI + CLI, one app for every chat model in the catalog):
```bash
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# → Run, then pick "Gemma 4 E4B" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model gemma-4-e4b --prompt "What can you do, offline?"
```
💻 **Build with it**: complete; the glue is kit API, so a copy-paste runs:
```swift
import CoreAIKit

let prompt = "What can you do, offline?"  // any user text
let chat = try await ChatSession(catalog: "gemma-4-e4b")
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device
```
The take-home is [`Examples/ChatDemo/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ChatDemo/Sources/QuickStart.swift):
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same `ChatSession` across turns for its transcript.
Multi-turn? Hold the `ChatSession` and call `respond(to:)` per turn; it keeps the
conversation history, and `streamResponse(to:)` yields tokens as they decode.
**Integration checklist**
- SPM: `https://github.com/john-rocky/coreai-kit` → product **CoreAIKit**
- Info.plist: none needed
- Entitlements: none needed (macOS)
- First run downloads the model (7.6 GB on Mac; progress via the `downloadProgress` callback);
  later runs load it from the local cache in Application Support
- Measure in Release: Debug is ~3× slower on per-token host work
<!-- gen-cards:use-it end -->
## Measured (greedy; M4 Max / iPhone 17 Pro, settled device)
| config | files | size | M4 Max decode / prefill | iPhone decode / prefill |
|---|---|---|---|---|
| ★ **provider** (runs on BOTH platforms) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin/` + `ios-frontend/gemma4_e4b_qat_gather_raw/` | 3.7 + 3.4 GB | 53.2 / 62.6 | **15.1 / 21.3** |
| ★ **provider, iPhone-ready AOT** | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/` (precompiled `.aimodelc`, **h18p = iPhone 17 Pro class only**) + the same tables | 3.7 + 3.4 GB | n/a | same as above (lets you skip the AOT step) |
| **tbl** (Mac-fastest) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/` + the two `embed_per_layer.*` table files | 3.7 + 2.7 GB | **55.8** / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |
On iPhone the working set stays small: the measured peak footprint is **2.2 GB** (4.2 GB of
headroom under the ~6.4 GB entitled limit), because the per-layer-embedding (PLE) table rides
as a clean mmap and the AOT executable pages are evictable. Both phases land exactly on the
bandwidth model (~2.1 GB int4/token).
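
The memory figures above can be checked with a few lines of arithmetic. This sketch only restates the published numbers; the helper names are hypothetical, and the 100 GB/s bandwidth passed to the decode ceiling is a placeholder chosen for illustration, not a measured value for either device.

```python
# Figures copied from the table and paragraph above (GB, as published).
ENTITLED_LIMIT_GB = 6.4     # approximate iPhone limit quoted in the tbl row
PEAK_FOOTPRINT_GB = 2.2     # measured provider-mode peak footprint on iPhone
TBL_GRAPH_GB = 3.7          # tbl bundle: decode graph
TBL_OWNED_TABLES_GB = 2.7   # tbl bundle: owned table buffers
GB_PER_TOKEN = 2.1          # int4 weights read per decoded token


def headroom_gb(limit_gb, footprint_gb):
    """Memory left under the limit, rounded to the precision of the inputs."""
    return round(limit_gb - footprint_gb, 1)


def decode_ceiling_tok_s(bandwidth_gb_s, gb_per_token=GB_PER_TOKEN):
    """Bandwidth-bound ceiling: each token must stream the weights once."""
    return round(bandwidth_gb_s / gb_per_token, 1)


provider_headroom = headroom_gb(ENTITLED_LIMIT_GB, PEAK_FOOTPRINT_GB)
tbl_headroom = headroom_gb(ENTITLED_LIMIT_GB, TBL_GRAPH_GB + TBL_OWNED_TABLES_GB)
tbl_fits = tbl_headroom > 0

print(provider_headroom)            # 4.2
print(tbl_fits)                     # False: tbl mode leaves no headroom
print(decode_ceiling_tok_s(100.0))  # 47.6 for the placeholder 100 GB/s
```

The tbl configuration fails on iPhone because its owned tables count fully against the limit, while the provider configuration reads the table through an mmap that the system can page out.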
## What E4B is (config + checkpoint verified)
Clean **dense** model, no MoE. 42 layers (full attention every 6th), hidden 2560,
intermediate 10240 uniform, 8 query heads / **2 KV heads**, dual head_dim 256/512, 18
KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded
KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in
`ios-frontend/gemma4_e4b_qat_gather_raw/`), final-logit softcap 30. The QAT checkpoint prunes
the never-used KV projections on the shared layers; the zoo's loader handles both layouts.
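
The counts in this section are mutually consistent, which a short sketch can confirm. The per-layer width of 256 is taken from the `ple_tokens [1,1,42,256]` shape in the run contract; the 1-based layer numbering used to list the full-attention layers is an assumption, since the text only says "every 6th".

```python
NUM_LAYERS = 42
FULL_ATTENTION_PERIOD = 6    # "full attention every 6th"
KV_SHARED_LAYERS = 18
VOCAB_ROWS = 262_144         # rows in the per-layer embedding table
PLE_DIM_PER_LAYER = 256      # from the ple_tokens [1,1,42,256] shape

# Assumption: 1-based counting, so layers 6, 12, ..., 42 use full attention.
full_attention_layers = [
    i for i in range(1, NUM_LAYERS + 1) if i % FULL_ATTENTION_PERIOD == 0
]
non_shared_layers = NUM_LAYERS - KV_SHARED_LAYERS
ple_row_width = NUM_LAYERS * PLE_DIM_PER_LAYER
ple_table_bytes = VOCAB_ROWS * ple_row_width  # int8: one byte per element

print(len(full_attention_layers))  # 7
print(non_shared_layers)           # 24
print(ple_row_width)               # 10752
print(ple_table_bytes)             # 2818572288
```

The last figure shows why the table is a multi-gigabyte asset on its own, and why the way it is mapped into memory matters on iPhone.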
## Run contract (each item is load-bearing)
Full story + traps:
[pipelined-engine page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).
1. Swift stack = `apple/coreai-models` + the zoo's patch stack
([`apps/*.patch`](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps), in
order). The ★ provider bundle needs `EngineOptions.perTokenInputProvider`
(`coreai-pipelined-per-token-inputs.patch`); the tbl bundle needs
`EngineOptions.staticInputBuffers` (`coreai-pipelined-static-inputs.patch`).
2. Provider mode: per token, fill `ple_tokens [1,1,42,256]` fp16 from the table dump:
`row = i8[id] * scale[id] * sqrt(256)`, mmap-gathered (~0.1 ms). tbl mode: bind
`ple_table` ← `embed_per_layer.i8` and `ple_scale` ← `embed_per_layer.scale.f32` as
**OWNED `storageModeShared` MTLBuffers** (buffer-backing traps in the knowledge page).
3. `COREAI_CHUNK_THRESHOLD=1` **before** engine creation; **never call `engine.warmup()`**
(S=1 graph; a 1-token generate after load is the warmup).
4. iPhone: **AOT is mandatory** (the 3.7 GB-constants graph crashes the on-device
specializer). Use the precompiled `_aotc_h18p/` bundle, or run
`xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu
--architecture h18p --expect-frequent-reshapes` and point `metadata.json`'s
`assets.main` at the `.aimodelc`. Ship the
`com.apple.developer.kernel.increased-memory-limit` entitlement as headroom insurance,
and bench a **settled** device (a just-unlocked iPhone under-reads ~35%).
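
The provider-mode gather in item 2 can be sketched in plain Python as follows. This is a reference sketch under stated assumptions, not the zoo's Swift implementation: it assumes a row-major int8 table of shape `[vocab, 42 * 256]`, one little-endian f32 scale per row, and layer-major ordering inside each row. The function name and the tiny synthetic files are hypothetical stand-ins for the real table dump.

```python
import math
import mmap
import os
import struct
import tempfile

NUM_LAYERS, PLE_DIM = 42, 256
ROW_WIDTH = NUM_LAYERS * PLE_DIM   # 10752 int8 values per token id
ROW_SCALE = math.sqrt(PLE_DIM)     # the sqrt(256) = 16 factor from item 2


def gather_ple_row(table_mm, scale_mm, token_id):
    """Dequantize one row: row = i8[id] * scale[id] * sqrt(256)."""
    raw = struct.unpack_from(f"{ROW_WIDTH}b", table_mm, token_id * ROW_WIDTH)
    (scale,) = struct.unpack_from("<f", scale_mm, token_id * 4)
    factor = scale * ROW_SCALE
    flat = [q * factor for q in raw]
    # Assumption: layer-major layout, reshaped to [42][256]. A real provider
    # would write these values as fp16 into the ple_tokens input buffer.
    return [flat[l * PLE_DIM:(l + 1) * PLE_DIM] for l in range(NUM_LAYERS)]


# Tiny synthetic stand-in for the real files (4 rows instead of 262144).
with tempfile.TemporaryDirectory() as tmp:
    table_path = os.path.join(tmp, "table.i8")
    scale_path = os.path.join(tmp, "scale.f32")
    with open(table_path, "wb") as f:
        for row in range(4):
            f.write(struct.pack(f"{ROW_WIDTH}b", *([row - 2] * ROW_WIDTH)))
    with open(scale_path, "wb") as f:
        f.write(struct.pack("<4f", 0.5, 0.25, 0.125, 0.0625))

    with open(table_path, "rb") as tf, open(scale_path, "rb") as sf:
        # Read-only mappings: only the pages of the requested row are touched.
        table_mm = mmap.mmap(tf.fileno(), 0, access=mmap.ACCESS_READ)
        scale_mm = mmap.mmap(sf.fileno(), 0, access=mmap.ACCESS_READ)
        ple = gather_ple_row(table_mm, scale_mm, token_id=3)
        table_mm.close()
        scale_mm.close()

print(len(ple), len(ple[0]), ple[0][0])  # 42 256 1.0
```

Because the mapping is read-only and file-backed, the untouched rows never need to be resident, which matches the small iPhone footprint reported for provider mode.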
Reproduce from scratch (oracle + tables are checkpoint-derived; regenerate them for any new
weights): [`conversion/export_gemma4_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_gemma4_decode_pipelined.py)
with `--hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized`.
## License
Gemma is provided under and subject to the **Gemma Terms of Use**
(https://ai.google.dev/gemma/terms). These `.aimodel` bundles are Model Derivatives of
[google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized);
by downloading or using them you agree to those terms, including the
[Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
Sibling repo (E2B, incl. its own official-QAT bundles):
[gemma-4-E2B-CoreAI](https://huggingface.co/mlboydaisuke/gemma-4-E2B-CoreAI).