---
license: gemma
base_model: google/gemma-4-E4B-it-qat-q4_0-unquantized
tags:
- coreai
- aimodel
- apple-silicon
- on-device
- gemma-4
- qat
- gpu-pipelined
pipeline_tag: text-generation
---

# Gemma 4 E4B (text): Apple Core AI (`.aimodel`)

**Gemma 4 E4B's text decoder converted to Apple's Core AI** (the Core ML successor announced
at WWDC26), running on iOS 27 / macOS 27 via Apple's `coreai-pipelined` GPU engine: **zero
custom kernels, greedy oracle 8/8 exact vs the fp32 Hugging Face reference on the Mac GPU and
the iPhone GPU (iPhone is 24/24 token-identical to the Mac on the determinism probe)**.

Converted **directly from Google's official QAT release**
[google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized):
bf16 weights **trained for q4_0 rounding**, and q4_0 *is* this bundle's quantization class
(per-block-32 absmax linear int4). Google publishes these checkpoints as "preserving similar
quality to bfloat16", so this int4 conversion inherits that quality claim **by design**, not by
post-hoc gating.
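
To make "per-block-32 absmax linear int4" concrete, here is a minimal pure-Python sketch of that quantization class. It is an illustration only, not the zoo's converter: the function names are hypothetical, and mapping each block's largest magnitude onto code ±7 is an assumption, since the exact scale and rounding convention used by q4_0 and by this bundle is not spelled out here.

```python
def quantize_block_int4(block):
    """Quantize one block of floats to signed int4 codes sharing one absmax scale."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        # An all-zero block needs no scale; avoid dividing by zero.
        return 0.0, [0] * len(block)
    # Assumption: the largest magnitude in the block maps to code +/-7.
    scale = absmax / 7.0
    # Linear rounding; the clamp to the int4 range [-8, 7] is only a guard here.
    codes = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, codes


def dequantize_block_int4(scale, codes):
    """Reconstruct approximate floats from one block's scale and int4 codes."""
    return [scale * q for q in codes]


def quantize_int4_blocks(weights, block_size=32):
    """Split a flat weight list into blocks and quantize each independently."""
    if len(weights) % block_size != 0:
        raise ValueError("weight count must be a multiple of the block size")
    return [
        quantize_block_int4(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]
```

Under these assumptions the round-trip error of every weight is at most half its block's scale. Keeping weights robust to that kind of rounding is what quantization-aware training targets.
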

> Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, engine patch stack:
> **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**;
> model card: [`zoo/gemma4-e4b.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-e4b.md).

<!-- gen-cards:use-it begin id=gemma-4-e4b (managed by scripts/gen-cards: edit cards.json / QuickStart.swift, not this block) -->
## Use it

▶️ **Run it (source)**: the [ChatDemo runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ChatDemo)
(GUI + CLI, one app for every chat model in the catalog):

```bash
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# Run the app, then pick "Gemma 4 E4B" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model gemma-4-e4b --prompt "What can you do, offline?"
```

💻 **Build with it**: a complete example that runs as pasted; all the glue is kit API:

```swift
import CoreAIKit

// Any user text works here; this one mirrors the CLI example above.
let prompt = "What can you do, offline?"
let chat = try await ChatSession(catalog: "gemma-4-e4b")
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device
```

The take-home is [`Examples/ChatDemo/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ChatDemo/Sources/QuickStart.swift):
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same `ChatSession` across turns for its transcript.
For multi-turn use, hold the `ChatSession` and call `respond(to:)` once per turn; it keeps the
conversation history. `streamResponse(to:)` yields tokens as they decode.

**Integration checklist**

- SPM: `https://github.com/john-rocky/coreai-kit` → product **CoreAIKit**
- Info.plist: none needed
- Entitlements: none needed (macOS)
- First run downloads the model (7.6 GB on Mac), then it loads from the
  local cache (Application Support; progress via the `downloadProgress` callback)
- Measure in Release; Debug is ~3× slower on per-token host work
<!-- gen-cards:use-it end -->

## Measured (greedy; M4 Max / iPhone 17 Pro, settled device)

| config | files | size | M4 Max decode / prefill (tok/s) | iPhone decode / prefill (tok/s) |
|---|---|---|---|---|
| ★ **provider** (runs BOTH platforms) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin/` + `ios-frontend/gemma4_e4b_qat_gather_raw/` | 3.7 + 3.4 GB | 53.2 / 62.6 | **15.1 / 21.3** |
| ★ **provider, iPhone-ready AOT** | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/` (precompiled `.aimodelc`, **h18p = iPhone 17 Pro class only**) + the same tables | 3.7 + 3.4 GB | n/a | same as above; skip the AOT step |
| **tbl** (Mac-fastest) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/` + the two `embed_per_layer.*` table files | 3.7 + 2.7 GB | **55.8** / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |

On iPhone the working set stays small: measured peak footprint is **2.2 GB** (4.2 GB headroom),
because the PLE table rides as a clean mmap and the AOT executable pages are evictable. Both phases
land exactly on the bandwidth model (~2.1 GB int4/token).
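
The memory and bandwidth figures above are simple arithmetic, sketched below. This is back-of-envelope only: the helper names are hypothetical, the table rates are assumed to be tokens per second, and the "implied read rate" is just rate times ~2.1 GB per token, not a measured bandwidth.

```python
ENTITLED_LIMIT_GB = 6.4    # the "~6.4 GB entitled limit" quoted in the table
INT4_GB_PER_TOKEN = 2.1    # the "~2.1 GB int4/token" bandwidth-model figure


def headroom_gb(resident_gb, limit_gb=ENTITLED_LIMIT_GB):
    """Memory left under the limit for a given resident footprint."""
    return limit_gb - resident_gb


def implied_read_rate_gb_s(tokens_per_s, gb_per_token=INT4_GB_PER_TOKEN):
    """Weight bytes streamed per second if every token reads gb_per_token."""
    return tokens_per_s * gb_per_token


# Provider config on iPhone: 2.2 GB measured peak leaves about 4.2 GB.
provider_headroom = headroom_gb(2.2)

# tbl config: a 3.7 GB graph plus 2.7 GB of owned tables leaves essentially
# nothing, before any KV cache or runtime overhead is counted.
tbl_headroom = headroom_gb(3.7 + 2.7)

# Decode rates from the table, expressed as implied weight traffic.
mac_decode_gb_s = implied_read_rate_gb_s(53.2)     # about 112 GB/s
iphone_decode_gb_s = implied_read_rate_gb_s(15.1)  # about 32 GB/s
```

The provider path fits on iPhone because its table is an mmap rather than an owned buffer, so it does not count against the limit the way the tbl tables do.
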

## What E4B is (config + checkpoint verified)

A clean **dense** model, no MoE: 42 layers (full attention every 6th), hidden 2560,
intermediate 10240 uniform, 8 query heads / **2 KV heads**, dual head_dim 256/512, 18
KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded
KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in
`ios-frontend/gemma4_e4b_qat_gather_raw/`), final-logit softcap 30. The QAT checkpoint prunes
the never-used KV projections on the shared layers; the zoo's loader handles both layouts.
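
The numbers in this paragraph are mutually consistent, which a small sketch can check. The class and field names are hypothetical, not the checkpoint's config keys. The per-layer embedding width of 256 is taken from the `ple_tokens [1,1,42,256]` shape in the run contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class E4BTextShape:
    """Hypothetical container for the shape facts quoted on this card."""
    num_layers: int = 42
    full_attention_every: int = 6
    kv_shared_layers: int = 18
    vocab_size: int = 262144
    ple_dim: int = 256  # per-layer embedding width per layer

    @property
    def full_attention_layers(self):
        # "Full attention every 6th" of 42 layers.
        return self.num_layers // self.full_attention_every

    @property
    def non_shared_layers(self):
        # Layers that own their KV and get stacked into the unified KV pair.
        return self.num_layers - self.kv_shared_layers

    @property
    def ple_row_width(self):
        # One per-layer embedding slice per layer, concatenated per token.
        return self.num_layers * self.ple_dim


shape = E4BTextShape()
```

With the defaults, `shape.non_shared_layers` is 24 and `shape.ple_row_width` is 10752, matching the stacked-layer count and the second dimension of the int8 table.
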

## Run contract (each item is load-bearing)

Full story + traps:
[pipelined-engine page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).

1. Swift stack = `apple/coreai-models` + the zoo's patch stack
   ([`apps/*.patch`](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps), in
   order). The ★ provider bundle needs `EngineOptions.perTokenInputProvider`
   (`coreai-pipelined-per-token-inputs.patch`); the tbl bundle needs
   `EngineOptions.staticInputBuffers` (`coreai-pipelined-static-inputs.patch`).
2. Provider mode: per token, fill `ple_tokens [1,1,42,256]` fp16 from the table dump:
   `row = i8[id] * scale[id] * sqrt(256)`, mmap-gathered (~0.1 ms). tbl mode: bind
   `ple_table` → `embed_per_layer.i8` and `ple_scale` → `embed_per_layer.scale.f32` as
   **OWNED `storageModeShared` MTLBuffers** (buffer-backing traps in the knowledge page).
3. Set `COREAI_CHUNK_THRESHOLD=1` **before** engine creation; **never call `engine.warmup()`**
   (S=1 graph; a 1-token generate after load is the warmup).
4. iPhone: **AOT is mandatory** (the 3.7 GB-constants graph crashes the on-device
   specializer). Use the precompiled `_aotc_h18p/` bundle, or run
   `xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu
   --architecture h18p --expect-frequent-reshapes` and point `metadata.json`'s
   `assets.main` at the `.aimodelc`. Ship the
   `com.apple.developer.kernel.increased-memory-limit` entitlement as headroom insurance,
   and bench a **settled** device (a just-unlocked iPhone under-reads ~35%).
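
The provider-mode gather in item 2 can be sketched in pure Python as a reference for checking a Swift implementation. The file layout is an assumption, not something this card states: row-major int8 rows of width 42 × 256, one little-endian float32 scale per token, and layers laid out contiguously within a row. The function names are hypothetical.

```python
import math
import mmap
import struct


def open_table(path):
    """Map a table dump read-only so rows page in on demand."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def gather_ple_row(table_i8, scales_f32, token_id, num_layers=42, ple_dim=256):
    """Dequantize one token's row: i8[id] * scale[id] * sqrt(ple_dim).

    Returns num_layers lists of ple_dim floats, each rounded through fp16
    because the engine input is fp16.
    """
    row_width = num_layers * ple_dim
    start = token_id * row_width
    if token_id < 0 or start + row_width > len(table_i8):
        raise IndexError("token_id is outside the table")
    # Assumed layout: one little-endian float32 scale per token id.
    (scale,) = struct.unpack_from("<f", scales_f32, 4 * token_id)
    gain = scale * math.sqrt(ple_dim)
    codes = struct.unpack_from(f"<{row_width}b", table_i8, start)
    # Pack to IEEE half precision and back to apply fp16 rounding.
    packed = struct.pack(f"<{row_width}e", *(q * gain for q in codes))
    halves = struct.unpack(f"<{row_width}e", packed)
    # Assumed layout: layer-major, so each layer is one contiguous slice.
    return [
        list(halves[layer * ple_dim:(layer + 1) * ple_dim])
        for layer in range(num_layers)
    ]
```

Both arguments accept anything that supports the buffer protocol, so the same function works on `bytes` in a test and on the maps returned by `open_table` for the real dumps.
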

Reproduce from scratch (oracle + tables are checkpoint-derived, so regenerate them for any new
weights): [`conversion/export_gemma4_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_gemma4_decode_pipelined.py)
with `--hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized`.

## License

Gemma is provided under and subject to the **Gemma Terms of Use**
(https://ai.google.dev/gemma/terms). These `.aimodel` bundles are Model Derivatives of
[google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized);
by downloading or using them you agree to those terms, including the
[Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).

Sibling repo (E2B, incl. its own official-QAT bundles):
[gemma-4-E2B-CoreAI](https://huggingface.co/mlboydaisuke/gemma-4-E2B-CoreAI).