| --- |
| license: gemma |
| base_model: google/gemma-4-E2B-it |
| tags: |
| - coreai |
| - aimodel |
| - apple-silicon |
| - ane |
| - on-device |
| - gemma-4 |
| - custom-metal-kernels |
| - gpu-pipelined |
| pipeline_tag: text-generation |
| --- |
| |
| # Gemma 4 E2B (text): Apple Core AI (`.aimodel`) |
|
|
| **Gemma 4 E2B's text decoder converted to Apple's Core AI** (the Core ML successor announced at |
| WWDC26), ready to run on iOS 27 / macOS 27, with **greedy 8/8 exact vs the Hugging Face reference on |
| the iPhone GPU, the iPhone Neural Engine, and the Mac GPU.** The GPU bundles embed custom fused |
| int8/int4 Metal kernels *inside* the `.aimodel` (a Core AI feature); the ANE bundles are |
| kernel-free and numerically hardened for fp16 NPU execution. |
|
|
| This repo publishes **one set per platform × compute-unit: the best verified configuration** |
| (each file is the exact artifact behind the published numbers, nothing experimental) plus the |
| **`gpu-pipelined/` fast path: ONE kernel-free graph that is the fastest decode on BOTH Mac and |
| iPhone** (Apple's `coreai-pipelined` engine + the zoo's engine patch stack). |
|
|
| > Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, Swift runner: |
| > **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**. |
|
|
| <!-- gen-cards:use-it begin id=gemma-4-e2b (managed by scripts/gen-cards; edit cards.json / QuickStart.swift, not this block) --> |
| ## Use it |
|
|
| ▶️ **Run it (source)**: the [ChatDemo runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ChatDemo) |
| (GUI + CLI, one app for every chat model in the catalog): |
|
|
| ```bash |
| git clone https://github.com/john-rocky/coreai-kit |
| open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj |
| # Run the app, then pick "Gemma 4 E2B" in the model picker |
| |
| # agents / headless (macOS): |
| cd coreai-kit/Examples/ChatDemo |
| swift run chat-cli --model gemma-4-e2b --prompt "What can you do, offline?" |
| ``` |
|
|
| 💻 **Build with it** (complete; the glue is kit API, so copy-paste runs): |
|
|
| ```swift |
| import CoreAIKit |
| |
| let chat = try await ChatSession(catalog: "gemma-4-e2b") |
| let reply = try await chat.respond(to: prompt) |
| // reply: the answer, generated fully on-device |
| ``` |
|
|
| The take-home is [`Examples/ChatDemo/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ChatDemo/Sources/QuickStart.swift): |
| this exact code as one typed function, no UI; the CLI is an argument shell over it, and |
| the GUI drives the same `ChatSession` across turns for its transcript. |
| Multi-turn? Hold the `ChatSession` and call `respond(to:)` per turn; it keeps the |
| conversation history, and `streamResponse(to:)` yields tokens as they decode. |
|
|
| **Integration checklist** |
|
|
| - SPM: `https://github.com/john-rocky/coreai-kit` → product **CoreAIKit** |
| - Info.plist: none needed |
| - Entitlements: none needed (macOS) |
| - First run downloads the model (4.9 GB on Mac); after that it loads from the |
| local cache (Application Support; progress via the `downloadProgress` callback) |
| - Measure in Release: Debug is ~3× slower on per-token host work |
| <!-- gen-cards:use-it end --> |
|
|
| ## Pick your platform (measured: iPhone 17 Pro / M4 Max, greedy, 8/8 exact vs HF) |
|
|
| | Category | Files | Size | Decode | |
| |---|---|---|---| |
| | **iOS GPU** | `ios-frontend/gemma4_gather_raw/` + `ios-gpu/gemma4_e2b_metal_int4km_L35.aimodel` + `ios-gpu/gemma4_e2b_head_argmax_int4km.aimodel` | 2.6 + 1.3 + 0.2 GB | **22 tok/s** | |
| | **iOS ANE** | `ios-frontend/gemma4_gather_raw/` + `ios-ane/gemma4_e2b_hostcache_chunk{1..6}_int8.aimodel` + `ios-ane/gemma4_e2b_head_argmax_int8.aimodel` (+ `gemma4_chunks_plan.json`) | 2.6 + 1.8 + 0.4 GB | **6 tok/s** | |
| | **macOS GPU** | `macos/gemma4_e2b_frontend_int8.aimodel` + `macos/gemma4_e2b_metal_int8v3_L35.aimodel` + `macos/gemma4_e2b_head_argmax_kernel.aimodel` | 2.6 + 2.0 + 0.4 GB | **56.6–59.0 tok/s** (release build) | |
| | ★ **GPU pipelined (Mac + iOS)** | `gpu-pipelined/gemma4_e2b_decode_int4lin_tbl/` + `ios-frontend/gemma4_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32}` | 2.0 + 2.4 GB | **77.0 tok/s (M4 Max) · 30.3 tok/s (iPhone 17 Pro, AOT)** | |
| | ★ **GPU pipelined, iPhone-ready AOT** | `gpu-pipelined/gemma4_e2b_decode_int4lin_tbl_aotc_h18p/` (precompiled `.aimodelc`, **h18p = iPhone 17 Pro class only**) + the same two `gemma4_gather_raw` table files | 2.0 + 2.4 GB | same as above on iPhone; skip the AOT step | |
| | ★★ **GPU pipelined, official-QAT int4** | `gpu-pipelined/gemma4_e2b_qat_decode_int4lin_tbl/` (+ `…_tbl_aotc_h18p/` precompiled) + **`ios-frontend/gemma4_qat_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32}`** (QAT bundles need the QAT tables) | 2.0 + 2.4 GB | **78.9 tok/s (M4 Max) · 30.7 tok/s (iPhone)**: same speed, **int4 ≈ bf16 by design** (see below) | |
| | ★★★ **VISION (VL): image+text → text** | `gpu-pipelined/gemma4_e2b_qat_vl_decode_int4linsym_tbl/` (Mac) or `…_vl_decode_int4linsym/` + `…_aotc_h18p/` (iPhone, provider+AOT) + `gpu-pipelined/gemma4_e2b_qat_vl_vision/` + the QAT tables | 2.0 + 0.3 + 2.4 GB | **82.4 tok/s (M4 Max) · 25.5 tok/s (iPhone)**: the text decoder + a 3-line image splice | |
|
|
| (`ios-frontend/` is shared by both iPhone categories; download it once.) |
|
|
| Architecture is a **3-stage flow** (Gemma 4's giant embedding/PLE tables stay out of the graph): |
| `frontend gather (mmap / int8 gather) → 35-layer decode core → 262k-vocab head(+argmax)`. |
|
|
| - **iOS GPU core** = int4 k-means fused-kernel monolith (16-entry codebook staged in threadgroup |
| memory, packed nibble loads); the head does the 262,144-vocab matvec **and argmax in-kernel** |
| (returns (value,index) partials, so no logits readback). |
| - **iOS ANE chunks** = 6 fixed-shape chunks (the 35-layer monolith overflows the first-run ANE |
| compile) with the two fp16 hardening fixes baked in: RMSNorm via the `LayerNorm([x,-x])` |
| identity (fp32-accumulating LN kernel) and `Conv2d 1×1` projections (fp32 conv-engine MACs). |
| - **macOS core** = int8 k-means fused-kernel monolith (uint32-packed index loads). |
|
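
The `LayerNorm([x,-x])` identity from the ANE bullet can be checked numerically. Concatenating a vector with its negation forces the mean to zero, so the LayerNorm variance becomes mean(x²) and the first half of the output equals RMSNorm. The sketch below is a pure-Python illustration only: the function names and the epsilon are assumptions, not taken from the conversion code, and the learned gain is omitted.

```python
import math

def rms_norm(x, eps=1e-6):
    """Reference RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def layer_norm(x, eps=1e-6):
    """Plain LayerNorm without affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm_via_layer_norm(x, eps=1e-6):
    """RMSNorm expressed as LayerNorm over [x, -x], keeping the first half."""
    doubled = list(x) + [-v for v in x]  # mean is 0, variance is mean(x^2)
    return layer_norm(doubled, eps)[: len(x)]

x = [0.5, -1.25, 3.0, 2.0]
direct = rms_norm(x)
via_ln = rms_norm_via_layer_norm(x)
max_diff = max(abs(a - b) for a, b in zip(direct, via_ln))
print(max_diff)  # difference between the two formulations
```

The point of the rewrite on the ANE is that the LayerNorm kernel accumulates in fp32, whereas a hand-built fp16 square-and-mean would not.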
|
|
|
| ## ★★★ Vision (Gemma 4 E2B VL): image+text → text |
|
|
| The **same QAT checkpoint's vision path**, riding the text decoder via the zoo's |
| static-inputs patch. The image span is **causal** on E2B (verified vs the fp32 HF |
| mask dump), so positions/masks/KV need nothing new: |
|
|
| - `gpu-pipelined/gemma4_e2b_qat_vl_vision/`: fixed-grid vision encoder, run once |
| per image: `patches [2304,768] f16 → image_embeds [256,1536]` (square 768×768 = |
| 48×48 patches = 256 soft tokens; ~100–170 ms). |
| - Decoder: **Mac** = `gemma4_e2b_qat_vl_decode_int4linsym_tbl/` (tables in-graph, |
| **95.2 prefill / 82.4 decode tok/s** on M4 Max). **iPhone** = |
| `gemma4_e2b_qat_vl_decode_int4linsym{,_aotc_h18p}/` (provider mode, because the tbl |
| gather overflows an iOS per-encode scratch heap on this beta; **41.2 prefill / 25.5 |
| decode tok/s** on iPhone 17 Pro, footprint 1.96 GB) + the |
| `ios-frontend/gemma4_qat_gather_raw/` tables. |
| - Host contract: rewrite the prompt's 256 `<image_soft_token>` ids to extension |
| ids `V + slot`, bind `image_embeds [280,1536]` as a static buffer (square fills |
| rows 0..255); provider mode maps extension ids → the PLE pad row. Quantization |
| is **plain absmax int4 (`--lin-sym`)**, i.e. the QAT-q4_0 grid; clipping compounds |
| errors at long contexts. |
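
The id rewrite in the host contract above can be sketched as follows. This is a minimal illustration, not the zoo's code: it assumes `V` is the vocabulary size (262,144, the head size quoted earlier), that slots are numbered in prompt order, and it uses a made-up placeholder id for `<image_soft_token>` (take the real id from the tokenizer).

```python
VOCAB_SIZE = 262_144       # assumption: V is the vocabulary size
IMAGE_SOFT_TOKEN = 9_999   # hypothetical placeholder id, NOT the real tokenizer id

def splice_image_slots(token_ids, image_token_id, vocab_size, num_slots=256):
    """Replace the n-th image soft-token id with vocab_size + n (n = 0, 1, ...)."""
    out = []
    slot = 0
    for tok in token_ids:
        if tok == image_token_id:
            if slot >= num_slots:
                raise ValueError("more image soft tokens than slots")
            out.append(vocab_size + slot)  # extension id, indexes image_embeds[slot]
            slot += 1
        else:
            out.append(tok)
    if slot not in (0, num_slots):
        raise ValueError(f"expected {num_slots} image soft tokens, found {slot}")
    return out

# Text tokens, then a full 256-token image span, then more text.
prompt = [2, 10] + [IMAGE_SOFT_TOKEN] * 256 + [11]
spliced = splice_image_slots(prompt, IMAGE_SOFT_TOKEN, VOCAB_SIZE)
print(spliced[2], spliced[257], spliced[-1])
```

Text-only prompts pass through unchanged, and a partial image span raises instead of silently binding the wrong rows.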
| |
| Numerics: Mac engine ≡ python gate **24/24** token-for-token; margin-ruled exact |
| vs the fp32 HF oracle (a flip only where the oracle's top-2 gap < 0.1). Details + |
| conversion script: [`zoo/gemma4-vl.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-vl.md). |
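
One way to read the "margin-ruled exact" criterion is sketched below. It is an assumption-laden illustration rather than the zoo's gate script: it assumes a teacher-forced comparison (each step conditioned on the oracle prefix) and that the oracle's top-1 minus top-2 gap is available per step. The function name and inputs are hypothetical.

```python
def margin_ruled_exact(engine_tokens, oracle_tokens, oracle_top2_gaps, margin=0.1):
    """Return (ok, flips).

    A mismatch is tolerated only where the oracle's top-2 gap is below `margin`;
    any mismatch at a confidently-decided step fails the gate.
    """
    if not (len(engine_tokens) == len(oracle_tokens) == len(oracle_top2_gaps)):
        raise ValueError("inputs must have the same length")
    flips = []
    for step, (got, want, gap) in enumerate(
        zip(engine_tokens, oracle_tokens, oracle_top2_gaps)
    ):
        if got != want:
            flips.append(step)
            if gap >= margin:
                return False, flips  # disagreement where the oracle was decisive
    return True, flips

# Illustrative numbers only: step 1 differs, but the oracle's gap there is 0.04.
ok, flips = margin_ruled_exact([5, 9, 7, 3], [5, 8, 7, 3], [2.1, 0.04, 1.7, 0.9])
print(ok, flips)
```

Strict token-for-token equality is the special case `margin=0`.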
| |
| ## Run it |
| |
| Python (macOS 27): load with `coreai.runtime.AIModel` on the GPU delegate |
| (`SpecializationOptions.from_preferred_compute_unit_kind(ComputeUnitKind.gpu())`), drive |
| `frontend → core → head` per token. Swift/device: push the set into your app sandbox |
| (`xcrun devicectl device copy to --domain-type appDataContainer`). Walkthroughs + the burned-in |
| gotchas: [knowledge base](https://github.com/john-rocky/coreai-model-zoo/tree/main/knowledge) · |
| [Swift runtime notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/swift-runtime.md). |
| Tokenizer: use the original [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) |
| tokenizer files. |
| |
| Two device gotchas (measured on the beta, 2026-06-10): |
| 1. **Verify each multi-GB copy completed** (`xcrun devicectl device info files …`) *before* the |
| app's first load: loading a partially-copied `.aimodel` poisons the on-device specialization |
| cache for that content hash (later loads fail `ENOENT` even after the copy finishes). |
| 2. Optional **AOT**: `xcrun coreai-build compile <m>.aimodel --platform iOS --preferred-compute gpu |
| --architecture h18p` → a `.aimodelc` that skips the on-device compile (first load ~4× faster, |
| decode tok/s identical to the plain `.aimodel`). The arch name follows the **device identifier**, not the |
| marketing name: iPhone 17 Pro = `iPhone18,1` → `h18p` (an `h17p` build fails to load with |
| `invalidCompiledModel`). |
| |
| ⚠️ Known beta issue affecting all Core AI LLMs (these bundles use the host-cache form that dodges |
| it): [the KV-write bug page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/coreai-beta-mpsgraph-kvwrite-bug.md) |
| (FB23024751 / [apple/coreai-models#5](https://github.com/apple/coreai-models/issues/5)). |
| |
| ## ★ GPU-pipelined fast path (zero custom kernels): `gpu-pipelined/` |
| |
| One decode-only S=1 LanguageBundle (`input_ids [1,1]` static, dynamic position/KV, embed + |
| soft-capped head in-graph, and **the 2.3 GB per-layer-embedding table as a STATIC graph input** |
| gathered in-graph by token id) rides Apple's `coreai-pipelined` engine: async non-blocking |
| encode, on-GPU argmax, on-device KV growth. Measured (greedy; oracle 8/8, iPhone 24/24 |
| token-identical to Mac-GPU): **M4 Max 77.0 decode / 87.1 prefill · iPhone 17 Pro 30.3 / 38.9** |
| tok/s, vs this repo's kernel monoliths (Mac 56.6–59, iPhone 22), with no Metal kernels at all. |
|
|
| Run contract (each item is load-bearing; full story + traps in the zoo's |
| [pipelined-engine page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)): |
|
|
| 1. Swift stack = `apple/coreai-models` + the zoo's **4-patch stack** |
| ([`apps/*.patch`](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps), applied in |
| order); this bundle needs the `EngineOptions.staticInputBuffers` hook from |
| `coreai-pipelined-static-inputs.patch`. |
| 2. Bind the two table files (download from `ios-frontend/gemma4_gather_raw/`) as static inputs: |
| `ple_table` ← `embed_per_layer.i8`, `ple_scale` ← `embed_per_layer.scale.f32`, as **OWNED |
| `storageModeShared` MTLBuffers** (read the file in once). A `PROT_READ`-only mmap under |
| `makeBuffer(bytesNoCopy:)` silently costs ~65 ms/GB *per encode* on macOS; a writable COW |
| mmap is fine on the Mac but pays a residency tax on iPhone. |
| 3. Set `COREAI_CHUNK_THRESHOLD=1` **before** engine creation (prefill = pipelined S=1 steps); |
| **never call `engine.warmup()`** (it warms shape 256; the S=1 graph rejects it). A 1-token |
| generate after load is the warmup. |
| 4. iPhone: run **AOT first** (`xcrun coreai-build compile <bundle>.aimodel --platform iOS |
| --preferred-compute gpu --architecture h18p --expect-frequent-reshapes`), then point |
| `metadata.json`'s `assets.main` at the `.aimodelc` (on this beta the plain bundle passes |
| on-device specialization but the spec'd artifact asserts at first execute), or download |
| the precompiled `gpu-pipelined/gemma4_e2b_decode_int4lin_tbl_aotc_h18p/` (iPhone 17 Pro |
| class). Ship the |
| `com.apple.developer.kernel.increased-memory-limit` entitlement (the owned 2.35 GB table; |
| measured peak footprint 4.4 GB vs a ~6.4 GB entitled limit) and **bench a settled device** |
| (a just-unlocked iPhone under-reads ~35%). |
|
|
| **In-app**: the zoo's [CoreAIChat](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps/CoreAIChat) |
| ships this config as the Gemma **⚡** engine mode (GPU/ANE/⚡ segment); it downloads the |
| `_aotc_h18p` bundle plus the two table files and binds them as owned static buffers. |
| Chat-surface on a settled iPhone 17 Pro: **decode 32.7 / prefill 44.2 tok/s** on a 200-token |
| turn (vs 22 for the kernel-monolith GPU mode). First in-container load pays a one-time ~2 GB |
| spec-cache ingest (~11 s engine load, ~6 s warm) and can invalidate sibling models' cached |
| specializations once; the app's `GEMMA_CLEAR_SPEC_CACHE=1` hook recovers. |
|
|
| The per-token-provider variant (PLE rows filled per step by a host callback; |
| iPhone 26.5 decode / 40.5 prefill tok/s, no entitlement, clean mmap) is the lighter alternative; |
| reproduce it from the same conversion script |
| ([`conversion/export_gemma4_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_gemma4_decode_pipelined.py), |
| drop `--tbl`). |
|
|
| ### ★★ Official QAT weights: int4 quality guaranteed by design |
|
|
| `gpu-pipelined/gemma4_e2b_qat_decode_int4lin_tbl/` (+ the `_aotc_h18p/` precompile) is the |
| same graph re-exported from Google's official QAT release |
| [google/gemma-4-E2B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-unquantized): |
| bf16 weights **trained for q4_0 rounding**, and q4_0 *is* this bundle's quantization |
| (per-block-32 absmax-class linear int4). Google publishes these checkpoints as "preserving |
| similar quality to bfloat16", explicitly for custom downstream compilation, so the int4 |
| claim here upgrades from "PTQ that gates 8/8" to **int4 ≈ bf16 by design**. Measured: same |
| speed as the PTQ bundle (M4 Max 78.9 decode / 89.6 prefill; iPhone 17 Pro 30.7 / 36.7 |
| settled; oracle 8/8 on python, engine, and device). |
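
For readers unfamiliar with "per-block-32 absmax-class linear int4", the idea can be sketched as below. This is a simplified symmetric variant written for illustration. It is not bit-exact with ggml's q4_0 layout or with the exporter's `--lin-sym` path, and the function names are made up for this example.

```python
def quantize_int4_absmax(weights, block_size=32):
    """Symmetric per-block int4: one float scale per block, integer codes in [-7, 7]."""
    if len(weights) % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / 7.0 if absmax > 0 else 1.0  # largest magnitude maps to +/-7
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in block)
    return codes, scales

def dequantize_int4_absmax(codes, scales, block_size=32):
    """Linear dequantization: code times the scale of the block it belongs to."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]

weights = [(i - 16) / 16 for i in range(64)]  # two blocks of 32 values
codes, scales = quantize_int4_absmax(weights)
restored = dequantize_int4_absmax(codes, scales)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(len(scales), worst)
```

Because the scale comes from the block's absolute maximum, no value is clipped and the round-trip error stays within half a quantization step per block. QAT training targets that same kind of grid, which is why the quantized bundle tracks the bf16 checkpoint.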
| |
| ⚠️ **Pair QAT bundles with the QAT tables**: bind |
| `ios-frontend/gemma4_qat_gather_raw/{embed_per_layer.i8, embed_per_layer.scale.f32}`. |
| The PLE table is checkpoint-derived, so the original `gemma4_gather_raw/` files do NOT |
| match these weights. Everything else (patch stack, chunk threshold, entitlement, AOT) |
| is identical to the PTQ run contract above. **Gemma 4 E4B** (the bigger sibling, also |
| from official QAT weights) lives in its own repo: |
| [gemma-4-E4B-CoreAI](https://huggingface.co/mlboydaisuke/gemma-4-E4B-CoreAI). |
| |
| ## Parity |
| |
| All three sets reproduce the HF eager greedy reference **8/8 top-1 exact** ("What is the capital |
| of France?" → "The capital of France is **Paris**."), verified on macOS conversion and re-verified |
| end-to-end on device per compute unit. |
| |
| ## License |
| |
| Gemma is provided under and subject to the **Gemma Terms of Use** |
| (https://ai.google.dev/gemma/terms). These `.aimodel` bundles are Model Derivatives of |
| [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it); by downloading or using |
| them you agree to those terms, including the |
| [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). |
| |
| CoreML (iOS 18+) variants: [gemma-4-E2B-coreml](https://huggingface.co/mlboydaisuke/gemma-4-E2B-coreml) · |
| [gemma-4-E2B-stateful-coreml](https://huggingface.co/mlboydaisuke/gemma-4-E2B-stateful-coreml). |
| |