| --- |
| license: gemma |
| library_name: coreml |
| base_model: google/gemma-4-E4B-it |
| tags: |
| - coreml |
| - apple-silicon |
| - ane |
| - on-device |
| - gemma-4 |
| - gemma-3n |
| - text-generation |
| pipeline_tag: text-generation |
| --- |
| |
| > **Multimodal version available:** for text + image + video + audio, |
| > use [`mlboydaisuke/gemma-4-E4B-multimodal-coreml`](https://huggingface.co/mlboydaisuke/gemma-4-E4B-multimodal-coreml). |
| > Same Gemma 4 E4B decoder + ANE-targeted vision encoder + Conformer |
| > audio encoder. Validated 2026-05-03 on iPhone 17 Pro at 15.7 tok/s |
| > with all four input modalities working. This text-only repo stays |
| > available for users who don't need vision/audio. |
|
|
| ## Use it from Swift |
|
|
| <!-- swift-usage-begin --> |
| ### Add the package |
|
|
| `Package.swift`: |
|
|
| ```swift |
| .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"), |
| |
| // In your target: |
| .product(name: "CoreMLLLM", package: "CoreML-LLM"), |
| ``` |
|
|
| Platforms: iOS 18+ / macOS 15+. |
|
|
| ### Download + chat (one call) |
|
|
| ```swift |
| import CoreMLLLM |
| |
| // First call pulls the bundle from this repo to Documents/Models/. |
| // Subsequent calls reuse the on-disk copy. |
| let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E4B-coreml") |
| |
| let stream = try await llm.generate( |
|     [CoreMLLLM.Message(role: .user, content: "Hello!")], |
|     maxTokens: 256 |
| ) |
| for await chunk in stream { |
|     print(chunk, terminator: "") |
| } |
| ``` |
|
|
| Multi-turn: keep a `[CoreMLLLM.Message]` array, append the |
| user/assistant turns, and pass the whole history to |
| `generate(_:)` again. Call `llm.reset()` to start a new |
| conversation (clears the KV cache). |
| <!-- swift-usage-end --> |
|
|
|
|
|
|
| # Gemma 4 E4B: Core ML (INT4, Apple Neural Engine) |
|
|
| Core ML port of [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) (the 4B-effective Gemma 4 decoder), chunked into 4 sliding-window-attention pieces for Apple Neural Engine. Produced by [`john-rocky/CoreML-LLM`](https://github.com/john-rocky/CoreML-LLM) via: |
|
|
| ```bash |
| python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048 |
| ``` |
|
|
| ## Files |
|
|
| ``` |
| chunk1.mlmodelc/    # L0-11: INT4 palettized, owns its own KV |
| chunk2.mlmodelc/    # L12-23: emits kv13_*/kv14_* aliases for producer L22/L23 |
| chunk3.mlmodelc/    # L24-32: KV-shared |
| chunk4.mlmodelc/    # L33-41 + lm_head: multi-function (decode_q1 + verify_qK) |
| |
| embed_tokens_q8.bin                  640 MB       - INT8 token embeddings (262144 × 2560) |
| embed_tokens_scales.bin              512 KB |
| embed_tokens_per_layer_q8.bin        2.6 GB       - INT8 per-layer embeddings (PLE) |
| embed_tokens_per_layer_scales.bin    512 KB |
| per_layer_projection.bin             53 MB        - fp16 PLE projection |
| per_layer_norm_weight.bin            512 B        - fp16 PLE norm |
| cos_full.npy / cos_sliding.npy       4 MB / 2 MB  - precomputed RoPE cos |
| sin_full.npy / sin_sliding.npy       4 MB / 2 MB  - precomputed RoPE sin |
| |
| model_config.json                    711 B        - runtime config (used by the Swift app's loader) |
| hf_model/ |
| ├── tokenizer.json |
| ├── tokenizer_config.json |
| ├── config.json |
| └── generation_config.json |
| ``` |
|
|
| The Swift runtime renames the producer-layer KV outputs `kv13_*` / `kv14_*` regardless of actual layer index, so the iOS side needs no model-specific wiring. |
|
|
| ## Why so many sidecars (vs a single `model.mlpackage`)? |
|
|
| Gemma 3n / 4 E-series uses a per-layer embedding (PLE) bank that's much larger than the token embedding (2.6 GB vs 640 MB here). Loading PLE through Core ML would dequantize the entire bank into the CPU heap and balloon `phys_footprint`. We mmap the raw INT8 + scale `.bin` files instead, dequantize the few rows touched per token in pure Swift, and feed the result to the chunks. The chunks themselves are pure transformer bodies and stay ANE-resident. |
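
The sizes quoted above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-envelope check: the vocabulary size, hidden size and layer count come from this README, while the per-layer embedding width of 256 and the "one fp16 scale per vocabulary row" layout are assumptions chosen because they reproduce the listed file sizes.

```python
# Back-of-envelope check of the sidecar sizes listed under "Files".
VOCAB = 262_144       # from the file listing
HIDDEN = 2_560        # from the file listing / architecture table
NUM_LAYERS = 42       # from the architecture table
PLE_DIM = 256         # ASSUMPTION: per-layer embedding width

MIB = 1024 ** 2
GIB = 1024 ** 3

token_embed_bytes = VOCAB * HIDDEN            # INT8: 1 byte per weight
ple_bytes = VOCAB * NUM_LAYERS * PLE_DIM      # INT8: 1 byte per weight
scale_bytes = VOCAB * 2                       # ASSUMPTION: one fp16 scale per row

print(f"token embeddings: {token_embed_bytes / MIB:.0f} MiB")
print(f"PLE bank:         {ple_bytes / GIB:.3f} GiB")
print(f"scales (each):    {scale_bytes / 1024:.0f} KiB")
# If the whole PLE bank were expanded to fp16 in memory, it would double.
print(f"PLE bank as fp16: {2 * ple_bytes / GIB:.2f} GiB")
```

Under these assumptions an fp16 copy of the PLE bank alone would exceed the `phys_footprint` reported in the benchmarks, which is the motivation for dequantizing only the rows a token touches.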
|
|
| The `.npy` RoPE tables are pre-baked at conversion-time so Swift doesn't need to ship a `cos`/`sin` builder. |
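
For readers unfamiliar with what such a table contains, here is a minimal sketch of a generic RoPE cos/sin builder in NumPy. It is not the converter's code: the function name `rope_tables`, the `head_dim` and `theta` values, the float32 dtype and the "rotate-half" layout are all illustrative assumptions, and the shipped `.npy` files may differ in every one of them.

```python
import numpy as np

def rope_tables(ctx: int, head_dim: int, theta: float):
    """Return (cos, sin) tables of shape (ctx, head_dim).

    Uses the common 'rotate-half' layout, where the vector of
    head_dim/2 frequencies is repeated twice along the last axis.
    """
    exponents = np.arange(0, head_dim, 2, dtype=np.float64) / head_dim
    inv_freq = 1.0 / (theta ** exponents)                  # (head_dim/2,)
    positions = np.arange(ctx, dtype=np.float64)           # (ctx,)
    angles = np.outer(positions, inv_freq)                 # (ctx, head_dim/2)
    angles = np.concatenate([angles, angles], axis=-1)     # (ctx, head_dim)
    return np.cos(angles).astype(np.float32), np.sin(angles).astype(np.float32)

# Hypothetical parameters, for illustration only.
cos, sin = rope_tables(ctx=2048, head_dim=256, theta=10_000.0)
print(cos.shape, cos.dtype)
```

Baking the tables once at conversion time means the runtime only has to index a row by position at each decode step.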
|
|
| ## Tokenizer |
|
|
| Already included in `hf_model/`. If you prefer the upstream copy: |
|
|
| ```python |
| from transformers import AutoTokenizer |
| tok = AutoTokenizer.from_pretrained("google/gemma-4-E4B-it") |
| ``` |
|
|
| ## Standalone usage (Python / Mac) |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| import coremltools as ct, numpy as np, json |
| |
| local = snapshot_download("mlboydaisuke/gemma-4-E4B-coreml") |
| cfg = json.load(open(f"{local}/model_config.json")) |
| # Compiled .mlmodelc bundles load via CompiledMLModel (coremltools 7+), not MLModel. |
| chunks = [ct.models.CompiledMLModel(f"{local}/chunk{i}.mlmodelc") |
|           for i in range(1, 5)] |
| ``` |
|
|
| The `.mlmodelc` directories carry compiled Core ML programs, so no compile step is needed on macOS or iPhone. On a Mac Studio, setting `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` runs them on the ANE directly. |
|
|
| For the full PLE-aware decode loop, see [`Sources/CoreMLLLM/ChunkedEngine.swift`](https://github.com/john-rocky/CoreML-LLM/blob/main/Sources/CoreMLLLM/ChunkedEngine.swift), which is the canonical implementation; mirror it in Python by: |
|
|
| 1. mmap'ing `embed_tokens_q8.bin` (uint8) + `embed_tokens_scales.bin` (fp16) and dequantizing the row for the current token, |
| 2. mmap'ing `embed_tokens_per_layer_q8.bin` + `embed_tokens_per_layer_scales.bin` (per-layer rows, dequant on demand), |
| 3. running `chunk1..chunk4`, threading `kv*` outputs from chunk2 as inputs to chunks 3-4 (KV alias names follow the producer-layer convention). |
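
Steps 1 and 2 above could look roughly like the sketch below. It runs on tiny synthetic files rather than the real bundle, and it makes two assumptions that should be checked against `ChunkedEngine.swift`: that the raw bytes are signed INT8 values (mapped as `uint8` and reinterpreted), and that dequantization is a plain multiply by one fp16 scale per row. The helper name `dequant_row` is hypothetical.

```python
import os
import tempfile
import numpy as np

def dequant_row(q8, scales, token_id):
    """Dequantize a single row of a memory-mapped embedding bank.

    ASSUMPTION: symmetric INT8 with one fp16 scale per row, i.e.
    row_fp32 = int8_row * scale. The real formula may differ.
    """
    raw = q8[token_id].view(np.int8)          # reinterpret raw bytes as signed
    return raw.astype(np.float32) * np.float32(scales[token_id])

# Tiny synthetic stand-ins so the sketch runs without the 640 MB file.
vocab, dim = 8, 4
rng = np.random.default_rng(0)
tmp = tempfile.mkdtemp()
q_path = os.path.join(tmp, "embed_q8.bin")
s_path = os.path.join(tmp, "embed_scales.bin")
rng.integers(-127, 128, size=(vocab, dim), dtype=np.int8).tofile(q_path)
np.full(vocab, 0.5, dtype=np.float16).tofile(s_path)

# np.memmap leaves the bank on disk; indexing one row only touches its pages.
q8 = np.memmap(q_path, dtype=np.uint8, mode="r", shape=(vocab, dim))
scales = np.memmap(s_path, dtype=np.float16, mode="r", shape=(vocab,))

row = dequant_row(q8, scales, token_id=3)
print(row.shape, row.dtype)
```

With the real files, the shapes would come from `model_config.json` and the file sizes rather than being hard-coded, and the same helper would serve both the token-embedding bank and the PLE bank.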
|
|
| ## iOS / Mac app |
|
|
| Pick **Gemma 4 E4B** in the [`CoreMLLLMChat`](https://github.com/john-rocky/CoreML-LLM/tree/main/Examples/CoreMLLLMChat) model picker; it auto-downloads this repo and runs it via `ChunkedEngine`. |
|
|
| ## Architecture (vs E2B) |
|
|
| | | E2B | E4B | |
| |---|---:|---:| |
| | `num_hidden_layers` | 35 | **42** | |
| | `hidden_size` | 1536 | **2560** | |
| | `num_key_value_heads` | 1 | **2** | |
| | `intermediate_size` | 6144 | **10240** | |
| | `num_kv_shared_layers` | 20 | 18 | |
| | KV producers (sliding/full) | L13 / L14 | **L22 / L23** | |
| | Chunk boundaries | L0-7, L8-14, L15-24, L25-34 | L0-11, L12-23, L24-32, L33-41 | |
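
The producer rows in the table follow from the two layer counts. The sketch below derives them under one assumption that is consistent with both columns but not stated explicitly here: the last two layers that own their KV cache are the sliding-attention and full-attention producers, and every later layer reuses their KV. The function name `kv_layout` is hypothetical.

```python
def kv_layout(num_hidden_layers, num_kv_shared_layers):
    """Derive the KV-sharing split from two config values.

    ASSUMPTION: the last two non-shared layers act as the sliding and
    full-attention KV producers for all shared layers after them.
    """
    first_shared = num_hidden_layers - num_kv_shared_layers
    return {
        "own_kv_layers": (0, first_shared - 1),
        "sliding_producer": first_shared - 2,
        "full_producer": first_shared - 1,
        "shared_layers": (first_shared, num_hidden_layers - 1),
    }

for name, layers, shared in [("E2B", 35, 20), ("E4B", 42, 18)]:
    print(name, kv_layout(layers, shared))
```

For E4B this gives producers L22 / L23 and shared layers L24-41, which lines up with chunk3 and chunk4 being the KV-shared chunks.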
|
|
| ## Benchmarks |
|
|
| iPhone 17 Pro, INT4 palettized, ctx=2048, no speculative decoding: |
|
|
| | Metric | Value | |
| |---|---:| |
| | Decode tok/s | **~14 tok/s** | |
| | Per-step latency | ~71 ms | |
| | `phys_footprint` | ~4.5 GB | |
| | ANE placement | 100% | |
|
|
| ## Context length |
|
|
| Shipping bundle is ctx=2048. Rebuild with `--ctx 4096` (or higher) on a sufficiently large Mac to extend; the ANE rejects chunks whose declared context differs from `model_config.json`. |
|
|
| ## License |
|
|
| Inherits the [Gemma terms of use](https://ai.google.dev/gemma/terms) from the base model. |
|
|