---
license: gemma
library_name: coreml
base_model: google/gemma-4-E4B-it
tags:
- coreml
- apple-silicon
- ane
- on-device
- gemma-4
- gemma-3n
- text-generation
pipeline_tag: text-generation
---
> **🆕 Multimodal version available**: for text + image + video + audio,
> use [`mlboydaisuke/gemma-4-E4B-multimodal-coreml`](https://huggingface.co/mlboydaisuke/gemma-4-E4B-multimodal-coreml).
> Same Gemma 4 E4B decoder + ANE-targeted vision encoder + Conformer
> audio encoder. Validated 2026-05-03 on iPhone 17 Pro at 15.7 tok/s
> with all four input modalities working. This text-only repo stays
> available for users who don't need vision/audio.
## Use it from Swift
<!-- swift-usage-begin -->
### Add the package
`Package.swift`:
```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```
Platforms: iOS 18+ / macOS 15+.
### Download + chat (one call)
```swift
import CoreMLLLM
// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E4B-coreml")
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
```
Multi-turn: keep an `[CoreMLLLM.Message]` array, append the
user/assistant turns, and pass the whole history to
`generate(_:)` again. Call `llm.reset()` to start a new
conversation (clears the KV cache).
<!-- swift-usage-end -->
# Gemma 4 E4B: Core ML (INT4, Apple Neural Engine)
Core ML port of [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) (the 4B-effective Gemma 4 decoder), chunked into 4 sliding-window-attention pieces for Apple Neural Engine. Produced by [`john-rocky/CoreML-LLM`](https://github.com/john-rocky/CoreML-LLM) via:
```bash
python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048
```
## Files
```
chunk1.mlmodelc/                                   # L0–11, INT4 palettized, owns its own KV
chunk2.mlmodelc/                                   # L12–23, emits kv13_*/kv14_* aliases for producer L22/L23
chunk3.mlmodelc/                                   # L24–32, KV-shared
chunk4.mlmodelc/                                   # L33–41 + lm_head, multi-function (decode_q1 + verify_qK)
embed_tokens_q8.bin                 640 MB         # INT8 token embeddings (262144 × 2560)
embed_tokens_scales.bin             512 KB
embed_tokens_per_layer_q8.bin       2.6 GB         # INT8 per-layer embeddings (PLE)
embed_tokens_per_layer_scales.bin   512 KB
per_layer_projection.bin            53 MB          # fp16 PLE projection
per_layer_norm_weight.bin           512 B          # fp16 PLE norm
cos_full.npy / cos_sliding.npy      4 MB / 2 MB    # precomputed RoPE cos
sin_full.npy / sin_sliding.npy      4 MB / 2 MB    # precomputed RoPE sin
model_config.json                   711 B          # runtime config (used by the Swift app's loader)
hf_model/
├── tokenizer.json
├── tokenizer_config.json
├── config.json
└── generation_config.json
```
The Swift runtime renames the producer-layer KV outputs `kv13_*` / `kv14_*` regardless of actual layer index, so the iOS side needs no model-specific wiring.
## Why so many sidecars (vs a single `model.mlpackage`)?
Gemma 3n / 4 E-series uses a per-layer embedding (PLE) bank that's much larger than the token embedding (2.6 GB vs 640 MB here). Loading PLE through Core ML would dequantize the entire bank into the CPU heap and balloon `phys_footprint`. We mmap the raw INT8 + scale `.bin` files instead, dequantize the few rows touched per token in pure Swift, and feed the result to the chunks. The chunks themselves are pure transformer bodies and stay ANE-resident.
The `.npy` RoPE tables are pre-baked at conversion time, so the Swift runtime doesn't need to ship a `cos`/`sin` builder.
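As a quick sanity check on the numbers above, the arithmetic below reproduces the token-embedding file sizes from the listed shape (262144 × 2560). It assumes one INT8 byte per weight and one fp16 scale per row. The per-row scale layout is an inference from the 512 KB scale file, not something the bundle documents.

```python
# Shape taken from the Files listing; the per-row fp16 scale is an assumption.
VOCAB, HIDDEN = 262144, 2560

embed_bytes = VOCAB * HIDDEN * 1      # INT8: one byte per weight
scale_bytes = VOCAB * 2               # fp16: two bytes per row (assumed layout)
row_bytes = HIDDEN * 1 + 2            # bytes read to dequantize one token row

print(embed_bytes / 2**20, "MiB")     # 640.0 MiB
print(scale_bytes / 2**10, "KiB")     # 512.0 KiB
print(row_bytes, "bytes per token")   # 2562 bytes per token
```

Under these assumptions a decode step reads only a few kilobytes of the token table, which is why mapping the file is so much cheaper than materializing it on the heap.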
## Tokenizer
Already included in `hf_model/`. If you prefer the upstream copy:
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-E4B-it")
```
## Standalone usage (Python / Mac)
```python
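import json
import coremltools as ct
from huggingface_hub import snapshot_download

local = snapshot_download("mlboydaisuke/gemma-4-E4B-coreml")
with open(f"{local}/model_config.json") as f:
    cfg = json.load(f)
# .mlmodelc is already compiled, so load it with CompiledMLModel
# (ct.models.MLModel expects an uncompiled .mlmodel / .mlpackage).
chunks = [ct.models.CompiledMLModel(f"{local}/chunk{i}.mlmodelc")
          for i in range(1, 5)]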
```
The `.mlmodelc` directories are precompiled Core ML programs, so no compile step is needed on macOS or iPhone. On a Mac Studio, setting `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` runs them on the ANE directly.
For the full PLE-aware decode loop, see [`Sources/CoreMLLLM/ChunkedEngine.swift`](https://github.com/john-rocky/CoreML-LLM/blob/main/Sources/CoreMLLLM/ChunkedEngine.swift), which is the canonical implementation. Mirror it in Python by:
1. mmap'ing `embed_tokens_q8.bin` (uint8) + `embed_tokens_scales.bin` (fp16) and dequantizing the row for the current token,
2. mmap'ing `embed_tokens_per_layer_q8.bin` + `embed_tokens_per_layer_scales.bin` (per-layer rows, dequant on demand),
3. running `chunk1..chunk4`, threading `kv*` outputs from chunk2 as inputs to chunks 3–4 (KV alias names follow the producer-layer convention).
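Steps 1 and 2 can be sketched with `numpy.memmap`. This is a minimal illustration on a tiny synthetic table, not the bundle's actual format: the row-major layout, signed INT8 storage, one fp16 scale per row, and the `q * scale` formula are all assumptions (the step above mentions uint8), and the helper names are hypothetical. Check `ChunkedEngine.swift` for the real dtype and dequantization rule before relying on it.

```python
import os
import tempfile

import numpy as np

def open_quantized_table(q_path, scale_path, rows, cols, q_dtype=np.int8):
    """Memory-map a quantized table and its per-row fp16 scales (assumed layout)."""
    q = np.memmap(q_path, dtype=q_dtype, mode="r", shape=(rows, cols))
    scales = np.memmap(scale_path, dtype=np.float16, mode="r", shape=(rows,))
    return q, scales

def dequant_row(q, scales, token_id):
    """Dequantize one row; only that row's pages are read from disk."""
    return q[token_id].astype(np.float16) * scales[token_id]

# Tiny synthetic files standing in for embed_tokens_q8.bin / embed_tokens_scales.bin.
rows, cols = 8, 4
tmp = tempfile.mkdtemp()
q_path = os.path.join(tmp, "q8.bin")
s_path = os.path.join(tmp, "scales.bin")
rng = np.random.default_rng(0)
rng.integers(-127, 128, size=(rows, cols), dtype=np.int8).tofile(q_path)
np.full(rows, 0.5, dtype=np.float16).tofile(s_path)

q, scales = open_quantized_table(q_path, s_path, rows, cols)
row = dequant_row(q, scales, token_id=3)
print(row.shape, row.dtype)  # (4,) float16
```

For the real files, `rows` and `cols` would come from the shapes in the Files listing (for example 262144 × 2560 for the token embeddings), and the same helper could serve the per-layer table if it follows the same layout.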
## iOS / Mac app
Pick **Gemma 4 E4B** in the [`CoreMLLLMChat`](https://github.com/john-rocky/CoreML-LLM/tree/main/Examples/CoreMLLLMChat) model picker; it auto-downloads this repo and runs it via `ChunkedEngine`.
## Architecture (vs E2B)
| | E2B | E4B |
|---|---:|---:|
| `num_hidden_layers` | 35 | **42** |
| `hidden_size` | 1536 | **2560** |
| `num_key_value_heads` | 1 | **2** |
| `intermediate_size` | 6144 | **10240** |
| `num_kv_shared_layers` | 20 | 18 |
| KV producers (sliding/full) | L13 / L14 | **L22 / L23** |
| Chunk boundaries | L0-7, L8-14, L15-24, L25-34 | L0-11, L12-23, L24-32, L33-41 |
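The rows of this table are consistent with each other, which the short check below illustrates. It assumes the KV-shared layers are the last `num_kv_shared_layers` layers and that the two producers are the last two layers that still own KV. That reading is inferred from the table, not confirmed by the converter source, and the helper names are hypothetical.

```python
# Values copied from the table above.
E2B = {"num_hidden_layers": 35, "num_kv_shared_layers": 20,
       "chunks": [(0, 7), (8, 14), (15, 24), (25, 34)]}
E4B = {"num_hidden_layers": 42, "num_kv_shared_layers": 18,
       "chunks": [(0, 11), (12, 23), (24, 32), (33, 41)]}

def kv_producers(cfg):
    """Last two layers that still own KV, assuming shared layers sit at the end."""
    first_shared = cfg["num_hidden_layers"] - cfg["num_kv_shared_layers"]
    return first_shared - 2, first_shared - 1

def chunk_of(cfg, layer):
    """1-based index of the chunk whose inclusive range contains `layer`."""
    for i, (lo, hi) in enumerate(cfg["chunks"], start=1):
        if lo <= layer <= hi:
            return i
    raise ValueError(f"layer {layer} out of range")

print(kv_producers(E2B), kv_producers(E4B))             # (13, 14) (22, 23)
print([chunk_of(E4B, l) for l in kv_producers(E4B)])    # [2, 2]
```

In both models the producers land at the end of chunk 2 and the first KV-shared layer opens chunk 3, which matches chunk 2 handing its KV outputs to chunks 3 and 4.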
## Benchmarks
iPhone 17 Pro, INT4 palettized, ctx=2048, no speculative decoding:
| Metric | Value |
|---|---:|
| Decode tok/s | **~14 tok/s** |
| Per-step latency | ~71 ms |
| `phys_footprint` | ~4.5 GB |
| ANE placement | 100% |
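The throughput and latency rows describe the same measurement, as the arithmetic below shows. The generation-time estimate is a rough back-of-the-envelope figure that ignores prefill and tokenization, and it assumes the per-step latency stays constant across the whole response.

```python
step_ms = 71                       # per-step latency from the table
tok_per_s = 1000 / step_ms
print(round(tok_per_s, 1))         # 14.1

max_tokens = 256                   # same budget as the Swift example
decode_seconds = max_tokens * step_ms / 1000
print(round(decode_seconds, 1))    # 18.2
```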
## Context length
The shipping bundle is built with ctx=2048. To extend it, rebuild with `--ctx 4096` (or higher) on a sufficiently large Mac; the ANE rejects chunks whose declared context differs from `model_config.json`.
## License
Inherits the [Gemma terms of use](https://ai.google.dev/gemma/terms) from the base model.