# Gemma 4 E2B – Core ML stateful (3-chunk Linear)
Stateful Core ML port of google/gemma-4-E2B-it. The KV cache lives on the ANE via MLState + slice_update, so the recurrent KV plumbing of the legacy bundle (see mlboydaisuke/gemma-4-E2B-coreml) is gone: Core ML manages the cache implicitly. Same model, smaller phys_footprint, faster cold start.
## Files
```
chunk_1.mlmodelc/                   # 155 MB weights – embed + L0–7
chunk_2.mlmodelc/                   # 459 MB weights – L8–24 (merged middle)
chunk_3.mlmodelc/                   # 527 MB weights – L25–34 + lm_head
embed_tokens_q8.bin                 # 402 MB – INT8 token embeddings (262144 × 1536)
embed_tokens_scales.bin             # 512 KB
embed_tokens_per_layer_q8.bin       # 2.35 GB – INT8 per-layer embeddings (PLE)
embed_tokens_per_layer_scales.bin   # 512 KB
per_layer_projection.bin            # 27.5 MB
per_layer_norm_weight.bin           # 1 KB
cos_{full,sliding}.npy              # 8 MB / 4 MB – precomputed RoPE cos
sin_{full,sliding}.npy              # 8 MB / 4 MB – precomputed RoPE sin
model_config.json                   # 620 B – runtime config
hf_model/{tokenizer.json, tokenizer_config.json, config.json}
```
The .mlmodelc chunks declare MLState inputs internally, so the Swift runtime only needs to call make_state() once per chunk and pass the same state object back on each step.
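The same contract is easy to exercise from Python. A minimal sketch, assuming coremltools ≥ 8 (where compiled models expose make_state() and predict(..., state=...)) and placeholder I/O names (x_in / x_out stand in for the chunk's real feature names):

```python
import coremltools as ct
import numpy as np

chunk = ct.models.CompiledMLModel("chunk_1.mlmodelc")
state = chunk.make_state()  # once per chunk, at session start

x = np.zeros((1, 1, 1536), dtype=np.float16)  # hidden_size = 1536
for _ in range(4):
    # pass the SAME state object back on every step; Core ML mutates the
    # KV cache inside it via slice_update, nothing comes back to Python
    out = chunk.predict({"x_in": x}, state=state)
    x = out["x_out"]
```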
## Why three chunks (not one mlpackage)
Each chunk weighs in below the iPhone ANE single-mlprogram compile envelope (~1 GB fp16). Splitting after L7 and L24 keeps every chunk safely under the limit, and isolating the lm_head weights (which dominate chunk 3) means that chunk can be unloaded in memory-tight situations.
## Why a separate PLE sidecar (not a Core ML weight)
Gemma 4 / 3n's PLE bank is 2.35 GB of INT8. Loading it through Core ML would dequantize the whole table into the CPU heap and add ~5 GB to phys_footprint. Instead we mmap the raw bytes and dequantize only the few rows touched per token in pure Swift. That keeps the resident on-device footprint at ~700 MB even with the full PLE on disk.
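For reference, the same mmap-and-dequant trick in Python. A minimal sketch; the (262144, 35, 256) layout and per-row fp16 scales are assumptions inferred from the file sizes above, not read from the converter:

```python
import numpy as np

VOCAB, N_LAYERS, PLE_DIM = 262144, 35, 256  # assumed layout (matches 2.35 GB)

# np.memmap keeps the table on disk; only the pages actually touched
# become resident, so phys_footprint stays small.
ple_q8 = np.memmap("embed_tokens_per_layer_q8.bin", dtype=np.int8,
                   mode="r", shape=(VOCAB, N_LAYERS, PLE_DIM))
# one fp16 scale per token row: 262144 × 2 B = 512 KB
ple_scales = np.memmap("embed_tokens_per_layer_scales.bin", dtype=np.float16,
                       mode="r", shape=(VOCAB,))

def ple_rows(token_id: int) -> np.ndarray:
    """Dequantize one token's per-layer embeddings: (N_LAYERS, PLE_DIM) fp32."""
    return ple_q8[token_id].astype(np.float32) * float(ple_scales[token_id])
```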
## Tokenizer
Already in hf_model/. Or pull it from upstream:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")
```
## Standalone usage (Python / Mac)
```python
from huggingface_hub import snapshot_download
import coremltools as ct, json, numpy as np

local = snapshot_download("mlboydaisuke/gemma-4-E2B-stateful-coreml")
cfg = json.load(open(f"{local}/model_config.json"))

# The chunks are already compiled (.mlmodelc), so load them via
# CompiledMLModel; plain MLModel expects an .mlpackage/.mlmodel.
chunks = [
    ct.models.CompiledMLModel(f"{local}/chunk_{i}.mlmodelc")
    for i in (1, 2, 3)
]
states = [m.make_state() for m in chunks]

# RoPE tables (concrete arrays, no builder needed):
cos_full = np.load(f"{local}/cos_full.npy")
cos_sliding = np.load(f"{local}/cos_sliding.npy")
```
The complete decode loop, including PLE dequant and state plumbing, is implemented in Swift in Sources/CoreMLLLM/Gemma4StatefulEngine.swift; mirror it in Python by passing each chunk's state object through every predict(...) call.
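A skeletal mirror of that loop, continuing the snippet above. The chunk I/O names (hidden, position, logits) are assumptions, and embed() only covers the INT8 token table (the PLE hand-off is elided); treat the Swift engine as authoritative:

```python
# shapes and per-row scales inferred from the file sizes, as above
emb_q8 = np.memmap(f"{local}/embed_tokens_q8.bin", dtype=np.int8,
                   mode="r", shape=(262144, 1536))
emb_scales = np.memmap(f"{local}/embed_tokens_scales.bin", dtype=np.float16,
                       mode="r", shape=(262144,))

def embed(token_id: int) -> np.ndarray:
    # INT8 embedding row -> fp32, shape (1, 1, 1536); PLE dequant omitted here
    return (emb_q8[token_id].astype(np.float32)
            * float(emb_scales[token_id]))[None, None, :]

def decode_step(token_id: int, pos: int) -> np.ndarray:
    x = {"hidden": embed(token_id),
         "position": np.array([pos], dtype=np.int32)}
    for chunk, state in zip(chunks, states):
        out = chunk.predict(x, state=state)  # same state object on every step
        x = {**x, **out}                     # next chunk consumes this chunk's outputs
    return x["logits"]
```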
## iOS / Mac app

Pick Gemma 4 E2B (stateful, Linear) in CoreMLLLMChat; it auto-downloads this repo and runs it via Gemma4StatefulEngine.
## Architecture
Same as the legacy E2B:
| | value |
|---|---|
| num_hidden_layers | 35 |
| hidden_size | 1536 |
| num_key_value_heads | 1 |
| intermediate_size | 6144 |
| num_kv_shared_layers | 20 |
| KV producers (sliding/full) | L13 / L14 |
| sliding window | 512 |
| context length | 1024 |
| vocab | 262144 |
## License
Inherits the Gemma terms of use.