Gemma 4 E2B – Core ML stateful (3-chunk Linear)

Stateful Core ML port of google/gemma-4-E2B-it. The KV cache lives on the ANE via MLState + slice_update, so the recurrent KV plumbing of the legacy bundle (see mlboydaisuke/gemma-4-E2B-coreml) is gone – Core ML manages the cache implicitly. Same model, smaller phys_footprint, faster cold start.

Files

chunk_1.mlmodelc/   # 155 MB weights – embed + L0–7
chunk_2.mlmodelc/   # 459 MB weights – L8–24 (merged middle)
chunk_3.mlmodelc/   # 527 MB weights – L25–34 + lm_head

embed_tokens_q8.bin              402 MB – INT8 token embeddings (262144 × 1536)
embed_tokens_scales.bin          512 KB
embed_tokens_per_layer_q8.bin    2.35 GB – INT8 per-layer embeddings (PLE)
embed_tokens_per_layer_scales.bin 512 KB
per_layer_projection.bin         27.5 MB
per_layer_norm_weight.bin        1 KB
cos_{full,sliding}.npy           8 MB / 4 MB – precomputed RoPE cos
sin_{full,sliding}.npy           8 MB / 4 MB – precomputed RoPE sin

model_config.json                620 B  – runtime config
hf_model/{tokenizer.json, tokenizer_config.json, config.json}

The .mlmodelc chunks declare MLState inputs internally, so the Swift runtime only needs to call make_state() once per chunk and pass the same state object back on each step.
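
The same pattern works from Python via coremltools. A minimal sketch, assuming a coremltools version whose CompiledMLModel supports make_state() and stateful predict (8.x); the input names and shapes below ("hidden_states", "position", 1 × 1 × 1536) are placeholders, read the real ones from the chunk's spec:

import numpy as np
import coremltools as ct

chunk = ct.models.CompiledMLModel("chunk_2.mlmodelc")   # compiled chunk from this repo
state = chunk.make_state()                              # allocate the KV-cache state once

hidden = np.zeros((1, 1, 1536), dtype=np.float16)       # placeholder single-token activations
for pos in range(4):                                    # one predict(...) per decoded token
    out = chunk.predict(
        {"hidden_states": hidden, "position": np.array([pos], dtype=np.int32)},
        state=state,                                    # pass the same state object every step
    )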

Why three chunks (not one mlpackage)

Each chunk weighs in below the iPhone ANE single-mlprogram compile envelope (~1 GB fp16). Splitting at L8 / L24 keeps every chunk safely under that limit and isolates the lm_head weights (which dominate chunk 3), so that chunk can be unloaded when memory is tight.

Why a separate PLE sidecar (not a Core ML weight)

Gemma 4 / 3n's PLE bank is 2.35 GB of INT8. Loading it through Core ML would dequantize the whole table into the CPU heap and add ~5 GB to phys_footprint. Instead, we mmap the raw bytes and dequantize only the few rows touched per token, in pure Swift. That keeps the resident on-device footprint at ~700 MB even with the full PLE on disk.
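
For reference, the row lookup is easy to prototype in Python with numpy. A hedged sketch, assuming a row-major 262144 × (35 × 256) INT8 layout with one fp16 scale per row (the 512 KB scales file is consistent with that); the authoritative layout is whatever Gemma4StatefulEngine.swift reads:

import numpy as np

VOCAB, ROW = 262144, 35 * 256    # assumed: one 256-wide PLE slice per layer, 35 layers

ple_q8     = np.memmap("embed_tokens_per_layer_q8.bin", dtype=np.int8,
                       mode="r", shape=(VOCAB, ROW))
ple_scales = np.memmap("embed_tokens_per_layer_scales.bin", dtype=np.float16,
                       mode="r", shape=(VOCAB,))

def ple_rows(token_ids):
    # Dequantize only the rows this step actually touches; the 2.35 GB table
    # itself stays on disk behind the mmap.
    q = np.asarray(ple_q8[token_ids], dtype=np.float32)
    return q * np.asarray(ple_scales[token_ids], dtype=np.float32)[:, None]

rows = ple_rows([2, 106, 2364])  # rows for the current step's token ids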

Tokenizer

Already in hf_model/. Or pull from upstream:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

Standalone usage (Python / Mac)

from huggingface_hub import snapshot_download
import coremltools as ct, json, numpy as np

local = snapshot_download("mlboydaisuke/gemma-4-E2B-stateful-coreml")
cfg   = json.load(open(f"{local}/model_config.json"))

# .mlmodelc bundles are already compiled, so load them with CompiledMLModel
# (ct.models.MLModel only accepts .mlmodel / .mlpackage):
chunks = [
    ct.models.CompiledMLModel(f"{local}/chunk_{i}.mlmodelc")
    for i in (1, 2, 3)
]
states = [m.make_state() for m in chunks]   # one persistent KV-cache state per chunk

# RoPE tables (concrete arrays, no builder needed):
cos_full    = np.load(f"{local}/cos_full.npy")
cos_sliding = np.load(f"{local}/cos_sliding.npy")
sin_full    = np.load(f"{local}/sin_full.npy")
sin_sliding = np.load(f"{local}/sin_sliding.npy")

The complete decode loop, including PLE dequant + state plumbing, is implemented in Swift in Sources/CoreMLLLM/Gemma4StatefulEngine.swift – mirror it in Python by passing each chunk's state object through every predict(...) call.
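
Continuing the snippet above, the per-token chunk chaining looks schematically like this; the names ("hidden_states", "position", "logits") are placeholders and the embedding / PLE / RoPE feeds are omitted, so treat the Swift engine as the source of truth:

def decode_step(hidden, pos):
    # Thread the activations through the three chunks, handing each chunk its
    # own persistent KV-cache state on every call.
    for chunk, state in zip(chunks, states):
        out = chunk.predict(
            {"hidden_states": hidden, "position": np.array([pos], dtype=np.int32)},
            state=state,
        )
        hidden = out.get("hidden_states", hidden)
    return int(np.argmax(out["logits"]))   # chunk 3 carries lm_head, so it yields logits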

iOS / Mac app

Pick Gemma 4 E2B (stateful, Linear) in CoreMLLLMChat – it auto-downloads this repo and runs it via Gemma4StatefulEngine.

Architecture

Same as the legacy E2B:

parameter                        value
num_hidden_layers                35
hidden_size                      1536
num_key_value_heads              1
intermediate_size                6144
num_kv_shared_layers             20
KV producers (sliding / full)    L13 / L14
sliding window                   512
context length                   1024
vocab                            262144

License

Inherits the Gemma terms of use.
