Gemma 4 E2B – Core ML stateful (3-chunk Linear)

Stateful Core ML port of google/gemma-4-E2B-it. The KV cache lives on the ANE via MLState + slice_update, so the recurrent KV plumbing of the legacy bundle (see mlboydaisuke/gemma-4-E2B-coreml) is gone – Core ML manages the cache implicitly. Same model, smaller phys_footprint, faster cold start.

Files

chunk_1.mlmodelc/   # 155 MB weights – embed + L0–7
chunk_2.mlmodelc/   # 459 MB weights – L8–24 (merged middle)
chunk_3.mlmodelc/   # 527 MB weights – L25–34 + lm_head

embed_tokens_q8.bin              402 MB – INT8 token embeddings (262144 × 1536)
embed_tokens_scales.bin          512 KB
embed_tokens_per_layer_q8.bin    2.35 GB – INT8 per-layer embeddings (PLE)
embed_tokens_per_layer_scales.bin 512 KB
per_layer_projection.bin         27.5 MB
per_layer_norm_weight.bin        1 KB
cos_{full,sliding}.npy           8 MB / 4 MB – precomputed RoPE cos
sin_{full,sliding}.npy           8 MB / 4 MB – precomputed RoPE sin

model_config.json                620 B  – runtime config
hf_model/{tokenizer.json, tokenizer_config.json, config.json}

The .mlmodelc chunks declare MLState inputs internally, so the Swift runtime only needs to call make_state() once per chunk and pass the same state object back on each step.
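
The same pattern works from Python via coremltools. A minimal sketch, assuming a coremltools version whose CompiledMLModel supports make_state() and stateful predict (8.x); the input names and shapes below ("hidden_states", "position", 1 × 1 × 1536) are placeholders, read the real ones from the chunk's spec:

import numpy as np
import coremltools as ct

chunk = ct.models.CompiledMLModel("chunk_2.mlmodelc")   # compiled chunk from this repo
state = chunk.make_state()                              # allocate the KV-cache state once

hidden = np.zeros((1, 1, 1536), dtype=np.float16)       # placeholder single-token activations
for pos in range(4):                                    # one predict(...) per decoded token
    out = chunk.predict(
        {"hidden_states": hidden, "position": np.array([pos], dtype=np.int32)},
        state=state,                                    # pass the same state object every step
    )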

Why three chunks (not one mlpackage)

Each chunk weighs in below the iPhone ANE single-mlprogram compile envelope (~1 GB fp16). Splitting at L8 / L24 keeps every chunk safely under that limit and isolates the lm_head weights (which dominate chunk 3), so that chunk can be unloaded when memory is tight.

Why a separate PLE sidecar (not a Core ML weight)

Gemma 4 / 3n's PLE bank is 2.35 GB of INT8. Loading it through Core ML would dequantize the whole table into the CPU heap and add ~5 GB to phys_footprint. Instead, we mmap the raw bytes and dequantize only the few rows touched per token, in pure Swift. That keeps the resident on-device footprint at ~700 MB even with the full PLE on disk.
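
For reference, the row lookup is easy to prototype in Python with numpy. A hedged sketch, assuming a row-major 262144 × (35 × 256) INT8 layout with one fp16 scale per row (the 512 KB scales file is consistent with that); the authoritative layout is whatever Gemma4StatefulEngine.swift reads:

import numpy as np

VOCAB, ROW = 262144, 35 * 256    # assumed: one 256-wide PLE slice per layer, 35 layers

ple_q8     = np.memmap("embed_tokens_per_layer_q8.bin", dtype=np.int8,
                       mode="r", shape=(VOCAB, ROW))
ple_scales = np.memmap("embed_tokens_per_layer_scales.bin", dtype=np.float16,
                       mode="r", shape=(VOCAB,))

def ple_rows(token_ids):
    # Dequantize only the rows this step actually touches; the 2.35 GB table
    # itself stays on disk behind the mmap.
    q = np.asarray(ple_q8[token_ids], dtype=np.float32)
    return q * np.asarray(ple_scales[token_ids], dtype=np.float32)[:, None]

rows = ple_rows([2, 106, 2364])  # rows for the current step's token ids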

Tokenizer

Already in hf_model/. Or pull from upstream:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

Standalone usage (Python / Mac)

from huggingface_hub import snapshot_download
import coremltools as ct, json, numpy as np

local = snapshot_download("mlboydaisuke/gemma-4-E2B-stateful-coreml")
cfg   = json.load(open(f"{local}/model_config.json"))

# .mlmodelc bundles are already compiled, so load them with CompiledMLModel
# (ct.models.MLModel only accepts .mlmodel / .mlpackage):
chunks = [
    ct.models.CompiledMLModel(f"{local}/chunk_{i}.mlmodelc")
    for i in (1, 2, 3)
]
states = [m.make_state() for m in chunks]   # one persistent KV-cache state per chunk

# RoPE tables (concrete arrays, no builder needed):
cos_full    = np.load(f"{local}/cos_full.npy")
cos_sliding = np.load(f"{local}/cos_sliding.npy")
sin_full    = np.load(f"{local}/sin_full.npy")
sin_sliding = np.load(f"{local}/sin_sliding.npy")

The complete decode loop, including PLE dequant + state plumbing, is implemented in Swift in Sources/CoreMLLLM/Gemma4StatefulEngine.swift – mirror it in Python by passing each chunk's state object through every predict(...) call.
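
Continuing the snippet above, the per-token chunk chaining looks schematically like this; the names ("hidden_states", "position", "logits") are placeholders and the embedding / PLE / RoPE feeds are omitted, so treat the Swift engine as the source of truth:

def decode_step(hidden, pos):
    # Thread the activations through the three chunks, handing each chunk its
    # own persistent KV-cache state on every call.
    for chunk, state in zip(chunks, states):
        out = chunk.predict(
            {"hidden_states": hidden, "position": np.array([pos], dtype=np.int32)},
            state=state,
        )
        hidden = out.get("hidden_states", hidden)
    return int(np.argmax(out["logits"]))   # chunk 3 carries lm_head, so it yields logits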

iOS / Mac app

Pick Gemma 4 E2B (stateful, Linear) in CoreMLLLMChat – it auto-downloads this repo and runs it via Gemma4StatefulEngine.

Architecture

Same as the legacy E2B:

parameter                        value
num_hidden_layers                35
hidden_size                      1536
num_key_value_heads              1
intermediate_size                6144
num_kv_shared_layers             20
KV producers (sliding / full)    L13 / L14
sliding window                   512
context length                   1024
vocab                            262144

License

Inherits the Gemma terms of use.
