---
license: apache-2.0
tags:
- coreml
- gemma4
- multimodal
- vision
- on-device
- ane
base_model: google/gemma-4-E2B-it
pipeline_tag: image-text-to-text
---

# Gemma 4 E2B — CoreML (ANE+GPU Optimized)

Converted from [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) for on-device inference on Apple devices via CoreML.

## Models

| File | Size | Description |
|------|------|-------------|
| `model.mlpackage` | 2.4 GB | Text decoder with stateful KV cache (int4 quantized) |
| `vision.mlpackage` | 322 MB | Vision encoder (SigLIP-based, 16 transformer layers) |
| `model_config.json` | — | Model configuration |
| `hf_model/tokenizer.json` | 31 MB | Tokenizer |

## Features

- **Multimodal**: image + text input → text output
- **ANE-optimized**: linear layers expressed as Conv2d, ANE-friendly RMSNorm, in-model argmax
- **Stateful KV cache**: MLState API (iOS 18+)
- **Int4 quantized**: block-wise palettization (group_size=32)
- **HF-exact match**: outputs "solid red square centered on white background", matching the Hugging Face reference model ✅

## Usage

```python
import coremltools as ct
import numpy as np

# Load the vision encoder and the stateful text decoder
vision = ct.models.MLModel('vision.mlpackage')
decoder = ct.models.MLModel('model.mlpackage')

# Allocate the KV-cache state (requires coremltools >= 8)
state = decoder.make_state()

# Process image → vision features → text generation
```

See [CoreML-LLM](https://github.com/john-rocky/CoreML-LLM) for the full conversion pipeline and iOS sample app.

## Conversion

```bash
git clone https://github.com/john-rocky/CoreML-LLM
cd CoreML-LLM/conversion
pip install -r requirements.txt
python convert.py --model gemma4-e2b --context-length 512 --output ./output/gemma4-e2b
```
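For intuition on the "block-wise palettization (group_size=32)" step above: each group of 32 weights shares its own 16-entry lookup table (int4 indices into the table). The toy sketch below is an illustration only — coremltools derives the lookup tables with k-means, while this sketch uses per-group quantile levels as a dependency-free stand-in, and the function names are made up for the example.

```python
import numpy as np

def palettize_blockwise(weights, group_size=32, n_bits=4):
    """Toy block-wise palettization: every group of `group_size` weights
    gets its own 2**n_bits-entry lookup table (LUT).
    Illustration only -- coremltools fits LUT entries with k-means;
    quantile levels are used here to keep the sketch dependency-free."""
    groups = weights.reshape(-1, group_size)
    n_levels = 2 ** n_bits  # 16 entries for int4
    luts, indices = [], []
    for group in groups:
        # 16 representative values per group (stand-in for k-means centroids)
        lut = np.quantile(group, np.linspace(0.0, 1.0, n_levels))
        # Map each weight to the index of its nearest LUT entry
        idx = np.abs(group[:, None] - lut[None, :]).argmin(axis=1)
        luts.append(lut)
        indices.append(idx)
    return np.array(luts), np.array(indices, dtype=np.uint8)

def depalettize(luts, indices):
    # Reconstruct approximate weights by LUT lookup
    return np.take_along_axis(luts, indices.astype(np.int64), axis=1)

w = np.random.randn(4, 64).astype(np.float32)
luts, idx = palettize_blockwise(w, group_size=32)
w_hat = depalettize(luts, idx).reshape(w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Storage-wise this is the int4 win: 32 float16 weights (64 bytes) become 32 four-bit indices (16 bytes) plus a shared 16-entry table, which is roughly how the 2.4 GB decoder package size comes about.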
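The stateful KV cache changes the shape of the generation loop: instead of re-feeding the whole sequence, you feed one token per call and the state object carries the cache between calls (in coremltools 8+, `decoder.predict(inputs, state=state)`). The sketch below stubs the decoder with a toy function so the loop is runnable without the `.mlpackage` files; the input/output names (`input_ids`, `token_id`) and the stub's behavior are assumptions for illustration, not the package's documented interface.

```python
import numpy as np

VOCAB = 8
EOS = 7

def stub_decoder(inputs, state):
    """Stand-in for `decoder.predict(inputs, state=state)`.
    The dict `state` plays the role of the MLState-backed KV cache:
    it persists across calls, so each call only needs the newest token."""
    state["pos"] += 1
    # Deterministic toy logic: emit increasing tokens, then EOS at step 5
    next_tok = EOS if state["pos"] >= 5 else state["pos"] % VOCAB
    # The converted model does argmax in-graph, so it returns a token id
    return {"token_id": np.array([next_tok])}

def generate(prompt_ids, max_new_tokens=16):
    state = {"pos": 0}  # stands in for `decoder.make_state()`
    out = list(prompt_ids)
    tok = prompt_ids[-1]
    for _ in range(max_new_tokens):
        result = stub_decoder({"input_ids": np.array([[tok]])}, state)
        tok = int(result["token_id"][0])
        out.append(tok)
        if tok == EOS:
            break
    return out

print(generate([2, 3]))  # → [2, 3, 1, 2, 3, 4, 7]
```

Because the model emits the argmaxed token id directly (the "in-model argmax" feature), the host loop never materializes the full logits tensor, which keeps the per-step traffic between app and model small.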