---
license: apache-2.0
tags:
- coreml
- gemma4
- multimodal
- vision
- on-device
- ane
base_model: google/gemma-4-E2B-it
pipeline_tag: image-text-to-text
---

# Gemma 4 E2B → CoreML (ANE+GPU Optimized)

Converted from [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) for on-device inference on Apple devices via CoreML.

## Models

| File | Size | Description |
|------|------|-------------|
| `model.mlpackage` | 2.4 GB | Text decoder with stateful KV cache (int4 quantized) |
| `vision.mlpackage` | 322 MB | Vision encoder (SigLIP-based, 16 transformer layers) |
| `model_config.json` | – | Model configuration |
| `hf_model/tokenizer.json` | 31 MB | Tokenizer |

|
## Features

- **Multimodal**: image + text input → text output
- **ANE-optimized**: Conv2d linear layers, ANE RMSNorm, in-model argmax
- **Stateful KV cache**: MLState API (iOS 18+)
- **Int4 quantized**: block-wise palettization (group_size=32)
- **HF-exact match**: reproduces the Hugging Face reference output ("solid red square centered on white background" ✓)

| |
## Usage

```python
import coremltools as ct
import numpy as np

# Load models
vision = ct.models.MLModel('vision.mlpackage')
decoder = ct.models.MLModel('model.mlpackage')
state = decoder.make_state()

# Process image → vision features → text generation
```
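Because the decoder holds its KV cache in MLState and applies argmax in-model, the generation loop reduces to feeding one token per step and reading one token back. The sketch below shows that loop with a toy stand-in for the CoreML decoder (the real `mlpackage` input/output names, and the prompt-prefill step, are assumptions not taken from this model):

```python
import numpy as np

class ToyDecoder:
    """Stand-in for the CoreML decoder. Like the real model, predict() mutates
    its KV-cache state and returns the argmax'd next token directly, so no
    logits tensor ever crosses the model boundary."""

    def make_state(self):
        return {"pos": 0}  # real models return an opaque MLState object

    def predict(self, inputs, state):
        state["pos"] += 1
        # Toy rule: emit token+1 each step, then EOS (=0) after 5 steps.
        nxt = 0 if state["pos"] >= 5 else int(inputs["input_ids"][0, 0]) + 1
        return {"next_token": np.array([[nxt]], dtype=np.int32)}

def generate(decoder, prompt_ids, eos_id=0, max_new_tokens=16):
    """Greedy decode: one token in, one token out, state carries the KV cache.
    (A real run would first prefill the full prompt; elided here.)"""
    state = decoder.make_state()
    token = prompt_ids[-1]
    out = []
    for _ in range(max_new_tokens):
        pred = decoder.predict({"input_ids": np.array([[token]], dtype=np.int32)}, state)
        token = int(pred["next_token"][0, 0])
        if token == eos_id:
            break
        out.append(token)
    return out

print(generate(ToyDecoder(), [7, 8, 9]))  # [10, 11, 12, 13]
```

Swapping `ToyDecoder` for the loaded `decoder` keeps the same loop shape; only the input/output feature names need to match what the converted `model.mlpackage` actually exposes.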

See [CoreML-LLM](https://github.com/john-rocky/CoreML-LLM) for the full conversion pipeline and iOS sample app.

## Conversion

```bash
git clone https://github.com/john-rocky/CoreML-LLM
cd CoreML-LLM/conversion
pip install -r requirements.txt
python convert.py --model gemma4-e2b --context-length 512 --output ./output/gemma4-e2b
```