---
license: apache-2.0
tags:
  - coreml
  - gemma4
  - multimodal
  - vision
  - on-device
  - ane
base_model: google/gemma-4-E2B-it
pipeline_tag: image-text-to-text
---

# Gemma 4 E2B — CoreML (ANE+GPU Optimized)

Converted from [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) for on-device inference on Apple devices via CoreML.

## Models

| File | Size | Description |
|------|------|-------------|
| `model.mlpackage` | 2.4 GB | Text decoder with stateful KV cache (int4 quantized) |
| `vision.mlpackage` | 322 MB | Vision encoder (SigLIP-based, 16 transformer layers) |
| `model_config.json` | — | Model configuration |
| `hf_model/tokenizer.json` | 31 MB | Tokenizer |

## Features

- **Multimodal**: image + text input → text output
- **ANE-optimized**: linear layers expressed as Conv2d, ANE-friendly RMSNorm, argmax computed inside the model
- **Stateful KV cache**: CoreML MLState API (iOS 18+)
- **Int4 quantized**: block-wise palettization (`group_size=32`)
- **HF-exact match**: generated caption matches the Hugging Face reference output ("solid red square centered on white background") ✅
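
To illustrate what block-wise int4 palettization means here, the following is a toy NumPy sketch (not coremltools' actual algorithm): each group of 32 weights gets its own 16-entry lookup table, and every weight is stored as a 4-bit index into that table.

```python
import numpy as np

def palettize_int4(weights, group_size=32):
    """Toy block-wise int4 palettization: per-group 16-entry LUT + 4-bit indices."""
    w = weights.reshape(-1, group_size)
    # 16 palette entries spaced uniformly over each group's value range
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    palettes = lo + (hi - lo) * np.linspace(0.0, 1.0, 16)[None, :]
    # nearest palette entry for each weight -> a 4-bit code in [0, 16)
    idx = np.abs(w[:, :, None] - palettes[:, None, :]).argmin(axis=2)
    # dequantize by looking the codes back up in the per-group tables
    dequant = np.take_along_axis(palettes, idx, axis=1)
    return idx.astype(np.uint8), palettes, dequant.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(4 * 32).astype(np.float32)
idx, palettes, w_hat = palettize_int4(w)
print(idx.max(), np.abs(w - w_hat).max())
```

In the actual model, coremltools' palettization tooling performs this grouping and table construction during conversion; the sketch only shows why `group_size=32` with 4-bit codes keeps per-weight storage at 4 bits plus a small per-group table.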

## Usage

```python
import coremltools as ct
import numpy as np

# Load models
vision = ct.models.MLModel('vision.mlpackage')
decoder = ct.models.MLModel('model.mlpackage')
state = decoder.make_state()

# Process image → vision features → text generation
```
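
The decode loop with the stateful KV cache looks roughly like the sketch below. The input/output feature names (`input_ids`, `next_token`), the end-of-sequence id, and the stub standing in for `decoder.predict(inputs, state)` are all assumptions for illustration; with the real `model.mlpackage` you would pass `decoder.make_state()` and call `decoder.predict` instead.

```python
import numpy as np

EOS_ID = 1  # placeholder end-of-sequence id (assumption)

def decoder_predict(inputs, state):
    """Stub standing in for decoder.predict(inputs, state).

    Pretends the in-model argmax emitted the next token id.
    """
    state.append(int(inputs["input_ids"][0, 0]))
    next_id = len(state) + 1 if len(state) < 5 else EOS_ID
    return {"next_token": np.array([[next_id]], dtype=np.int32)}

def generate(prompt_ids, max_new_tokens=16):
    state = []  # real code: state = decoder.make_state()
    token = None
    # prefill: feed the prompt token by token; the KV cache lives in `state`,
    # so each call only needs the newest token, not the whole sequence
    for tid in prompt_ids:
        out = decoder_predict({"input_ids": np.array([[tid]], dtype=np.int32)}, state)
        token = int(out["next_token"][0, 0])
    generated = []
    for _ in range(max_new_tokens):
        if token == EOS_ID:
            break
        generated.append(token)
        out = decoder_predict({"input_ids": np.array([[token]], dtype=np.int32)}, state)
        token = int(out["next_token"][0, 0])
    return generated

print(generate([10, 11, 12]))
```

Because the decoder returns an argmax-selected token id directly (in-model argmax), the host loop never has to pull full logits across the CPU/ANE boundary, which is the point of that optimization.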

See [CoreML-LLM](https://github.com/john-rocky/CoreML-LLM) for the full conversion pipeline and iOS sample app.

## Conversion

```bash
git clone https://github.com/john-rocky/CoreML-LLM
cd CoreML-LLM/conversion
pip install -r requirements.txt
python convert.py --model gemma4-e2b --context-length 512 --output ./output/gemma4-e2b
```