---
license: apache-2.0
tags:
- coreml
- gemma4
- multimodal
- vision
- on-device
- ane
base_model: google/gemma-4-E2B-it
pipeline_tag: image-text-to-text
---

# Gemma 4 E2B → CoreML (ANE+GPU Optimized)

Converted from [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) for on-device inference on Apple devices via CoreML.

## Models

| File | Size | Description |
|------|------|-------------|
| `model.mlpackage` | 2.4 GB | Text decoder with stateful KV cache (int4 quantized) |
| `vision.mlpackage` | 322 MB | Vision encoder (SigLIP-based, 16 transformer layers) |
| `model_config.json` | – | Model configuration |
| `hf_model/tokenizer.json` | 31 MB | Tokenizer |

|
## Features

- **Multimodal**: image + text input → text output
- **ANE-optimized**: Conv2d linear layers, ANE RMSNorm, in-model argmax
- **Stateful KV cache**: MLState API (iOS 18+)
- **Int4 quantized**: block-wise palettization (group_size=32)
- **HF-exact match**: reproduces the Hugging Face reference output ("solid red square centered on white background" ✓)

| |
## Usage

```python
import coremltools as ct
import numpy as np

# Load models
vision = ct.models.MLModel('vision.mlpackage')
decoder = ct.models.MLModel('model.mlpackage')
state = decoder.make_state()

# Process image → vision features → text generation
```
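Because the decoder holds its KV cache in MLState and applies argmax in-model, the generation loop reduces to feeding one token per step and reading one token back. The sketch below shows that loop with a toy stand-in for the CoreML decoder (the real `mlpackage` input/output names, and the prompt-prefill step, are assumptions not taken from this model):

```python
import numpy as np

class ToyDecoder:
    """Stand-in for the CoreML decoder. Like the real model, predict() mutates
    its KV-cache state and returns the argmax'd next token directly, so no
    logits tensor ever crosses the model boundary."""

    def make_state(self):
        return {"pos": 0}  # real models return an opaque MLState object

    def predict(self, inputs, state):
        state["pos"] += 1
        # Toy rule: emit token+1 each step, then EOS (=0) after 5 steps.
        nxt = 0 if state["pos"] >= 5 else int(inputs["input_ids"][0, 0]) + 1
        return {"next_token": np.array([[nxt]], dtype=np.int32)}

def generate(decoder, prompt_ids, eos_id=0, max_new_tokens=16):
    """Greedy decode: one token in, one token out, state carries the KV cache.
    (A real run would first prefill the full prompt; elided here.)"""
    state = decoder.make_state()
    token = prompt_ids[-1]
    out = []
    for _ in range(max_new_tokens):
        pred = decoder.predict({"input_ids": np.array([[token]], dtype=np.int32)}, state)
        token = int(pred["next_token"][0, 0])
        if token == eos_id:
            break
        out.append(token)
    return out

print(generate(ToyDecoder(), [7, 8, 9]))  # [10, 11, 12, 13]
```

Swapping `ToyDecoder` for the loaded `decoder` keeps the same loop shape; only the input/output feature names need to match what the converted `model.mlpackage` actually exposes.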

See [CoreML-LLM](https://github.com/john-rocky/CoreML-LLM) for the full conversion pipeline and iOS sample app.

## Conversion

```bash
git clone https://github.com/john-rocky/CoreML-LLM
cd CoreML-LLM/conversion
pip install -r requirements.txt
python convert.py --model gemma4-e2b --context-length 512 --output ./output/gemma4-e2b
```