	---
	license: gemma
	library_name: coreml
	base_model: google/gemma-4-E4B-it
	tags:
	- coreml
	- apple-silicon
	- ane
	- on-device
	- gemma-4
	- gemma-3n
	- text-generation
	pipeline_tag: text-generation
	---

	> 🆕 Multimodal version available — for text + image + video + audio,
	> use [`mlboydaisuke/gemma-4-E4B-multimodal-coreml`](https://huggingface.co/mlboydaisuke/gemma-4-E4B-multimodal-coreml).
	> Same Gemma 4 E4B decoder + ANE-targeted vision encoder + Conformer
	> audio encoder. Validated 2026-05-03 on iPhone 17 Pro at 15.7 tok/s
	> with all four input modalities working. This text-only repo stays
	> available for users who don't need vision/audio.

	## Use it from Swift

	<!-- swift-usage-begin -->
	### Add the package

	`Package.swift`:

	```swift
	.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

	// In your target:
	.product(name: "CoreMLLLM", package: "CoreML-LLM"),
	```

	Platforms: iOS 18+ / macOS 15+.

	### Download + chat (one call)

	```swift
	import CoreMLLLM

	// First call pulls the bundle from this repo to Documents/Models/.
	// Subsequent calls reuse the on-disk copy.
	let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E4B-coreml")

	let stream = try await llm.generate(
	    [CoreMLLLM.Message(role: .user, content: "Hello!")],
	    maxTokens: 256
	)
	for await chunk in stream {
	    print(chunk, terminator: "")
	}
	```

	Multi-turn: keep an `[CoreMLLLM.Message]` array, append the
	user/assistant turns, and pass the whole history to
	`generate(_:)` again. Call `llm.reset()` to start a new
	conversation (clears the KV cache).
	<!-- swift-usage-end -->



	# Gemma 4 E4B — Core ML (INT4, Apple Neural Engine)

	Core ML port of [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) (the 4B-effective Gemma 4 decoder), chunked into 4 sliding-window-attention pieces for Apple Neural Engine. Produced by [`john-rocky/CoreML-LLM`](https://github.com/john-rocky/CoreML-LLM) via:

	```bash
	python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048
	```

	## Files

	```
	chunk1.mlmodelc/                    # L0–11: INT4 palettized, owns its own KV
	chunk2.mlmodelc/                    # L12–23: emits kv13_/kv14_ aliases for producer L22/L23
	chunk3.mlmodelc/                    # L24–32: KV-shared
	chunk4.mlmodelc/                    # L33–41 + lm_head: multi-function (decode_q1 + verify_qK)

	embed_tokens_q8.bin                 640 MB        INT8 token embeddings (262144 × 2560)
	embed_tokens_scales.bin             512 KB
	embed_tokens_per_layer_q8.bin       2.6 GB        INT8 per-layer embeddings (PLE)
	embed_tokens_per_layer_scales.bin   512 KB
	per_layer_projection.bin            53 MB         fp16 PLE projection
	per_layer_norm_weight.bin           512 B         fp16 PLE norm
	cos_full.npy / cos_sliding.npy      4 MB / 2 MB   precomputed RoPE cos
	sin_full.npy / sin_sliding.npy      4 MB / 2 MB   precomputed RoPE sin

	model_config.json                   711 B         runtime config (used by the Swift app's loader)
	hf_model/
	├── tokenizer.json
	├── tokenizer_config.json
	├── config.json
	└── generation_config.json
	```

	The producer-layer KV outputs are always named `kv13_` / `kv14_`, regardless of the actual layer index (L22 / L23 in this model), so the Swift runtime on iOS needs no model-specific wiring.

	## Why so many sidecars (vs a single `model.mlpackage`)?

	Gemma 3n / 4 E-series uses a per-layer embedding (PLE) bank that's much larger than the token embedding (2.6 GB vs 640 MB here). Loading PLE through Core ML would dequantize the entire bank into the CPU heap and balloon `phys_footprint`. We mmap the raw INT8 + scale `.bin` files instead, dequantize the few rows touched per token in pure Swift, and feed the result to the chunks. The chunks themselves are pure transformer bodies and stay ANE-resident.
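
A minimal Python sketch of the row-at-a-time dequantization described above, run against a tiny synthetic table rather than the real sidecars. The layout is an assumption, not something this card confirms: row-major signed int8 values, one little-endian fp16 scale per row, and `value = q * scale`. The helper names are hypothetical; check `ChunkedEngine.swift` for the actual scheme (including signedness) before relying on it.

```python
import mmap
import os
import struct
import tempfile


def dequantize_row(q_path, scales_path, row, dim):
    """Return one dequantized row as a list of floats.

    ASSUMED layout: `dim` signed int8 values per row, stored contiguously,
    plus one little-endian fp16 scale per row in the scales file.
    """
    with open(q_path, "rb") as qf, open(scales_path, "rb") as sf:
        q_map = mmap.mmap(qf.fileno(), 0, access=mmap.ACCESS_READ)
        s_map = mmap.mmap(sf.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Only the pages backing this row are touched, not the whole file.
            q = struct.unpack_from(f"<{dim}b", q_map, row * dim)
            (scale,) = struct.unpack_from("<e", s_map, row * 2)
        finally:
            q_map.close()
            s_map.close()
    return [v * scale for v in q]


def demo_dequantize():
    """Write a 2-row synthetic table to disk and dequantize row 1."""
    dim = 4
    rows = [[1, -2, 3, -4], [10, 20, -30, 40]]
    scales = [0.5, 0.25]
    with tempfile.TemporaryDirectory() as tmp:
        q_path = os.path.join(tmp, "q8.bin")
        s_path = os.path.join(tmp, "scales.bin")
        with open(q_path, "wb") as f:
            for r in rows:
                f.write(struct.pack(f"<{dim}b", *r))
        with open(s_path, "wb") as f:
            for s in scales:
                f.write(struct.pack("<e", s))
        return dequantize_row(q_path, s_path, row=1, dim=dim)


if __name__ == "__main__":
    print(demo_dequantize())
```

Under these assumptions, a lookup in the real token table would read 2560 bytes plus a 2-byte scale per token, while the rest of the 640 MB file stays on disk.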

	The `.npy` RoPE tables are pre-baked at conversion-time so Swift doesn't need to ship a `cos`/`sin` builder.
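
For orientation, here is a sketch of what a `cos`/`sin` builder computes, using the standard RoPE frequencies. The `head_dim` and `theta` values below are illustrative only; this card does not state the values or the exact table shape used for the full and sliding tables, and the shipped `.npy` files may store the data differently (for example at full head width).

```python
import math


def rope_tables(ctx, head_dim, theta):
    """Build cos/sin tables shaped [ctx][head_dim // 2].

    Frequency i is theta ** (-2i / head_dim); entry [pos][i] is the
    cosine or sine of pos * frequency_i.
    """
    half = head_dim // 2
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(ctx)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(ctx)]
    return cos, sin


if __name__ == "__main__":
    # Illustrative parameters, NOT the values used by the conversion script.
    cos, sin = rope_tables(ctx=4, head_dim=8, theta=10000.0)
    print(len(cos), len(cos[0]))
```

Because the tables depend only on position, head dimension, and base frequency, they can be computed once at conversion time and loaded as constants at runtime.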

	## Tokenizer

	Already included in `hf_model/`. If you prefer the upstream copy:

	```python
	from transformers import AutoTokenizer
	tok = AutoTokenizer.from_pretrained("google/gemma-4-E4B-it")
	```

	## Standalone usage (Python / Mac)

	```python
	import json
	import coremltools as ct
	import numpy as np
	from huggingface_hub import snapshot_download

	local = snapshot_download("mlboydaisuke/gemma-4-E4B-coreml")
	with open(f"{local}/model_config.json") as f:
	    cfg = json.load(f)

	# CompiledMLModel is coremltools' loader for compiled .mlmodelc bundles.
	chunks = [ct.models.CompiledMLModel(f"{local}/chunk{i}.mlmodelc")
	          for i in range(1, 5)]
	```

	The `.mlmodelc` directories hold compiled Core ML programs, so no compile step is needed on macOS or iPhone. On a Mac (for example a Mac Studio), setting `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` runs them on the ANE directly.

	For the full PLE-aware decode loop, see [`Sources/CoreMLLLM/ChunkedEngine.swift`](https://github.com/john-rocky/CoreML-LLM/blob/main/Sources/CoreMLLLM/ChunkedEngine.swift) — that is the canonical implementation; mirror it in Python by:

	1. mmap'ing `embed_tokens_q8.bin` (uint8) + `embed_tokens_scales.bin` (fp16) and dequantizing the row for the current token,
	2. mmap'ing `embed_tokens_per_layer_q8.bin` + `embed_tokens_per_layer_scales.bin` (per-layer rows, dequant on demand),
	3. running `chunk1..chunk4`, threading `kv*` outputs from chunk2 as inputs to chunks 3–4 (KV alias names follow the producer-layer convention).
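
The data flow in step 3 can be sketched with stub chunks standing in for the Core ML models. The calling convention (`chunk(hidden, shared_kv) -> (hidden, outputs)`) and the tensor names `kv13_k`, `kv13_v`, `kv14_k`, `kv14_v` are hypothetical; only the `kv13_` / `kv14_` prefixes come from this card. Real chunks also take per-layer embeddings, RoPE tables, masks, and their own KV state, all omitted here.

```python
KV_PREFIXES = ("kv13_", "kv14_")  # alias prefixes named in this card


def decode_step(chunks, hidden):
    """Run one token through four chunk callables in order."""
    chunk1, chunk2, chunk3, chunk4 = chunks
    hidden, _ = chunk1(hidden, {})     # L0-11, owns its own KV
    hidden, out2 = chunk2(hidden, {})  # L12-23, emits the kv13_/kv14_ aliases
    # Forward only the producer-layer KV tensors to the KV-shared chunks.
    shared = {n: t for n, t in out2.items() if n.startswith(KV_PREFIXES)}
    hidden, _ = chunk3(hidden, shared)
    logits, _ = chunk4(hidden, shared)
    return logits


def make_stub(name, log, emits=None):
    """Stand-in for a Core ML chunk; records which KV names it received."""
    def run(hidden, shared_kv):
        log.append((name, sorted(shared_kv)))
        return hidden + 1, dict(emits or {})
    return run


def demo_decode_step():
    log = []
    chunks = [
        make_stub("chunk1", log),
        make_stub("chunk2", log, emits={
            "kv13_k": "K13", "kv13_v": "V13",  # hypothetical tensor names
            "kv14_k": "K14", "kv14_v": "V14",
            "other": "x",                      # non-KV output, not forwarded
        }),
        make_stub("chunk3", log),
        make_stub("chunk4", log),
    ]
    return decode_step(chunks, hidden=0), log


if __name__ == "__main__":
    logits, log = demo_decode_step()
    for name, kv_names in log:
        print(name, kv_names)
```

Swapping the stubs for wrappers around the loaded chunks keeps the same threading logic; the wrappers would be responsible for mapping to the real input and output names.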

	## iOS / Mac app

	Pick Gemma 4 E4B in the [`CoreMLLLMChat`](https://github.com/john-rocky/CoreML-LLM/tree/main/Examples/CoreMLLLMChat) model picker — it auto-downloads this repo and runs it via `ChunkedEngine`.

	## Architecture (vs E2B)

	| | E2B | E4B |
	|---|---:|---:|
	| `num_hidden_layers` | 35 | 42 |
	| `hidden_size` | 1536 | 2560 |
	| `num_key_value_heads` | 1 | 2 |
	| `intermediate_size` | 6144 | 10240 |
	| `num_kv_shared_layers` | 20 | 18 |
	| KV producers (sliding/full) | L13 / L14 | L22 / L23 |
	| Chunk boundaries | L0-7, L8-14, L15-24, L25-34 | L0-11, L12-23, L24-32, L33-41 |
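
The rows of this table are consistent with each other, which a few lines of Python can check. The sketch assumes (as an inference from the table, not a statement from the card) that the KV-shared layers are the final `num_kv_shared_layers` layers and that the two producers are the last two layers before them. `kv_layout` is a hypothetical helper.

```python
def kv_layout(num_layers, num_kv_shared, chunk_ranges):
    """Check that the chunks tile all layers, then derive the KV layout.

    chunk_ranges is a list of inclusive (first_layer, last_layer) pairs.
    """
    covered = [l for lo, hi in chunk_ranges for l in range(lo, hi + 1)]
    if covered != list(range(num_layers)):
        raise ValueError("chunks must cover every layer exactly once, in order")
    first_shared = num_layers - num_kv_shared
    return {
        "first_shared": first_shared,
        "producers": (first_shared - 2, first_shared - 1),
    }


if __name__ == "__main__":
    print("E4B", kv_layout(42, 18, [(0, 11), (12, 23), (24, 32), (33, 41)]))
    print("E2B", kv_layout(35, 20, [(0, 7), (8, 14), (15, 24), (25, 34)]))
```

For both models the derived producers match the table (L22 / L23 for E4B, L13 / L14 for E2B), and the first KV-shared layer coincides with the start of chunk 3.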

	## Benchmarks

	iPhone 17 Pro, INT4 palettized, ctx=2048, no speculative decoding:

	| Metric | Value |
	|---|---:|
	| Decode tok/s | ~14 tok/s |
	| Per-step latency | ~71 ms |
	| `phys_footprint` | ~4.5 GB |
	| ANE placement | 100% |

	## Context length

	The shipped bundle uses ctx=2048. To extend it, rebuild with `--ctx 4096` (or higher) on a sufficiently large Mac; the ANE rejects chunks whose declared context differs from `model_config.json`.

	## License

	Inherits the [Gemma terms of use](https://ai.google.dev/gemma/terms) from the base model.