	---
	license: apache-2.0
	language:
	- en
	base_model: google/gemma-4-e2b-it
	tags:
	- executorch
	- quantized
	- int4
	- raspberry-pi
	- on-device
	- edge
	pipeline_tag: text-generation
	---

	# Gemma 4 E2B — INT4 ExecuTorch `.pte` for Raspberry Pi 5

	INT4-quantized, ExecuTorch-lowered `.pte` of [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it), packaged for Raspberry Pi 5 (Cortex-A76, 8 GB) deployment via the [ExecuTorch](https://pytorch.org/executorch) 1.2.0 Python runtime with the XNNPACK backend.

	This artifact is the deployable output of the full export → quantize → lower → runtime pipeline.

	Source code and documentation: https://github.com/bamb00boy/Gemma4_executorch_deployment

	## Contents

	| File | Size | Purpose |
	|---|---|---|
	| `gemma4_e2b_text_int4_extcache.pte` | 5.14 GB | The ExecuTorch program; load and run with `executorch==1.2.0` |
	| `tokenizer/tokenizer.json` | ~5 MB | HF fast tokenizer for Gemma 4 |
	| `tokenizer/tokenizer_config.json` | small | Tokenizer config (special tokens, chat template ref) |
	| `tokenizer/chat_template.jinja` | small | Gemma 4 chat template (used by `gemma4_terminal_chat.py`) |
	| `pi_runner.py` | ~250 lines | Self-contained one-shot runner: tokenize → generate → exit |
	| `gemma4_terminal_chat.py` | ~325 lines | Interactive multi-turn REPL with KV-cache reuse across turns |
	| `LICENSE` | n/a | Apache 2.0 (covers the weights; see [License](#license) below) |

	## Measured performance

	Identical 14-token prompt + 9-token decode for `"The capital of France is"`, bit-exact output across all rows.

	| Host | Role | Prompt feed | Decode | Total wall time |
	|---|---|---|---|---|
	| Raspberry Pi 5 (8 GB, Cortex-A76 @ 2.4 GHz, Ubuntu Server 24.04 LTS, microSD) | deployment target | 0.6–0.8 tok/s | 0.72–0.87 tok/s | 28–35 s |
	| MacBook Pro 14" (Apple M1 Pro, 6P+2E, 16 GB unified, macOS 26.3.1) | development reference | 7.20 tok/s | 8.66 tok/s | 2.99 s |
	| [potato-os/core llama.cpp on Pi 5](https://github.com/potato-os/core/blob/main/docs/benchmarks/gemma4-pi-benchmark-2026-04-04.md) | external reference (different runtime) | n/a | 6.71 tok/s | n/a |
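
As a rough consistency check on the table, total wall time should be close to the prompt tokens divided by the prompt-feed rate plus the decode tokens divided by the decode rate. The sketch below applies that to the 14-token prompt and 9-token decode. It assumes the totals exclude model load and tokenization time, which this card does not state.

```python
def estimated_wall_seconds(prompt_tokens, prompt_rate, decode_tokens, decode_rate):
    """Seconds spent feeding the prompt plus seconds spent decoding."""
    return prompt_tokens / prompt_rate + decode_tokens / decode_rate


PROMPT_TOKENS, DECODE_TOKENS = 14, 9

# Rates (tok/s) are copied from the table above.
m1 = estimated_wall_seconds(PROMPT_TOKENS, 7.20, DECODE_TOKENS, 8.66)
pi_fast = estimated_wall_seconds(PROMPT_TOKENS, 0.8, DECODE_TOKENS, 0.87)
pi_slow = estimated_wall_seconds(PROMPT_TOKENS, 0.6, DECODE_TOKENS, 0.72)

print(f"M1 Pro: {m1:.2f} s")
print(f"Pi 5:   {pi_fast:.1f} to {pi_slow:.1f} s")
```

The estimates come out at about 2.98 s for the M1 Pro and about 27.8 to 35.8 s for the Pi 5, in line with the reported 2.99 s and 28–35 s.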

	Pi 5 decode rate varies ±15% across sessions (small per-prompt sample, thermal state, default `schedutil` cpufreq governor). For stable benchmarking, pin the governor to `performance` and let the SoC return to <55°C between runs.
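
A pre-benchmark check along these lines could confirm both conditions before each run. This is a minimal sketch, not part of the shipped scripts: it reads the standard Linux sysfs entries for the cpufreq governor and SoC temperature, and it assumes `cpu0` and `thermal_zone0` are representative of the whole SoC on a Pi 5.

```python
from pathlib import Path

# Standard Linux sysfs locations; cpu0 and thermal_zone0 are assumed to be
# representative of the whole SoC.
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")


def read_text_or_none(path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def millidegrees_to_celsius(raw):
    """sysfs reports temperature in thousandths of a degree Celsius."""
    return int(raw) / 1000.0


def ready_for_benchmark(governor, temp_c, max_temp_c=55.0):
    """True when the governor is pinned and the SoC is below the limit."""
    return governor == "performance" and temp_c < max_temp_c


if __name__ == "__main__":
    governor = read_text_or_none(GOVERNOR_PATH)
    raw_temp = read_text_or_none(TEMP_PATH)
    if governor is None or raw_temp is None:
        print("cpufreq or thermal sysfs entries not found on this host")
    else:
        temp_c = millidegrees_to_celsius(raw_temp)
        print(f"governor={governor} temp={temp_c:.1f} C "
              f"ready={ready_for_benchmark(governor, temp_c)}")
```

On hosts without these sysfs entries (macOS, for example) the script only reports that they are missing.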

	Output quality: bit-exact 9/9 token match against the FP32 reference on the canonical prompt.

	The Pi 5 decode is roughly 7.7× to 9.3× slower than `llama.cpp` on the same hardware (6.71 tok/s against 0.72–0.87 tok/s). The shipped build uses `XnnpackPartitioner(per_op_mode=True)` to work around an ARM XNNPACK rejection bug in ExecuTorch 1.2.0, which was initially believed to be the entire cause of the gap. Three controlled follow-up experiments (PT2E quantization, `config_precisions=DYNAMIC_QUANT`, and ExecuTorch nightly 1.4.0.dev with the default fused partitioner) show that the partitioner mode is not the bottleneck: even with 508 fused subgraphs on the ARM nightly, the decode rate does not improve. The rejection bug itself is fixed in nightly, so stable 1.3+ should ship the fix.

	The remaining gap likely lies in the XNNPACK kernel format (compared with the hand-tuned ARM kernels behind llama.cpp's GGUF Q4_K_M), in whether the XNNPACK build inside the executorch wheel links KleidiAI, or in the external KV-cache materialization pattern. The full diagnosis, including the measured three-way Pi benchmark, is at [KNOWN_ISSUES.md #1](https://github.com/bamb00boy/Gemma4_executorch_deployment/blob/master/KNOWN_ISSUES.md) in the source repo. If maximum Pi 5 decode throughput is the priority, `llama.cpp` is the appropriate tool today.

	## Quick use on a Raspberry Pi 5

	```bash
	# 1. Download the bundle (~5.2 GB)
	pip install --user huggingface_hub
	hf download bamb00boy/gemma4-e2b-int4-executorch-pi5 --local-dir ~/gemma4

	# 2. Set up the runtime environment
	cd ~/gemma4
	python3 -m venv .venv && source .venv/bin/activate
	pip install --upgrade pip
	pip install torch==2.11.0 executorch==1.2.0 transformers==5.5.3

	# 3. Verify (should print "RESULT: PASS" and "The capital of France is Paris.")
	python pi_runner.py --verify

	# 4a. One-shot generation
	python pi_runner.py "Your prompt here" --max-new-tokens 50

	# 4b. Or an interactive multi-turn chat (KV-cache reused across turns)
	python gemma4_terminal_chat.py
	# Type a message + Enter. /help for commands. Ctrl+C or Ctrl+D to exit.
	```

	The Pi setup guide (OS install, performance tuning, SSH) lives in [docs/pi5_setup.md](https://github.com/bamb00boy/Gemma4_executorch_deployment/blob/master/docs/pi5_setup.md) in the source repo.

	## Use on other hosts

	The `.pte` runs on any host with ExecuTorch 1.2.0 + XNNPACK. It has been validated on:

	- aarch64 Linux (Raspberry Pi 5, Ubuntu Server 24.04)
	- macOS arm64 (Apple Silicon, used as the development reference)

	x86_64 Linux is expected to work (XNNPACK supports it) but is untested by this project.
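
To see which of these categories a given machine falls into, the host can be classified from Python's `platform` module. The sketch below is illustrative only; the `host_status` helper and its status strings are hypothetical and not part of the shipped scripts.

```python
import platform

# (system, machine) pairs as reported by platform.system() and
# platform.machine(). The groupings mirror the lists above.
VALIDATED = {("Linux", "aarch64"), ("Darwin", "arm64")}
EXPECTED_UNTESTED = {("Linux", "x86_64")}


def host_status(system, machine):
    """Classify a host by how this project has exercised it."""
    if (system, machine) in VALIDATED:
        return "validated"
    if (system, machine) in EXPECTED_UNTESTED:
        return "expected to work, untested"
    return "unknown"


print(host_status(platform.system(), platform.machine()))
```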

	## What's quantized, what's not

	| Component | Treatment |
	|---|---|
	| `nn.Linear` weights (~3.1 B params) | INT4 weight-only via torchao's `Int8DynamicActivationIntxWeightConfig` (stored unpacked as INT8 bytes on disk) |
	| `embed_tokens_per_layer` (~2.35 B params, the "E2B" trick) | INT8 per-row via a custom `Int8Embedding` module (see source repo's `scripts/_int8_embedding.py`) |
	| `embed_tokens` (~0.4 B params) | FP32; Gemma 4's model code performs direct weight slicing, which is incompatible with quantized tensor wrappers |
	| Layer norms, RoPE buffers, biases | FP32 |
	| Runtime K/V cache | FP32, externalized as program inputs/outputs (see source repo's `scripts/_external_cache.py`) |
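
For readers unfamiliar with per-row INT8 quantization of an embedding table, the idea can be sketched in plain Python. This is a generic symmetric scheme with one scale per row, written for illustration. It is not the code in `scripts/_int8_embedding.py`, and the function names are hypothetical.

```python
def quantize_row(row):
    """Symmetric INT8: one float scale per row, values in [-127, 127]."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float values from INT8 codes and the row scale."""
    return [v * scale for v in q]


def embedding_lookup(q_table, scales, token_id):
    """Dequantize only the row that is actually needed."""
    return dequantize_row(q_table[token_id], scales[token_id])


# Two rows with very different magnitudes.
table = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
quantized = [quantize_row(r) for r in table]
q_table = [q for q, _ in quantized]
scales = [s for _, s in quantized]
print(embedding_lookup(q_table, scales, 1))
```

Because each row carries its own scale, a row of small values (the second row here) keeps the full INT8 range instead of being crushed by the largest value elsewhere in the table.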

	Disk size: 5.14 GB. Runtime cache footprint: 18.9 MB across 15 layers (12 sliding-window @ `head_dim=256`, 3 full-attention @ `head_dim=512`).
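
The cache figure can be reproduced with simple arithmetic under two assumptions this card does not state: a single KV head per layer, and 512 cached positions in every layer (sliding-window and full-attention alike). The layer counts and `head_dim` values come from the line above; everything else in the sketch is a guess that happens to match the reported total.

```python
BYTES_PER_FP32 = 4
KV_TENSORS_PER_LAYER = 2  # one K and one V

# Assumptions, not stated in the card: one KV head per layer and
# 512 cached positions in every layer.
KV_HEADS = 1
CACHED_POSITIONS = 512

LAYERS = [(12, 256), (3, 512)]  # (layer count, head_dim) from the card


def cache_bytes(layers, positions, kv_heads):
    """Total bytes for FP32 K and V tensors across all layers."""
    return sum(
        count * KV_TENSORS_PER_LAYER * kv_heads * positions * head_dim * BYTES_PER_FP32
        for count, head_dim in layers
    )


total = cache_bytes(LAYERS, CACHED_POSITIONS, KV_HEADS)
print(f"{total} bytes = {total / 1e6:.1f} MB")
```

Under those assumptions the total is 18,874,368 bytes, or 18.9 MB in decimal units.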

	## Architecture & shape constraints

	- Sequence length: padded to 511 tokens at runtime (the `.pte` shape-specializes to the upper bound of the dynamic dim).
	- Decode: token-by-token (no batched prefill in this build).
	- Maximum total context: 511 tokens (prompt + generated combined).
	- Cache: externalized — 90 cache tensors are passed as graph inputs and 45 are returned as graph outputs each call (one K + one V per layer × 15 layers + sentinel for prefill vs decode).
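
Because prompt and generated tokens share the 511-token limit, a caller has to clamp the generation length to whatever room the prompt leaves. A minimal sketch of that budget check follows. The helper name is hypothetical, and whether the shipped scripts clamp or reject over-long requests is not stated here.

```python
MAX_CONTEXT = 511  # prompt + generated tokens combined, per the list above


def max_new_tokens_allowed(prompt_len, requested):
    """Clamp the requested generation length to what still fits."""
    if prompt_len >= MAX_CONTEXT:
        raise ValueError(
            f"prompt of {prompt_len} tokens leaves no room to generate"
        )
    return min(requested, MAX_CONTEXT - prompt_len)


print(max_new_tokens_allowed(14, 50))   # short prompt: request fits
print(max_new_tokens_allowed(500, 50))  # long prompt: request is clamped
```

With the 14-token benchmark prompt a 50-token request fits unchanged, while a 500-token prompt leaves room for only 11 new tokens.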

	## License

	The weights in this file are derived from [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it) and are licensed under Apache License 2.0 by Google DeepMind. Use is subject to:

	- [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0) (text included in this repo as `LICENSE`)
	- [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy)
	- [Gemma 4 Apache 2.0 announcement](https://ai.google.dev/gemma/apache_2)

	This is a derivative work: INT4 weight-only quantization of `nn.Linear` weights and INT8 per-row quantization of `embed_tokens_per_layer`, followed by ExecuTorch program lowering with the XNNPACK backend. No additional fine-tuning has been performed.

	The packaging code (`pi_runner.py`, `gemma4_terminal_chat.py`, and the export/quantize/lower pipeline) is released under MIT — see the [source GitHub repo](https://github.com/bamb00boy/Gemma4_executorch_deployment).

	## Attribution

	```
	Original Gemma 4 weights © Google DeepMind, released under Apache 2.0.
	INT4 quantization + ExecuTorch lowering: derivative work by the
	Gemma4_executorch_deployment contributors (https://github.com/bamb00boy/Gemma4_executorch_deployment).
	```

	## Citation

	If this artifact is useful in research, please cite both the original Gemma 4 release and this packaging:

	```bibtex
	@misc{gemma4-e2b-int4-executorch-pi5,
	  title = {Gemma 4 E2B INT4 ExecuTorch for Raspberry Pi 5},
	  author = {bamb00boy and Gemma4_executorch_deployment contributors},
	  year = {2026},
	  url = {https://huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5},
	  note = {Source repository: https://github.com/bamb00boy/Gemma4_executorch_deployment}
	}
	```

	For the upstream Gemma 4 model, see [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it).