Gemma 4 E2B — INT4 ExecuTorch `.pte` for Raspberry Pi 5

INT4-quantized, ExecuTorch-lowered .pte of google/gemma-4-e2b-it, packaged for Raspberry Pi 5 (Cortex-A76, 8 GB) deployment via the ExecuTorch 1.2.0 Python runtime with the XNNPACK backend.

This artifact is the deployable output of the full export → quantize → lower → runtime pipeline documented at:

Source code & documentation: https://github.com/bamb00boy/Gemma4_executorch_deployment

File	Size	Purpose
`gemma4_e2b_text_int4_extcache.pte`	5.14 GB	The ExecuTorch program — load + run with `executorch==1.2.0`
`tokenizer/tokenizer.json`	~5 MB	HF fast tokenizer for Gemma 4
`tokenizer/tokenizer_config.json`	small	Tokenizer config (special tokens, chat template ref)
`tokenizer/chat_template.jinja`	small	Gemma 4 chat template (used by `gemma4_terminal_chat.py`)
`pi_runner.py`	~250 lines	Self-contained one-shot runner: tokenize → generate → exit
`gemma4_terminal_chat.py`	~325 lines	Interactive multi-turn REPL with KV-cache reuse across turns
`LICENSE`	—	Apache 2.0 (covers the weights; see License below)

Measured performance

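Both ExecuTorch rows use the identical 14-token prompt and 9-token decode for "The capital of France is" and produce bit-exact output. The llama.cpp row is an external reference measured on a different runtime.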

Host	Role	Prompt feed	Decode	Total wall
Raspberry Pi 5 — 8 GB, Cortex-A76 @ 2.4 GHz, Ubuntu Server 24.04 LTS, microSD	deployment target	0.6–0.8 tok/s	0.72–0.87 tok/s	28–35 s
MacBook Pro 14" — Apple M1 Pro (6P+2E), 16 GB unified, macOS 26.3.1	development reference	7.20 tok/s	8.66 tok/s	2.99 s
potato-os/core llama.cpp on Pi 5	external reference (different runtime)	n/a	6.71 tok/s	n/a
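
As a rough consistency check on the table, total wall time should be close to the prompt tokens divided by the prompt-feed rate plus the decoded tokens divided by the decode rate. The sketch below applies that to the 14-token prompt and 9-token decode. The function name is illustrative, and ignoring model load and tokenization time is an assumption.

```python
def estimated_wall_seconds(prompt_tokens, prompt_tps, decode_tokens, decode_tps):
    """Estimate wall time as prompt-feed time plus decode time (load time ignored)."""
    return prompt_tokens / prompt_tps + decode_tokens / decode_tps


# MacBook Pro row: 7.20 tok/s prompt feed, 8.66 tok/s decode.
mac = estimated_wall_seconds(14, 7.20, 9, 8.66)

# Raspberry Pi 5 row: slowest and fastest ends of the reported ranges.
pi_slow = estimated_wall_seconds(14, 0.6, 9, 0.72)
pi_fast = estimated_wall_seconds(14, 0.8, 9, 0.87)

print(f"Mac estimate: {mac:.2f} s")
print(f"Pi 5 estimate: {pi_fast:.1f} to {pi_slow:.1f} s")
```

The estimates come out at about 2.98 s for the Mac and about 27.8 to 35.8 s for the Pi 5, close to the measured 2.99 s and 28–35 s.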

Pi 5 decode rate varies ±15% across sessions (small per-prompt sample, thermal state, default schedutil cpufreq governor). For stable benchmarking, pin the governor to performance and let the SoC return to <55°C between runs.
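
One way to enforce both conditions before each run is a small pre-flight check. This is a minimal sketch, not part of the shipped scripts: the function names are hypothetical, and the two sysfs paths are the usual Linux locations, which may differ on other images. Changing the governor itself requires root and is left out here.

```python
import time
from pathlib import Path

# Usual Linux sysfs locations; treat them as assumptions for your image.
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")


def read_governor(path=GOVERNOR_PATH):
    """Return the active cpufreq governor name, e.g. 'performance'."""
    return Path(path).read_text().strip()


def read_temp_c(path=THERMAL_PATH):
    """Return the SoC temperature in Celsius (sysfs reports millidegrees)."""
    return int(Path(path).read_text().strip()) / 1000.0


def wait_until_cool(threshold_c=55.0, poll_s=5.0, read=read_temp_c, sleep=time.sleep):
    """Block until the temperature is below threshold_c; return the last reading."""
    temp = read()
    while temp >= threshold_c:
        sleep(poll_s)
        temp = read()
    return temp


def ready_for_benchmark():
    """Check the governor, then wait for the SoC to cool before a run."""
    if read_governor() != "performance":
        raise RuntimeError("governor is not 'performance'; set it as root first")
    return wait_until_cool()
```

Calling `ready_for_benchmark()` before each timed run gives every measurement the same starting thermal state, which should narrow the session-to-session spread.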

Output quality: bit-exact 9/9 token match against the FP32 reference on the canonical prompt.

The Pi 5 decode is roughly 7.7× to 9.3× slower than llama.cpp on identical hardware (0.72–0.87 tok/s against 6.71 tok/s).

The shipped build uses XnnpackPartitioner(per_op_mode=True) to work around an ARM XNNPACK rejection bug in ExecuTorch 1.2.0, which was initially believed to be the entire cause of the gap. Three controlled follow-up experiments (PT2E quantization, config_precisions=DYNAMIC_QUANT, and ExecuTorch nightly 1.4.0.dev with the default fused partitioner) confirm that the partitioner mode is not the bottleneck: even with 508 fused subgraphs on the ARM nightly, the decode rate does not improve. The ARM XNNPACK rejection itself is fixed in nightly, and stable 1.3+ should ship the fix.

The remaining gap likely lives in one or more of: the XNNPACK kernel format (versus llama.cpp's hand-tuned ARM kernels for GGUF Q4_K_M), the KleidiAI link status of the executorch wheel's XNNPACK build, or the external KV-cache materialization pattern. The full diagnosis, with the measured three-way Pi benchmark, is in KNOWN_ISSUES.md #1 in the source repo. If maximum Pi 5 decode throughput is the priority, llama.cpp is the appropriate tool today.
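
The size of the gap follows directly from the decode rates in the performance table. The short sketch below computes it for both ends of the measured Pi 5 range; the constant and function names are illustrative only.

```python
LLAMA_CPP_DECODE_TPS = 6.71          # external reference row
PI5_DECODE_TPS_RANGE = (0.72, 0.87)  # measured ExecuTorch range on the Pi 5


def slowdown(reference_tps, measured_tps):
    """How many times slower the measured rate is than the reference."""
    return reference_tps / measured_tps


best_case = slowdown(LLAMA_CPP_DECODE_TPS, max(PI5_DECODE_TPS_RANGE))
worst_case = slowdown(LLAMA_CPP_DECODE_TPS, min(PI5_DECODE_TPS_RANGE))

print(f"{best_case:.1f}x to {worst_case:.1f}x slower")  # prints "7.7x to 9.3x slower"
```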

Quick use on a Raspberry Pi 5

# 1. Set up the runtime environment (a venv avoids the system-Python install block on Ubuntu 24.04)
mkdir -p ~/gemma4 && cd ~/gemma4
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install torch==2.11.0 executorch==1.2.0 transformers==5.5.3 huggingface_hub

# 2. Download the bundle (~5.2 GB)
hf download bamb00boy/gemma4-e2b-int4-executorch-pi5 --local-dir ~/gemma4

# 3. Verify (should print "RESULT: PASS" and "The capital of France is **Paris**.")
python pi_runner.py --verify

# 4a. One-shot generation
python pi_runner.py "Your prompt here" --max-new-tokens 50

# 4b. Or an interactive multi-turn chat (KV-cache reused across turns)
python gemma4_terminal_chat.py
# Type a message + Enter. /help for commands. Ctrl+C or Ctrl+D to exit.

The Pi setup guide (OS install, performance tuning, SSH) lives in docs/pi5_setup.md in the source repo.

Use on other hosts

The .pte should run on any host with ExecuTorch 1.2.0 and the XNNPACK backend. It has been validated on:

aarch64 Linux (Raspberry Pi 5, Ubuntu Server 24.04)
macOS arm64 (Apple Silicon, used as the development reference)

x86_64 Linux is expected to work (XNNPACK supports it) but is untested by this project.
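
A launcher script can report which of these categories the current host falls into before loading the 5 GB program. The following is a minimal sketch using only the standard library; the function name and the three status labels are hypothetical and not part of the shipped runners.

```python
import platform

# (system, machine) pairs listed above as validated.
VALIDATED = {("Linux", "aarch64"), ("Darwin", "arm64")}
# Expected to work through XNNPACK, but not tested by this project.
UNTESTED = {("Linux", "x86_64")}


def support_status(system, machine):
    """Classify a host as 'validated', 'untested', or 'unknown'."""
    key = (system, machine)
    if key in VALIDATED:
        return "validated"
    if key in UNTESTED:
        return "untested"
    return "unknown"


if __name__ == "__main__":
    # platform.system() / platform.machine() describe the running host.
    print(support_status(platform.system(), platform.machine()))
```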

What's quantized, what's not

Component	Treatment
`nn.Linear` weights (~3.1 B params)	INT4 weights with dynamically quantized INT8 activations via torchao's `Int8DynamicActivationIntxWeightConfig` (weights stored unpacked as INT8 bytes on disk)
`embed_tokens_per_layer` (~2.35 B params, the "E2B" trick)	INT8 per-row via a custom `Int8Embedding` module (see source repo's `scripts/_int8_embedding.py`)
`embed_tokens` (~0.4 B params)	FP32 — Gemma 4's model code performs direct weight slicing, which is incompatible with quantized tensor wrappers
Layer norms, RoPE buffers, biases	FP32
Runtime K/V cache	FP32, externalized as program inputs/outputs (see source repo's `scripts/_external_cache.py`)
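
To illustrate what "INT8 per-row" means for the per-layer embedding table, here is a minimal pure-Python sketch. It assumes symmetric quantization with one scale per row (scale = max |value| / 127); the actual `Int8Embedding` module in the source repo may use a different scheme, and all function names here are hypothetical.

```python
def quantize_row(row):
    """Symmetric INT8 quantization of one row: returns (integer codes, scale)."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


def embed_lookup(table, token_id):
    """Fetch one row and dequantize it, as an embedding lookup would."""
    codes, scale = table[token_id]
    return dequantize_row(codes, scale)


# Two toy embedding rows with very different magnitudes.
weights = [[0.5, -1.0, 0.25], [0.0, 0.02, -0.01]]
table = [quantize_row(row) for row in weights]
vector = embed_lookup(table, 0)
```

Because each row carries its own scale, a row of small values (the second one here) keeps its resolution instead of being crushed by the largest value elsewhere in the table.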

Disk size: 5.14 GB. Runtime cache footprint: 18.9 MB across 15 layers (12 sliding-window @ head_dim=256, 3 full-attention @ head_dim=512).
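
The cache figure can be reproduced with simple arithmetic. The sketch below assumes one K and one V tensor per layer in FP32, a single K/V head per layer, and 512 cache positions; the head and position counts are assumptions not stated in this card, and the authoritative shapes are in the source repo's `scripts/_external_cache.py`.

```python
BYTES_FP32 = 4
CACHE_POSITIONS = 512  # assumption: cache slots per layer
KV_HEADS = 1           # assumption: a single K/V head per cached layer


def layer_cache_bytes(head_dim, positions=CACHE_POSITIONS, kv_heads=KV_HEADS):
    """Bytes for one layer's K and V tensors stored in FP32."""
    return 2 * kv_heads * positions * head_dim * BYTES_FP32


sliding = 12 * layer_cache_bytes(256)  # 12 sliding-window layers
full = 3 * layer_cache_bytes(512)      # 3 full-attention layers
total = sliding + full

print(f"{total / 1e6:.1f} MB")  # prints "18.9 MB"
```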

Architecture & shape constraints

Sequence length: padded to 511 tokens at runtime (the .pte shape-specializes to the upper bound of the dynamic dim).
Decode: token-by-token (no batched prefill in this build).
Maximum total context: 511 tokens (prompt + generated combined).
Cache: externalized — 90 cache tensors are passed as graph inputs and 45 are returned as graph outputs each call (one K + one V per layer × 15 layers + sentinel for prefill vs decode).
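
In practice these constraints mean a caller has to budget the prompt and the generated tokens against the same 511-token limit, and pad the input up to the fixed length. The helpers below are a hedged sketch of that bookkeeping: the names, the right-side padding, the pad id of 0, and the returned mask are all assumptions, not the behavior of `pi_runner.py`.

```python
MAX_CONTEXT = 511  # prompt + generated tokens combined


def max_new_tokens_allowed(prompt_len, requested, max_context=MAX_CONTEXT):
    """Clamp the requested generation length to what still fits in the context."""
    if prompt_len >= max_context:
        raise ValueError(f"prompt of {prompt_len} tokens leaves no room to generate")
    return min(requested, max_context - prompt_len)


def pad_to_context(token_ids, pad_id=0, max_context=MAX_CONTEXT):
    """Right-pad token ids to the fixed length; return (padded ids, mask)."""
    if len(token_ids) > max_context:
        raise ValueError(f"{len(token_ids)} tokens exceed the {max_context}-token limit")
    n_pad = max_context - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask


# A 14-token prompt leaves room for at most 497 generated tokens.
budget = max_new_tokens_allowed(14, requested=600)
padded, mask = pad_to_context(list(range(1, 15)))
```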

License

The weights in this file are derived from google/gemma-4-e2b-it and are licensed under Apache License 2.0 by Google DeepMind. Use is subject to:

Apache License 2.0 (text included in this repo as LICENSE)
Gemma Prohibited Use Policy
Gemma 4 Apache 2.0 announcement

This is a derivative work: INT4 weight-only quantization of nn.Linear weights and INT8 per-row quantization of embed_tokens_per_layer, followed by ExecuTorch program lowering with the XNNPACK backend. No additional fine-tuning has been performed.

The packaging code (pi_runner.py, gemma4_terminal_chat.py, and the export/quantize/lower pipeline) is released under MIT — see the source GitHub repo.

Attribution

Original Gemma 4 weights © Google DeepMind, released under Apache 2.0.
INT4 quantization + ExecuTorch lowering: derivative work by the
Gemma4_executorch_deployment contributors (https://github.com/bamb00boy/Gemma4_executorch_deployment).

Citation

If this artifact is useful in research, please cite both the original Gemma 4 release and this packaging:

@misc{gemma4-e2b-int4-executorch-pi5,
  title  = {Gemma 4 E2B INT4 ExecuTorch for Raspberry Pi 5},
  author = {bamb00boy and Gemma4_executorch_deployment contributors},
  year   = {2026},
  url    = {https://huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5},
  note   = {Source repository: https://github.com/bamb00boy/Gemma4_executorch_deployment}
}

For the upstream Gemma 4 model, see google/gemma-4-e2b-it.
