Update model card
README.md CHANGED
@@ -17,9 +17,9 @@ pipeline_tag: text-generation
17
18   INT4-quantized, ExecuTorch-lowered `.pte` of [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it), packaged for **Raspberry Pi 5 (Cortex-A76, 8 GB)** deployment via the [ExecuTorch](https://pytorch.org/executorch) 1.2.0 Python runtime with the XNNPACK backend.
19
20 - This artifact is the deployable output of the full export → quantize → lower → runtime
21
22 - **Source code &
23
24   ## Contents
25
@@ -89,7 +89,7 @@ x86_64 Linux is expected to work (XNNPACK supports it) but is untested by this p
89   | Component | Treatment |
90   |---|---|
91   | `nn.Linear` weights (~3.1 B params) | INT4 weight-only via torchao's `Int8DynamicActivationIntxWeightConfig` (stored unpacked as INT8 bytes on disk) |
92 - | `embed_tokens_per_layer` (~2.35 B params, the "E2B" trick) | INT8 per-row via a
93   | `embed_tokens` (~0.4 B params) | FP32 — Gemma 4's model code performs direct weight slicing, which is incompatible with quantized tensor wrappers |
94   | Layer norms, RoPE buffers, biases | FP32 |
95   | Runtime K/V cache | FP32, externalized as program inputs/outputs (see source repo's `scripts/_external_cache.py`) |
@@ -133,7 +133,7 @@ If this artifact is useful in research, please cite both the original Gemma 4 re
133   author = {bamb00boy and Gemma4_executorch_deployment contributors},
134   year = {2026},
135   url = {https://huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5},
136 - note = {Source
137   }
138   ```
139

17
18   INT4-quantized, ExecuTorch-lowered `.pte` of [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it), packaged for **Raspberry Pi 5 (Cortex-A76, 8 GB)** deployment via the [ExecuTorch](https://pytorch.org/executorch) 1.2.0 Python runtime with the XNNPACK backend.
19
20 + This artifact is the deployable output of the full export → quantize → lower → runtime pipeline documented at:
21
22 + **Source code & documentation:** https://github.com/bamb00boy/Gemma4_executorch_deployment
23
24   ## Contents
25

89   | Component | Treatment |
90   |---|---|
91   | `nn.Linear` weights (~3.1 B params) | INT4 weight-only via torchao's `Int8DynamicActivationIntxWeightConfig` (stored unpacked as INT8 bytes on disk) |
92 + | `embed_tokens_per_layer` (~2.35 B params, the "E2B" trick) | INT8 per-row via a custom `Int8Embedding` module (see source repo's `scripts/_int8_embedding.py`) |
93   | `embed_tokens` (~0.4 B params) | FP32 — Gemma 4's model code performs direct weight slicing, which is incompatible with quantized tensor wrappers |
94   | Layer norms, RoPE buffers, biases | FP32 |
95   | Runtime K/V cache | FP32, externalized as program inputs/outputs (see source repo's `scripts/_external_cache.py`) |

133   author = {bamb00boy and Gemma4_executorch_deployment contributors},
134   year = {2026},
135   url = {https://huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5},
136 + note = {Source repository: https://github.com/bamb00boy/Gemma4_executorch_deployment}
137   }
138   ```
139
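
For readers unfamiliar with the "INT8 per-row" treatment that the changed table row lists for `embed_tokens_per_layer`, here is a minimal pure-Python sketch of the general idea. It is not the repository's `Int8Embedding` module, whose implementation is not shown in this diff. The function names `quantize_rows_int8` and `lookup_row` are hypothetical, and symmetric scaling to the range [-127, 127] is an assumption.

```python
from typing import List, Tuple


def quantize_rows_int8(weight: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Quantize each embedding row separately (assumed scheme: symmetric, one scale per row)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max((abs(v) for v in row), default=0.0)
        # An all-zero row gets a scale of 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer step and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def lookup_row(q_rows: List[List[int]], scales: List[float], token_id: int) -> List[float]:
    """Dequantize only the row requested by token_id."""
    scale = scales[token_id]
    return [q * scale for q in q_rows[token_id]]


# Toy 3-token, 3-dimensional embedding table (illustrative values only).
weight = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0], [2.0, 1.0, -2.0]]
q_rows, scales = quantize_rows_int8(weight)
restored = lookup_row(q_rows, scales, 0)
```

Keeping one scale per row means a row with small values is not forced onto the coarse grid that a large-valued row would need, and a lookup only has to dequantize the rows it actually reads.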