bamb00boy/gemma4-e2b-int4-executorch-pi5


bamb00boy committed 1 day ago

Commit 6de8bd2 · verified · 1 parent: 440ab62

Update model card

Files changed (1)

README.md +4 -4


@@ -17,9 +17,9 @@ pipeline_tag: text-generation
 INT4-quantized, ExecuTorch-lowered `.pte` of [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it), packaged for **Raspberry Pi 5 (Cortex-A76, 8 GB)** deployment via the [ExecuTorch](https://pytorch.org/executorch) 1.2.0 Python runtime with the XNNPACK backend.
-This artifact is the deployable output of the full export → quantize → lower → runtime recipe documented at:
+This artifact is the deployable output of the full export → quantize → lower → runtime pipeline documented at:
-**Source code & full recipe:** https://github.com/bamb00boy/Gemma4_executorch_deployment
+**Source code & documentation:** https://github.com/bamb00boy/Gemma4_executorch_deployment
 ## Contents
@@ -89,7 +89,7 @@ x86_64 Linux is expected to work (XNNPACK supports it) but is untested by this p
 | Component | Treatment |
 |---|---|
 | `nn.Linear` weights (~3.1 B params) | INT4 weight-only via torchao's `Int8DynamicActivationIntxWeightConfig` (stored unpacked as INT8 bytes on disk) |
-| `embed_tokens_per_layer` (~2.35 B params, the "E2B" trick) | INT8 per-row via a hand-rolled `Int8Embedding` module (see source repo's `scripts/_int8_embedding.py`) |
+| `embed_tokens_per_layer` (~2.35 B params, the "E2B" trick) | INT8 per-row via a custom `Int8Embedding` module (see source repo's `scripts/_int8_embedding.py`) |
 | `embed_tokens` (~0.4 B params) | FP32 — Gemma 4's model code performs direct weight slicing, which is incompatible with quantized tensor wrappers |
 | Layer norms, RoPE buffers, biases | FP32 |
 | Runtime K/V cache | FP32, externalized as program inputs/outputs (see source repo's `scripts/_external_cache.py`) |
@@ -133,7 +133,7 @@ If this artifact is useful in research, please cite both the original Gemma 4 re
   author = {bamb00boy and Gemma4_executorch_deployment contributors},
   year   = {2026},
   url    = {https://huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5},
-  note   = {Source recipe: https://github.com/bamb00boy/Gemma4_executorch_deployment}
+  note   = {Source repository: https://github.com/bamb00boy/Gemma4_executorch_deployment}
 }
 ```
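
The treatment table in this diff describes `embed_tokens_per_layer` as "INT8 per-row". As a rough illustration of what per-row INT8 quantization usually means, here is a dependency-free sketch. It assumes symmetric quantization with one float scale per embedding row; the function names `quantize_rows_int8` and `lookup` are hypothetical, and the actual `Int8Embedding` module in `scripts/_int8_embedding.py` may differ in scale layout, rounding, or dtype handling.

```python
def quantize_rows_int8(table):
    """Quantize each row to integers in [-127, 127] with its own scale.

    Assumption: symmetric scheme, scale = max(|row|) / 127.
    """
    q_rows, scales = [], []
    for row in table:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def lookup(q_rows, scales, token_id):
    """Dequantize a single row at lookup time: int8 value * row scale."""
    scale = scales[token_id]
    return [q * scale for q in q_rows[token_id]]


# Tiny stand-in for an embedding table (3 tokens, 3 dimensions).
table = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0], [2.0, 1.0, -2.0]]
q_rows, scales = quantize_rows_int8(table)
restored = lookup(q_rows, scales, 0)
```

Storing one scale per row keeps the rounding error of each embedding proportional to that row's own largest value, so rows with small magnitudes are not penalized by outliers elsewhere in the table.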