Commit 6de8bd2 by bamb00boy (verified) · 1 parent: 440ab62

Update model card

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -17,9 +17,9 @@ pipeline_tag: text-generation
 
  INT4-quantized, ExecuTorch-lowered `.pte` of [`google/gemma-4-e2b-it`](https://huggingface.co/google/gemma-4-e2b-it), packaged for **Raspberry Pi 5 (Cortex-A76, 8 GB)** deployment via the [ExecuTorch](https://pytorch.org/executorch) 1.2.0 Python runtime with the XNNPACK backend.
 
- This artifact is the deployable output of the full export → quantize → lower → runtime recipe documented at:
+ This artifact is the deployable output of the full export → quantize → lower → runtime pipeline documented at:
 
- **Source code & full recipe:** https://github.com/bamb00boy/Gemma4_executorch_deployment
+ **Source code & documentation:** https://github.com/bamb00boy/Gemma4_executorch_deployment
 
  ## Contents
 
@@ -89,7 +89,7 @@ x86_64 Linux is expected to work (XNNPACK supports it) but is untested by this p
  | Component | Treatment |
  |---|---|
  | `nn.Linear` weights (~3.1 B params) | INT4 weight-only via torchao's `Int8DynamicActivationIntxWeightConfig` (stored unpacked as INT8 bytes on disk) |
- | `embed_tokens_per_layer` (~2.35 B params, the "E2B" trick) | INT8 per-row via a hand-rolled `Int8Embedding` module (see source repo's `scripts/_int8_embedding.py`) |
+ | `embed_tokens_per_layer` (~2.35 B params, the "E2B" trick) | INT8 per-row via a custom `Int8Embedding` module (see source repo's `scripts/_int8_embedding.py`) |
  | `embed_tokens` (~0.4 B params) | FP32 — Gemma 4's model code performs direct weight slicing, which is incompatible with quantized tensor wrappers |
  | Layer norms, RoPE buffers, biases | FP32 |
  | Runtime K/V cache | FP32, externalized as program inputs/outputs (see source repo's `scripts/_external_cache.py`) |
@@ -133,7 +133,7 @@ If this artifact is useful in research, please cite both the original Gemma 4 re
  author = {bamb00boy and Gemma4_executorch_deployment contributors},
  year = {2026},
  url = {https://huggingface.co/bamb00boy/gemma4-e2b-int4-executorch-pi5},
- note = {Source recipe: https://github.com/bamb00boy/Gemma4_executorch_deployment}
+ note = {Source repository: https://github.com/bamb00boy/Gemma4_executorch_deployment}
  }
  ```
 