---
license: mit
base_model:
- google/gemma-4-E2B-it
base_model_relation: quantized
library_name: mlx
pipeline_tag: image-text-to-text
tags:
- mlx
- gemma
- gemma-4
- edge
- on-device
- apple-silicon
- quantization
- gptq
- aqlm
- ple
language:
- en
- multilingual
---
# TheStageAI/gemma-4-E2B-it
A compressed, edge-ready variant of Google's **Gemma 4 E2B (instruction-tuned)**, packaged for
[MLX](https://github.com/ml-explore/mlx) on Apple Silicon Macs and iPhones. The default checkpoint is
about **6.4× smaller** than the original (fits in **~1.4 GB**) while preserving the capabilities that
matter most for on-device assistants: general world knowledge, instruction following, and tool use.
- **Run it with:** [`TheStageAI/edge-lm`](https://github.com/TheStageAI/edge-lm)
- **Base model:** [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
- **Sibling release:** [`TheStageAI/gemma-4-E4B-it`](https://huggingface.co/TheStageAI/gemma-4-E4B-it)
- **Write-up:** *7× size reduction for Gemma 4 Edge models — Compressing PLE architectures.*
## Why this exists
Gemma 4 E2B is a "2B" model by *effective* parameter count, but the dense checkpoint is closer to
**5.1B** parameters once Per-Layer Embeddings (PLE) are counted — and in BF16 the PLE table alone is
~4.7 GB. On mobile hardware, three things block deployment: download size, runtime memory footprint
(iOS enforces a ~3 GB per-app budget), and generation speed. We compress the model along its natural
structure to address all three at once.
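
The size figures above follow from simple arithmetic, sketched here as a sanity check. It assumes decimal gigabytes (1 GB = 10^9 bytes) and 2 bytes per BF16 value; the variable names are illustrative and not part of `edge-lm`.

```python
BYTES_PER_BF16 = 2  # BF16 stores each value in 16 bits


def bf16_size_gb(num_params: float) -> float:
    """Size in decimal gigabytes of `num_params` values stored in BF16."""
    return num_params * BYTES_PER_BF16 / 1e9


total_params = 5.1e9   # dense parameter count quoted above (PLE included)
ple_gb = 4.7           # BF16 PLE table size quoted above
ios_budget_gb = 3.0    # approximate per-app memory budget quoted above

dense_gb = bf16_size_gb(total_params)
ple_params = ple_gb * 1e9 / BYTES_PER_BF16
ple_share = ple_gb / dense_gb

print(f"dense BF16 checkpoint: {dense_gb:.1f} GB")
print(f"PLE parameters:        {ple_params / 1e9:.2f}B")
print(f"PLE share of bytes:    {ple_share:.0%}")
print(f"fits iOS budget:       {dense_gb <= ios_budget_gb}")
```

Under these assumptions the PLE table accounts for a little under half of the dense checkpoint, which is why it gets its own codec.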
## How it was compressed
- **Transformer blocks** — GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted
as flat, MLX-compatible per-group weight-only tensors.
- **PLE tables** — an AQLM-style vector-quantization codec with sensitivity-weighted (Fisher-style)
assignments. The ~4.7 GB BF16 PLE compresses to **~0.26 GB** (indices + codebooks), decompressed on
the fly with a single batched gather across all layers.
- **Token embeddings / LM head** — flat per-group scalar quantization matched to the same runtime contract.
- **Bit-width schedule** — chosen per module by Riemannian Constrained Optimization (RCO) under an exact
byte budget; the release checkpoint is re-quantized from the dense model in one consistent GPTQ/QEP pass.
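
Two of the building blocks above can be illustrated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the release pipeline: the scalar quantizer uses plain round-to-nearest (not GPTQ or QEP), the vector-quantization decoder assumes an additive multi-codebook layout, and all function names, shapes, bit-widths, and group sizes are hypothetical.

```python
import numpy as np


def quantize_per_group(w, bits=4, group_size=64):
    """Asymmetric round-to-nearest quantization, one scale and offset per group."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = np.maximum(hi - lo, 1e-8) / levels  # guard against constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo


def dequantize_per_group(q, scale, lo, shape):
    """Invert quantize_per_group up to rounding error."""
    return (q.astype(np.float32) * scale + lo).reshape(shape)


def decode_additive_vq(codes, codebooks):
    """Decode rows as a sum of one vector per codebook, using a single gather.

    codes:     (rows, num_codebooks) integer indices
    codebooks: (num_codebooks, codebook_size, dim) float vectors
    returns:   (rows, dim)
    """
    book_ids = np.arange(codebooks.shape[0])
    picked = codebooks[book_ids, codes]  # (rows, num_codebooks, dim)
    return picked.sum(axis=1)


rng = np.random.default_rng(0)

# Round-trip a toy weight matrix through 4-bit per-group quantization.
w = rng.standard_normal((8, 128)).astype(np.float32)
q, scale, lo = quantize_per_group(w, bits=4, group_size=64)
w_hat = dequantize_per_group(q, scale, lo, w.shape)
max_err = np.abs(w - w_hat).max()

# Decode two embedding rows from two small codebooks.
codebooks = rng.standard_normal((2, 4, 3)).astype(np.float32)
codes = np.array([[0, 1], [3, 2]])
rows = decode_additive_vq(codes, codebooks)
```

With round-to-nearest, the per-element error is bounded by half a quantization step in each group. For the PLE figures quoted above, 4.7 GB compressed to 0.26 GB corresponds to roughly an 18× reduction for that table alone.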
## Operating points
This repo ships two release operating points, selected via the `size` argument:
| `size` | Trade-off | Compression |
|---|---|---|
| `l` | More quality, larger artifact | 5.62× |
| `m` | Smaller artifact, the headline compression target (**default**) | **6.40×** |
It also includes optional 4-bit vision and audio towers for image understanding and audio transcription.
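
To compare the two operating points directly, the ratios in the table can be turned into a relative artifact size. This sketch assumes both ratios are measured against the same uncompressed baseline; the helper name is illustrative.

```python
# Compression ratios copied from the operating-point table.
compression = {"l": 5.62, "m": 6.40}


def relative_size(size_a: str, size_b: str) -> float:
    """Artifact size of `size_a` divided by that of `size_b`."""
    # size = baseline / ratio, so the shared baseline cancels out.
    return compression[size_b] / compression[size_a]


l_vs_m = relative_size("l", "m")
print(f"`l` is about {100 * (l_vs_m - 1):.0f}% larger than `m`")
```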
## Usage
```bash
git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm
```
```python
from edge_lm import load
from mlx_vlm import stream_generate
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", size="m")
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": "Explain gravity in one sentence."}],
tokenize=False, add_generation_prompt=True,
)
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
print(chunk.text, end="", flush=True)
```
Vision and audio (loads the optional towers):
```python
from edge_lm import load

# Independent examples: the second call rebinds `model` and `tokenizer`.
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_vision=True)  # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_audio=True)  # audio transcription
```
Only the files needed for the requested size are downloaded.
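
As a rough illustration of that selection, the sketch below maps the `load` arguments to the file names listed in this README's Files table. The `files_for` helper is hypothetical and is not part of `edge-lm`; the actual loader may select files differently.

```python
def files_for(size="m", include_vision=False, include_audio=False):
    """Return the repo files a given configuration would need (illustrative)."""
    if size not in {"s", "m", "l"}:  # suffixes that appear in the Files table
        raise ValueError(f"unknown size: {size!r}")
    files = [
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        f"model_{size}.safetensors",  # quantized decoder weights
        f"ple_{size}.safetensors",    # compact PLE codes and codebooks
    ]
    if include_vision:
        files.append("vision_tower.safetensors")
    if include_audio:
        files.append("audio_tower.safetensors")
    return files


print(files_for("m"))
print(files_for("l", include_vision=True))
```

A list like this could be passed as `allow_patterns` to `huggingface_hub.snapshot_download` to fetch the same subset by hand.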
## Benchmarks
Every model — ours and the GGUF baselines — is dequantized to a standard BF16 checkpoint and served
through vLLM, so the backend is equalized. We report **MMLU-Pro** (general knowledge), **IFEval**
(instruction following), and **τ²-Bench / Tau2** (multi-step tool use). For Tau2 the Gemma checkpoint
acts as the agent while a fixed `Qwen3-235B-A22B-2507` simulates the user.
| Model | Compression | MMLU-Pro | IFEval | Tau2 (avg of 3) |
|---|---|---|---|---|
| BF16 (reference) | 1.00× | 61.85 | 74.68 | 30.67 |
| **Ours L** | 5.62× | **54.48** | **74.86** | 22.20 |
| **Ours M** | **6.40×** | 49.85 | 71.53 | **23.45** |
| Unsloth Q3-K-S | 3.81× | 48.20 | 64.51 | 18.69 |
| Unsloth UD-Q2-K-XL | 3.87× | 43.17 | 66.54 | 20.23 |
Bold marks the best result among the compressed checkpoints in each column.
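
One way to read the table is as the fraction of the BF16 score each compressed checkpoint retains. The sketch below computes that from the numbers above; the `retention` helper is illustrative and not part of the evaluation harness.

```python
# Scores copied from the benchmark table above.
scores = {
    "BF16 (reference)": {"MMLU-Pro": 61.85, "IFEval": 74.68, "Tau2": 30.67},
    "Ours L": {"MMLU-Pro": 54.48, "IFEval": 74.86, "Tau2": 22.20},
    "Ours M": {"MMLU-Pro": 49.85, "IFEval": 71.53, "Tau2": 23.45},
    "Unsloth Q3-K-S": {"MMLU-Pro": 48.20, "IFEval": 64.51, "Tau2": 18.69},
    "Unsloth UD-Q2-K-XL": {"MMLU-Pro": 43.17, "IFEval": 66.54, "Tau2": 20.23},
}


def retention(table, reference="BF16 (reference)"):
    """Score of each compressed model divided by the reference score."""
    ref = table[reference]
    return {
        name: {bench: row[bench] / ref[bench] for bench in ref}
        for name, row in table.items()
        if name != reference
    }


retained = retention(scores)
for name, row in retained.items():
    cells = ", ".join(f"{bench} {frac:.1%}" for bench, frac in row.items())
    print(f"{name}: {cells}")
```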
## Files
| File | Contents |
|---|---|
| `config.json` | Shared model config (architecture) |
| `model_{s,m,l}.safetensors` | Quantized decoder weights per operating point (quantization map in metadata) |
| `ple_{s,m,l}.safetensors` | Compact AQLM PLE codes + codebooks |
| `vision_tower.safetensors` | Optional 4-bit vision tower |
| `audio_tower.safetensors` | Optional 4-bit audio tower |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer |
## License
Released under the [MIT License](https://github.com/TheStageAI/edge-lm/blob/main/LICENSE),
© 2025 thestage.ai labs. As a derivative of Google's Gemma 4, the weights are additionally subject
to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
## Citation
If you use these checkpoints, please cite the Gemma 4 release and the methods we build on
(GPTQ, QEP, AQLM, RCO) — see the references in the [edge-lm](https://github.com/TheStageAI/edge-lm)
write-up.