Correct compression ratio 6.7x -> 6.4x

da72278 verified 5 days ago

5.3 kB

	---
	license: mit
	base_model:
	- google/gemma-4-E2B-it
	base_model_relation: quantized
	library_name: mlx
	pipeline_tag: image-text-to-text
	tags:
	- mlx
	- gemma
	- gemma-4
	- edge
	- on-device
	- apple-silicon
	- quantization
	- gptq
	- aqlm
	- ple
	language:
	- en
	- multilingual
	---

	# TheStageAI/gemma-4-E2B-it

	A compressed, edge-ready variant of Google's Gemma 4 E2B (instruction-tuned), packaged for
	[MLX](https://github.com/ml-explore/mlx) on Apple Silicon Macs and iPhones. The checkpoint is
	roughly 7× smaller than the original (fits in ~1.4 GB) while preserving the capabilities that
	matter most for on-device assistants: general world knowledge, instruction following, and tool use.

	- Run it with: [`TheStageAI/edge-lm`](https://github.com/TheStageAI/edge-lm)
	- Base model: [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
	- Sibling release: [`TheStageAI/gemma-4-E4B-it`](https://huggingface.co/TheStageAI/gemma-4-E4B-it)
	- Write-up: 7× size reduction for Gemma 4 Edge models — Compressing PLE architectures.

	## Why this exists

	Gemma 4 E2B is a "2B" model by effective parameter count, but the dense checkpoint is closer to
	5.1B parameters once Per-Layer Embeddings (PLE) are counted — and in BF16 the PLE table alone is
	~4.7 GB. On mobile hardware, three things block deployment: download size, runtime memory footprint
	(iOS enforces a ~3 GB per-app budget), and generation speed. We compress the model along its natural
	structure to address all three at once.

	## How it was compressed

	- Transformer blocks — GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted
	as flat, MLX-compatible per-group weight-only tensors.
	- PLE tables — an AQLM-style vector-quantization codec with sensitivity-weighted (Fisher-style)
	assignments. The ~4.7 GB BF16 PLE compresses to ~0.26 GB (indices + codebooks), decompressed on
	the fly with a single batched gather across all layers.
	- Token embeddings / LM head — flat per-group scalar quantization matched to the same runtime contract.
	- Bit-width schedule — chosen per module by Riemannian Constrained Optimization (RCO) under an exact
	byte budget; the release checkpoint is re-quantized from the dense model in one consistent GPTQ/QEP pass.

	## Operating points

	This repo ships two release operating points, selected via the `size` argument:

	\| `size` \| Trade-off \| Compression \|
	\|---\|---\|---\|
	\| `l` \| More quality, larger artifact \| 5.62× \|
	\| `m` \| Smaller headline target (default) \| 6.40× \|

	It also includes optional 4-bit vision and audio towers for image understanding and audio transcription.

	## Usage

	```bash
	git clone https://github.com/TheStageAI/edge-lm.git
	pip install -e edge-lm
	```

	```python
	from edge_lm import load
	from mlx_vlm import stream_generate

	model, tokenizer = load("TheStageAI/gemma-4-E2B-it", size="m")

	prompt = tokenizer.apply_chat_template(
	[{"role": "user", "content": "Explain gravity in one sentence."}],
	tokenize=False, add_generation_prompt=True,
	)
	for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
	print(chunk.text, end="", flush=True)
	```

	Vision and audio (loads the optional towers):

	```python
	model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_vision=True) # image understanding
	model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_audio=True) # audio transcription
	```

	Only the files needed for the requested size are downloaded.

	## Benchmarks

	Every model — ours and the GGUF baselines — is dequantized to a standard BF16 checkpoint and served
	through vLLM, so the backend is equalized. We report MMLU-Pro (general knowledge), IFEval
	(instruction following), and τ²-Bench / Tau2 (multi-step tool use). For Tau2 the Gemma checkpoint
	acts as the agent while a fixed `Qwen3-235B-A22B-2507` simulates the user.

	\| Model \| Compression \| MMLU-Pro \| IFEval \| Tau2 (avg of 3) \|
	\|---\|---\|---\|---\|---\|
	\| BF16 (reference) \| 1.00× \| 61.85 \| 74.68 \| 30.67 \|
	\| Ours L \| 5.62× \| 54.48 \| 74.86 \| 22.20 \|
	\| Ours M \| 6.40× \| 49.85 \| 71.53 \| 23.45 \|
	\| Unsloth Q3-K-S \| 3.81× \| 48.20 \| 64.51 \| 18.69 \|
	\| Unsloth UD-Q2-K-XL \| 3.87× \| 43.17 \| 66.54 \| 20.23 \|

	Bold marks the best result among the compressed checkpoints in each column.

	## Files

	\| File \| Contents \|
	\|---\|---\|
	\| `config.json` \| Shared model config (architecture) \|
	\| `model_{s,m,l}.safetensors` \| Quantized decoder weights per operating point (quantization map in metadata) \|
	\| `ple_{s,m,l}.safetensors` \| Compact AQLM PLE codes + codebooks \|
	\| `vision_tower.safetensors` \| Optional 4-bit vision tower \|
	\| `audio_tower.safetensors` \| Optional 4-bit audio tower \|
	\| `tokenizer.json`, `tokenizer_config.json` \| Tokenizer \|

	## License

	Released under the [MIT License](https://github.com/TheStageAI/edge-lm/blob/main/LICENSE),
	© 2025 thestage.ai labs. As a derivative of Google's Gemma 4, the weights are additionally subject
	to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).

	## Citation

	If you use these checkpoints, please cite the Gemma 4 release and the methods we build on
	(GPTQ, QEP, AQLM, RCO) — see the references in the [edge-lm](https://github.com/TheStageAI/edge-lm)
	write-up.