---
license: mit
base_model:
- google/gemma-4-E4B-it
base_model_relation: quantized
library_name: mlx
pipeline_tag: image-text-to-text
tags:
- mlx
- gemma
- gemma-4
- edge
- on-device
- apple-silicon
- quantization
- gptq
- aqlm
- ple
language:
- en
- multilingual
---

# TheStageAI/gemma-4-E4B-it

A compressed, edge-ready variant of Google's **Gemma 4 E4B (instruction-tuned)**, packaged for [MLX](https://github.com/ml-explore/mlx) on Apple Silicon Macs and iPhones. The checkpoint fits in **~2.6 GB**, small enough to download quickly and stay within mobile memory budgets, while preserving the capabilities that matter most for on-device assistants: general world knowledge, instruction following, and tool use.

- **Run it with:** [`TheStageAI/edge-lm`](https://github.com/TheStageAI/edge-lm)
- **Base model:** [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it)
- **Sibling release:** [`TheStageAI/gemma-4-E2B-it`](https://huggingface.co/TheStageAI/gemma-4-E2B-it)
- **Write-up:** *7× size reduction for Gemma 4 Edge models — Compressing PLE architectures.*

## Why this exists

Gemma 4 E4B is a "4B" model by *effective* parameter count, but the dense checkpoint is closer to **8B** parameters once Per-Layer Embeddings (PLE) are counted, and in BF16 the PLE table dominates the footprint. On mobile hardware, three things block deployment: download size, runtime memory footprint (iOS enforces a ~3 GB per-app budget), and generation speed. We compress the model along its natural structure to address all three at once.

## How it was compressed

- **Transformer blocks:** GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted as flat, MLX-compatible per-group weight-only tensors.
- **PLE tables:** an AQLM-style vector-quantization codec with sensitivity-weighted (Fisher-style) assignments, decompressed on the fly with a single batched gather across all layers.
- **Token embeddings / LM head:** flat per-group scalar quantization matched to the same runtime contract.
- **Bit-width schedule:** chosen per module by Riemannian Constrained Optimization (RCO) under an exact byte budget; the release checkpoint is re-quantized from the dense model in one consistent GPTQ/QEP pass.

## Operating points

This repo ships two release operating points, selected via the `size` argument:

| `size` | Trade-off | Compression |
|---|---|---|
| `l` | More quality, larger artifact | 4.64× |
| `m` | Smaller headline target (**default**) | **5.60×** |

The repo also includes optional 4-bit vision and audio towers for image understanding and audio transcription.

## Usage

```bash
git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm
```

```python
from edge_lm import load
from mlx_vlm import stream_generate

model, tokenizer = load("TheStageAI/gemma-4-E4B-it", size="l")  # use "m" for the smaller target

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain gravity in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)

for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)
```

Vision and audio (loads the optional towers):

```python
model, tokenizer = load("TheStageAI/gemma-4-E4B-it", include_vision=True)  # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E4B-it", include_audio=True)   # audio transcription
```

Only the files needed for the requested size are downloaded.

## Benchmarks

Every model, ours and the GGUF baselines alike, is dequantized to a standard BF16 checkpoint and served through vLLM, so the backend is equalized. We report **MMLU-Pro** (general knowledge), **IFEval** (instruction following), and **τ²-Bench / Tau2** (multi-step tool use). For Tau2 the Gemma checkpoint acts as the agent while a fixed `Qwen3-235B-A22B-2507` simulates the user.


| Model | Compression | MMLU-Pro | IFEval | Tau2 |
|---|---|---|---|---|
| BF16 (reference) | 1.00× | 70.49 | 81.33 | 37.19 |
| **Ours L** | 4.64× | **67.41** | **81.52** | **33.25** |
| **Ours M** | **5.60×** | 63.54 | 80.78 | 29.04 |
| Unsloth Q3-K-S | 3.90× | 63.66 | 77.08 | 30.47 |
| Unsloth UD-Q2-K-XL | 4.01× | 58.69 | 79.67 | 22.91 |

Bold marks the best result among the compressed checkpoints in each column.

## Files

| File | Contents |
|---|---|
| `config.json` | Shared model config (architecture) |
| `model_{s,m,l}.safetensors` | Quantized decoder weights per operating point (quantization map in metadata) |
| `ple_{s,m,l}.safetensors` | Compact AQLM PLE codes + codebooks |
| `vision_tower.safetensors` | Optional 4-bit vision tower |
| `audio_tower.safetensors` | Optional 4-bit audio tower |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer |

## License

Released under the [MIT License](https://github.com/TheStageAI/edge-lm/blob/main/LICENSE), © 2025 thestage.ai labs. As a derivative of Google's Gemma 4, the weights are additionally subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).

## Citation

If you use these checkpoints, please cite the Gemma 4 release and the methods we build on (GPTQ, QEP, AQLM, RCO); see the references in the [edge-lm](https://github.com/TheStageAI/edge-lm) write-up.
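
## Appendix: reading the benchmark table

As a reading aid for the Benchmarks section, the sketch below expresses each compressed checkpoint's score as a percentage of the BF16 reference. The scores are copied from the benchmark table; the `retention` helper, the constant names, and the dictionary layout are illustrative assumptions and are not part of `edge-lm`.

```python
# Scores copied from the benchmark table in this card.
# The helper and data layout below are hypothetical, for illustration only.
BENCHMARKS = ("MMLU-Pro", "IFEval", "Tau2")

REFERENCE = (70.49, 81.33, 37.19)  # BF16 (reference) row

COMPRESSED = {
    "Ours L": (67.41, 81.52, 33.25),
    "Ours M": (63.54, 80.78, 29.04),
    "Unsloth Q3-K-S": (63.66, 77.08, 30.47),
    "Unsloth UD-Q2-K-XL": (58.69, 79.67, 22.91),
}


def retention(scores, reference=REFERENCE):
    """Return each score as a percentage of the matching reference score."""
    return {
        name: 100.0 * score / ref
        for name, score, ref in zip(BENCHMARKS, scores, reference)
    }


retained = {model: retention(scores) for model, scores in COMPRESSED.items()}

# Print one row per compressed checkpoint.
for model, row in retained.items():
    cells = "  ".join(f"{name}: {pct:6.2f}%" for name, pct in row.items())
    print(f"{model:<20} {cells}")
```

Retention is only a coarse summary: it ignores run-to-run variance, so small gaps between checkpoints (including values slightly above 100%) should not be over-read.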