# Gemma 4 E2B-it — MLX 4-bit Quantized
MLX-compatible 4-bit quantized version of google/gemma-4-E2B-it, converted for Apple Silicon inference via mlx-lm.
- Original model: google/gemma-4-E2B-it (9.6 GB bf16)
- Quantized size: 2.61 GB (4-bit, group_size=64)
- Bits per weight: 6.399
- Performance: ~35 tokens/sec on Apple Silicon
## Usage
Note: requires `gemma4.py` model support in mlx-lm. See PR #1095, or manually place `gemma4.py` in your `mlx_lm/models/` directory.
```python
from mlx_lm import load, generate

model, tokenizer = load("avinashmohan/gemma-4-E2B-it-4bit-mlx")

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)
# The capital of France is **Paris**.
```
## Architecture highlights
Gemma 4 introduces several features over Gemma 3:
- Per-Layer Embeddings (PLE): each decoder layer receives its own token embedding, gated and projected into the residual stream
- KV cache sharing: the last 20 of 35 layers reuse KV caches from earlier layers, reducing memory
- ProportionalRoPE: global attention applies rotary embedding to only 25% of head dimensions; the remaining dimensions pass through unrotated
- Heterogeneous head dims: sliding attention uses `head_dim=256`, global attention uses `global_head_dim=512`
- v_norm: RMS normalization applied to value states (without a learned scale)
- RMSNorm without +1 offset: unlike Gemma 3's `(1 + weight)`, Gemma 4 uses plain `weight`
- Attention scale = 1.0: QK-norm replaces the traditional `1/sqrt(d)` scaling
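The per-layer embedding idea above can be sketched conceptually. This is a toy NumPy stand-in, not the actual mlx-lm implementation: the dimensions, the name `apply_ple`, and the SiLU-style gate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, not Gemma 4's real dimensions.
hidden, ple_dim, vocab = 64, 16, 100

# One small embedding table owned by a single decoder layer (illustrative).
per_layer_embed = rng.normal(size=(vocab, ple_dim)).astype(np.float32)
# Projection from the per-layer embedding dim into the residual stream.
proj = rng.normal(size=(ple_dim, hidden)).astype(np.float32)

def apply_ple(residual: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Gate a layer's own token embedding and project it into the residual stream."""
    ple = per_layer_embed[token_ids]             # (seq, ple_dim)
    gated = ple * (1.0 / (1.0 + np.exp(-ple)))   # SiLU-style gate (an assumption)
    return residual + gated @ proj               # added into the residual stream

tokens = np.array([3, 17, 42])
residual = np.zeros((3, hidden), dtype=np.float32)
out = apply_ple(residual, tokens)
print(out.shape)  # (3, 64)
```

The point of the mechanism is that each layer looks up its *own* table rather than sharing one input embedding, which is also why the combined PLE tensor becomes so large (see the conversion notes below).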
## Conversion
Converted using a custom script that reads raw bf16 bytes via memory-mapped I/O, splits the oversized PLE embedding tensor (4.7 GB, exceeding Metal's 4 GB buffer limit) into per-layer chunks, and quantizes via `mlx.nn.quantize`.
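The per-layer chunking step can be illustrated as follows. This is a minimal sketch, not the actual conversion script: the sizes are toy values, the buffer limit is scaled down from Metal's real 4 GB, the key naming is invented, and the memory-mapped read and quantization steps are omitted.

```python
import numpy as np

BUFFER_LIMIT = 1 << 20  # stand-in for Metal's 4 GB buffer limit (scaled down)

# Toy stand-in for the oversized combined PLE embedding tensor.
n_layers, vocab, ple_dim = 8, 4096, 64
big = np.zeros((n_layers, vocab, ple_dim), dtype=np.float32)

def split_per_layer(tensor: np.ndarray, limit: int) -> dict[str, np.ndarray]:
    """Split a (layers, ...) tensor into per-layer arrays that each fit the limit."""
    chunks = {}
    for i, layer_slice in enumerate(tensor):
        assert layer_slice.nbytes <= limit, "per-layer chunk still exceeds the limit"
        # Hypothetical key naming; the real checkpoint keys may differ.
        chunks[f"ple.layers.{i}.embedding"] = np.ascontiguousarray(layer_slice)
    return chunks

chunks = split_per_layer(big, BUFFER_LIMIT)
print(len(chunks), chunks["ple.layers.0.embedding"].shape)  # 8 (4096, 64)
```

Each resulting chunk can then be quantized and saved as its own tensor, so no single Metal buffer ever has to hold the full 4.7 GB table.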