Gemma 4 E2B-it — MLX 4-bit Quantized

MLX-compatible 4-bit quantized version of google/gemma-4-E2B-it, converted for Apple Silicon inference via mlx-lm.

  • Original model: google/gemma-4-E2B-it (9.6 GB bf16)
  • Quantized size: 2.61 GB (4-bit, group_size=64)
  • Bits per weight: 6.399
  • Performance: ~35 tokens/sec on Apple Silicon
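The gap between the 4-bit setting and the 6.399 bits-per-weight average can be sanity-checked with quick arithmetic. This sketch assumes MLX's affine quantization layout (a 16-bit scale and 16-bit bias per group), which is an assumption, not something stated in this card:

```python
# Back-of-the-envelope cost of 4-bit quantization with group_size=64,
# assuming a 16-bit scale and 16-bit bias stored per group (MLX-style
# affine quantization; an assumption for illustration).
bits, group_size = 4, 64
overhead = (16 + 16) / group_size   # scale + bias amortized per weight
bpw_quantized = bits + overhead     # 4.5 bits per quantized weight
assert bpw_quantized == 4.5
# The reported 6.399 bpw average is higher than 4.5, consistent with some
# tensors (e.g. embeddings or norms) being kept at higher precision.
```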

Usage

Note: Requires gemma4.py model support in mlx-lm. See PR #1095, or manually place gemma4.py in your mlx_lm/models/ directory.

from mlx_lm import load, generate

model, tokenizer = load("avinashmohan/gemma-4-E2B-it-4bit-mlx")

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)
# The capital of France is **Paris**.

Architecture highlights

Gemma 4 introduces several features over Gemma 3:

  • Per-Layer Embeddings (PLE): each decoder layer receives its own token embedding, gated and projected into the residual stream
  • KV cache sharing: the last 20 of 35 layers reuse KV caches from earlier layers, reducing memory
  • ProportionalRoPE: global attention applies rotary embeddings to only 25% of the head dimensions; the remaining dimensions pass through unrotated
  • Heterogeneous head dims: sliding attention (head_dim=256) vs global attention (global_head_dim=512)
  • v_norm: RMS normalization on value states (without learned scale)
  • RMSNorm without +1 offset: unlike Gemma 3's (1 + weight), Gemma 4 uses plain weight
  • Attention scale = 1.0: QK-norm replaces traditional 1/sqrt(d) scaling
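The partial-rotation idea behind ProportionalRoPE can be sketched in a few lines. This is a minimal NumPy illustration of rotating only the first 25% of head dimensions, assuming the common half-split rotation layout; the function name and details are illustrative, not mlx-lm's actual implementation:

```python
import numpy as np

def partial_rope(x, positions, rope_fraction=0.25, base=10000.0):
    """Rotate only the first `rope_fraction` of head dims; pass the rest through.

    x: (seq_len, head_dim) query or key states; positions: (seq_len,) token positions.
    Illustrative sketch only -- not mlx-lm's gemma4 code.
    """
    head_dim = x.shape[-1]
    rot_dims = int(head_dim * rope_fraction)          # dims that get rotated
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]

    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)         # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]      # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)

    # Standard half-split rotation on the rotated slice
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    # Unrotated dims are concatenated back unchanged
    return np.concatenate([rotated, x_pass], axis=-1)

# With global_head_dim=512, only the first 128 dims are rotated
q = np.random.randn(4, 512).astype(np.float32)
pos = np.arange(4, dtype=np.float32)
out = partial_rope(q, pos)
assert out.shape == q.shape
assert np.allclose(out[:, 128:], q[:, 128:])   # pass-through dims untouched
```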

Conversion

Converted using a custom script that reads raw bf16 bytes via memory-mapped I/O, splits the oversized PLE embedding tensor (4.7 GB, exceeds Metal's 4 GB buffer limit) into per-layer chunks, and quantizes via mlx.nn.quantize.
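The chunk-splitting step above can be illustrated with a small sketch. This is not the actual conversion script; it just shows the pattern of memory-mapping a large stacked embedding tensor and materializing one independently-loadable array per layer, so no single buffer has to exceed a size cap (Metal caps buffers at 4 GB; the demo uses a tiny tensor):

```python
import numpy as np, tempfile, os

def split_ple_embedding(path, num_layers, rows, cols, dtype=np.float16):
    """Memory-map a (num_layers, rows, cols) tensor and split it per layer.

    Illustrative sketch of the conversion approach, not the real script.
    """
    big = np.memmap(path, dtype=dtype, mode="r", shape=(num_layers, rows, cols))
    # Copy each layer slice out of the mmap into its own small array
    return [np.array(big[i]) for i in range(num_layers)]

# Demo with a small fake tensor written to disk
num_layers, rows, cols = 3, 8, 4
data = np.arange(num_layers * rows * cols, dtype=np.float16).reshape(num_layers, rows, cols)
with tempfile.NamedTemporaryFile(delete=False) as f:
    data.tofile(f)
chunks = split_ple_embedding(f.name, num_layers, rows, cols)
os.unlink(f.name)
assert len(chunks) == num_layers
assert np.array_equal(chunks[2], data[2])
```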
