Gemma4-E2B-W4A16

INT4 post-training quantization of google/gemma-4-e2b-it β€” Google's 2B-active multimodal MoE with 128 experts. The smallest capable multimodal MoE. Runs on any 8 GB GPU.

Quantized with AutoRound (iters=200, SignSGD rounding optimization) via llm-compressor. The LLM backbone is W4A16; vision tower, projector, and PLE layers remain BF16.


At a Glance

Property Value
Base model google/gemma-4-e2b-it
Architecture Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant method AutoRound, iters=200
Quant scheme W4A16 (4-bit weights, 16-bit activations)
Quant format compressed-tensors (native vLLM)
Quantized language_model.* β€” all Linear layers (attn + MLP)
Kept BF16 vision_tower, audio_tower, multi_modal_projector, embed_tokens_per_layer (PLE), per_layer_model_projection (PLE), lm_head, norms, embeddings
Disk size ~6 GB
Min GPU 1Γ— RTX 3080 10GB / RTX 4070

PLE layers kept at BF16

embed_tokens_per_layer and per_layer_model_projection implement Per-Layer Embeddings β€” ablations show catastrophic output degradation if quantized. Always excluded.


Memory Requirements

Configuration BF16 This Quant (W4A16)
Weights (disk/VRAM) ~14 GB ~6 GB
KV cache @ 32k ctx (fp8) ~1.0 GB ~1.0 GB
Total @ 32k ctx ~15 GB ~7 GB
Minimum GPU RTX 3090 24GB 1Γ— 8GB GPU (RTX 3080 10GB / RTX 4070)

The smallest capable multimodal MoE: W4A16 fits the model on any modern 8 GB GPU with room for KV cache.


Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format β€” vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Gemma4-E2B-W4A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format β€” no --quantization flag needed.

Python client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")

response = client.chat.completions.create(
    model="88plug/Gemma4-E2B-W4A16",
    messages=[{"role": "user", "content": "Explain sparse mixture-of-experts in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Quantization Design

The recipe targets all Linear modules in the LLM backbone with W4A16 (4-bit symmetric weight quantization, activations remain BF16). The following are excluded and kept at BF16:

Excluded pattern Reason
lm_head Output projection β€” quality-sensitive
.*embed_tokens$ Token embeddings
.*norm$ Layer norms
.*embed_tokens_per_layer.* PLE: per-layer token embeddings β€” catastrophic if quantized
.*per_layer_model_projection.* PLE: projection into hidden dim β€” catastrophic if quantized
.*vision_tower.* SigLIP vision encoder β€” multimodal quality
.*audio_tower.* Audio encoder β€” multimodal quality
.*multi_modal_projector.* Cross-modal projector

All self_attn.{q,k,v,o}_proj and mlp.{gate,up,down}_proj layers across all transformer blocks are quantized to W4A16.

Calibration: 1024 samples β€” 512 from HuggingFaceH4/ultrachat_200k (chat) + 512 from wikitext-103-raw-v1 (text), max sequence length 2048.


Competitor Comparables

Model Source Format Compare angle
google/gemma-4-e2b-it official BF16 quality ceiling
RedHatAI/gemma-3n-E4B-it-quantized.w4a16 RedHatAI compressed-tensors W4A16 same format, prior generation, larger model
88plug/Gemma4-E2B-W8A16 88plug compressed-tensors W8A16 higher precision variant

First-to-market note: No compressed-tensors or vLLM-native quant found for gemma-4-e2b-it at release time. This is the first W4A16 for Gemma4 E2B.


Benchmarks

Results pending.

Engine Format Batch ctx tok/s TTFT p50 TTFT p99 VRAM
vLLM v0.21.0 W4A16 compressed-tensors 1 32k β€” β€” β€” β€”
vLLM v0.21.0 W4A16 compressed-tensors 8 32k β€” β€” β€” β€”
SGLang v0.5.8 BF16 (baseline) 1 32k β€” β€” β€” β€”
llama.cpp b9297 Q8_0 GGUF 1 32k β€” β€” β€” β€”
llama.cpp b9297 IQ4_XS GGUF 1 32k β€” β€” β€” β€”

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


Quality Targets

Metric Target
KL divergence vs BF16 < 0.014
MMLU recovery β‰₯ 99%

SGLang Note

SGLang does not natively support compressed-tensors weights. To use SGLang, run the BF16 base model (google/gemma-4-e2b-it) directly:

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path google/gemma-4-e2b-it \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

SGLang benchmark results above reflect BF16 baseline throughput, not this quant.


llama.cpp / GGUF

Fits entirely on an 8 GB GPU with Q4 quantization. Convert from the BF16 base checkpoint β€” not from compressed-tensors weights. VLM requires a separate mmproj GGUF for image input.

python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --mmproj --outfile Gemma4-E2B-mmproj.gguf

llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E2B-Q8_0.gguf \
  --mmproj Gemma4-E2B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Citation

@misc{gemma4report,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-4-e2b-it}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models β€” built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 β€” INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 β€” AutoRound with iters=200 and a mixed calibration corpus. Targets β‰₯ 99% MMLU recovery β€” the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Gemma4-E2B-it-W8A16 (INT8, ~7 GB) Β· Gemma4-E2B-it-W4A16 (INT4, ~6 GB)

Browse all releases β†’ huggingface.co/88plug

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Evaluation results