Gemma4-E2B-W8A16

INT8 post-training quantization of google/gemma-4-e2b-it — Google's 2B-active multimodal MoE with 128 experts. The smallest capable multimodal MoE. Runs on any 8 GB GPU.


At a Glance

Property Value
Base model google/gemma-4-e2b-it
Architecture Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant format compressed-tensors (native vLLM)
Quant method AutoRound W8A16 (RTN, datafree)
Quantized language_model.* transformer layers
Kept BF16 vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE)
Min GPU 1× RTX 3080 10GB / RTX 4070

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Gemma4-E2B-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0.

SGLang

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path google/gemma-4-e2b-it \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp

Fits entirely on an 8 GB GPU with Q4 quantization. VLM requires mmproj GGUF for image input.

python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --mmproj --outfile Gemma4-E2B-mmproj.gguf

llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E2B-Q8_0.gguf \
  --mmproj Gemma4-E2B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Results pending.

Engine Format Batch ctx tok/s TTFT p50 TTFT p99 VRAM
vLLM v0.21.0 W8A16 1 32k
vLLM v0.21.0 W8A16 8 32k
SGLang v0.5.8 BF16 (baseline) 1 32k
llama.cpp b9297 Q8_0 GGUF 1 32k
llama.cpp b9297 IQ4_XS GGUF 1 32k

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


Quality Targets

Metric Target
KL divergence vs BF16 < 0.005
MMLU recovery ≥ 99.7%

vs. Other Gemma4-E2B Quants

This is the first compressed-tensors W8A16 checkpoint for Gemma4-E2B. At ~2.5 GB it is the smallest vLLM-native multimodal checkpoint that fits on consumer 8 GB GPUs.

Quant Method Size GPU Compatibility Notes
88plug W8A16 (this) compressed-tensors RTN W8A16 ~2.5 GB Any Ampere+ ≥8 GB First W8A16; native vLLM; vision+text
BF16 baseline None ~4.5 GB 1× RTX 3080 10GB Reference
Community GGUF Q4_K_M llama.cpp GGUF ~2.5 GB CPU / any GPU Vision requires mmproj GGUF
Community GGUF Q8_0 llama.cpp GGUF ~4.5 GB Any GPU ≥6 GB Near-lossless; vision requires mmproj

Limitations

  • Vision tower excluded: SigLIP vision encoder stays BF16 — RTN INT8 not applied to vision components.
  • PLE layers excluded: embed_tokens_per_layer and per_layer_model_projection (Per-Layer Embeddings) kept at BF16 to prevent catastrophic quality loss.
  • RTN (data-free) quantization: No calibration corpus used. W8A16 RTN is near-lossless but has not been AutoRound-calibrated.
  • Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

@misc{gemma4report,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-4-e2b-it}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Gemma4-E2B-it-W4A16 (INT4, ~6 GB) · Gemma4-E2B-it-W8A16 (INT8, ~7 GB)

Browse all releases → huggingface.co/88plug

Downloads last month
-
Safetensors
Model size
6B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Evaluation results