# Gemma 4 E2B IT — Q8 GGUF

8-bit quantized GGUF version of google/gemma-4-e2b-it.
The highest-quality quantization in this series: 96% top-1 agreement with the F16 baseline, effectively lossless.

Other quantizations in this series:
Q2_K · Q3_K_S · Q3_K_M · Q4_K_S · Q4_K_M · Q5_K_S · Q5_K_M


## File Info

| Property | Value |
|---|---|
| Format | GGUF Q8 |
| File size | 4.63 GB |
| Bits per weight | ~8 |
| Size vs F16 | 1.9× smaller |

## Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science) with 3 prompts each.
Greedy decoding, 200 max new tokens. Quality metrics compare the quantized model's logit distributions against the F16 baseline.
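For intuition, the two agreement metrics can be sketched as follows. This is a minimal NumPy illustration of how top-1 agreement and mean KL divergence between F16 and quantized logits are typically computed, not the exact benchmark harness used here:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top1_agreement(ref_logits, quant_logits):
    # Fraction of positions where both models pick the same argmax token.
    return float(np.mean(ref_logits.argmax(-1) == quant_logits.argmax(-1)))

def kl_divergence(ref_logits, quant_logits, eps=1e-12):
    # Mean KL(P_f16 || P_quant) over positions, in nats.
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))
```

A top-1 agreement of 96% means the quantized model's greedy choice matches F16 at 96 of every 100 positions; KL divergence additionally penalizes shifts in the rest of the distribution.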

### Results by Category

| Category | Speed (tok/s) | SQNR | Top-1 Agreement | KL Divergence |
|---|---|---|---|---|
| 🔢 Math | 16.2 | 37.1 dB | 95.7% | 0.0151 |
| 🧠 Logic | 16.2 | 36.8 dB | 96.6% | 0.0166 |
| 💻 Code | 16.3 | 37.8 dB | 97.4% | 0.0155 |
| 🔬 Science | 16.3 | 36.7 dB | 94.2% | 0.0209 |
| **Overall** | 16.2 | 37.11 dB | 96.0% | 0.0171 |
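SQNR here quantifies how much the quantized model's outputs deviate from F16. A minimal sketch of the standard definition (signal power over quantization-noise power, in dB), assuming it is computed over the raw logit arrays:

```python
import numpy as np

def sqnr_db(ref, quant):
    # Signal-to-quantization-noise ratio in dB:
    # power of the reference signal divided by power of the error.
    noise = ref - quant
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))
```

Higher is better: ~37 dB means the quantization error carries roughly 5000× less power than the signal itself.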

## Quantization Comparison

| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0× | baseline | baseline | baseline |
| Q4_K_M | 3.19 GB | 24.0 | 4.2× | 20.33 dB | 82.4% | 0.3356 |
| Q5_K_M | 3.38 GB | 22.0 | 3.9× | 23.25 dB | 86.9% | 0.1248 |
| Q6_K | 3.58 GB | 19.9 | 3.5× | 28.72 dB | 94.1% | 0.0743 |
| **Q8 (this)** | 4.63 GB | 16.2 | 2.9× | 37.11 dB | 96.0% | 0.0171 |

## Key Findings

- **Quality:** 37.11 dB SQNR and a KL divergence of just 0.017, for all practical purposes identical to F16
- **Top-1 agreement:** 96.0%, i.e. the model picks the same token as F16 96 times out of 100
- **Speed:** 16.2 tok/s, still 2.9× faster than F16, though slower than the lower-bit quants
- **Size:** 4.63 GB, fits in 6 GB of RAM at roughly half the size of F16
- **vs Q6_K:** +8.4 dB SQNR and +1.9% top-1 agreement for 1.05 GB extra; worth it if you have the RAM
- **Best for:** maximum quality with a reasonable size reduction; production deployments where output must match F16 as closely as possible
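The Q6_K trade-off above follows directly from the comparison table; the deltas can be checked with a few lines of arithmetic:

```python
# Figures taken from the quantization comparison table.
q8 = {"size_gb": 4.63, "sqnr_db": 37.11, "top1_pct": 96.0}
q6 = {"size_gb": 3.58, "sqnr_db": 28.72, "top1_pct": 94.1}

extra_gb = round(q8["size_gb"] - q6["size_gb"], 2)       # 1.05
sqnr_gain = round(q8["sqnr_db"] - q6["sqnr_db"], 2)      # 8.39 (~+8.4 dB)
top1_gain = round(q8["top1_pct"] - q6["top1_pct"], 1)    # 1.9
print(f"Q8 vs Q6_K: +{sqnr_gain} dB SQNR, +{top1_gain}% top-1, for +{extra_gb} GB")
```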

## Usage

```bash
# llama.cpp CLI
./llama-cli -m gemma-4-e2b-q8.gguf -p "Explain how a transformer neural network works." -n 200
```

```python
# llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q8.gguf", n_ctx=2048)
output = llm("Explain how a transformer neural network works.", max_tokens=200)
print(output["choices"][0]["text"])
```

## Hardware

Tested on: CPU inference (llama.cpp)
Context: 2048 tokens | Greedy decoding
