# Gemma 4 E2B: Q6_K GGUF

6-bit quantized GGUF version of google/gemma-4-e2b-it.
Near-lossless quantization: 94.1% Top-1 agreement with the F16 baseline at a fraction of the size.

Other quantizations in this series:
Q2_K · Q3_K_S · Q3_K_M · Q4_K_S · Q4_K_M · Q5_K_S · Q5_K_M · Q8


## File Info

| Property | Value |
|---|---|
| Format | GGUF Q6_K |
| File size | 3.58 GB |
| Bits per weight | ~6 |
| Size vs F16 | 2.4× smaller |

## Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science) with 3 prompts each.
Greedy decoding, 200 max new tokens. Metrics compare logit distributions against the F16 baseline.
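
The agreement metrics can be computed directly from per-position logits. A minimal NumPy sketch, assuming `logits_q` and `logits_f16` are `[positions, vocab]` arrays from the quantized and F16 models (function names are illustrative, not the benchmark harness):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocab axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top1_agreement(logits_q, logits_f16):
    """Fraction of positions where both models pick the same argmax token."""
    return float(np.mean(logits_q.argmax(-1) == logits_f16.argmax(-1)))

def mean_kl(logits_q, logits_f16, eps=1e-12):
    """Mean KL(F16 || quantized) over token positions, in nats."""
    p = softmax(logits_f16)
    q = softmax(logits_q)
    return float(np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)))
```

With greedy decoding, Top-1 agreement is exactly the fraction of steps where both models would emit the same token, which is why it tracks perceived output quality so closely.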

Results by Category

Category Speed (tok/s) SQNR Top-1 Agreement KL Divergence
๐Ÿ”ข Math 19.7 28.7 dB 94.3% 0.0796
๐Ÿง  Logic 19.9 29.3 dB 93.2% 0.0891
๐Ÿ’ป Code 20.0 29.0 dB 93.6% 0.0502
๐Ÿ”ฌ Science 19.9 27.8 dB 95.1% 0.0784
Overall 19.9 28.72 dB 94.1% 0.0743
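
SQNR measures, on a log scale, how closely the quantized model's outputs track the F16 reference. A minimal sketch of the standard definition (assuming flattened logit arrays; not the exact harness code):

```python
import numpy as np

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB: 10*log10(P_signal / P_noise),
    where noise is the deviation of the quantized outputs from the reference."""
    reference = np.asarray(reference, dtype=float)
    noise = np.asarray(quantized, dtype=float) - reference
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(noise**2))
```

Higher is better. By the classic quantization rule of thumb, each extra bit of precision buys roughly 6 dB, which lines up with the step from Q5_K_M (23.25 dB) to Q6_K (28.72 dB) in the comparison below.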

### Quantization Comparison

| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0× | baseline | baseline | baseline |
| Q4_K_M | 3.19 GB | 24.0 | 4.2× | 20.33 dB | 82.4% | 0.3356 |
| Q5_K_M | 3.38 GB | 22.0 | 3.9× | 23.25 dB | 86.9% | 0.1248 |
| **Q6_K (this)** | 3.58 GB | 19.9 | 3.5× | 28.72 dB | 94.1% | 0.0743 |
| Q8 | 4.63 GB | 16.2 | 2.9× | 37.11 dB | 96.0% | 0.0171 |

## Key Findings

- **Quality:** 94.1% Top-1 agreement crosses the "near-identical to F16" threshold; only Q8 scores higher.
- **SQNR:** 28.72 dB, a substantial 5.5 dB jump over Q5_K_M; outputs are essentially indistinguishable from F16 in practice.
- **Speed:** 19.9 tok/s, 3.5× faster than F16.
- **Size:** 3.58 GB, only 0.39 GB more than Q4_K_M for a major quality improvement.
- **vs Q8:** Q6_K is 1.05 GB smaller and 3.7 tok/s faster, with only a small quality gap (94.1% vs 96.0% Top-1).
- **Best for:** quality-sensitive tasks that need near-F16 fidelity while fitting under ~5 GB: scientific explanation, precise reasoning, complex multi-step code.
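
The deltas quoted above follow directly from the comparison table; a quick arithmetic sanity check, with figures copied from the table:

```python
# Sizes (GB) and speeds (tok/s) from the quantization-comparison table.
sizes_gb = {"F16": 8.67, "Q4_K_M": 3.19, "Q5_K_M": 3.38, "Q6_K": 3.58, "Q8": 4.63}
speeds = {"F16": 5.7, "Q4_K_M": 24.0, "Q5_K_M": 22.0, "Q6_K": 19.9, "Q8": 16.2}

# Q6_K vs F16: compression ratio and speedup
print(round(sizes_gb["F16"] / sizes_gb["Q6_K"], 1))  # 2.4 (x smaller)
print(round(speeds["Q6_K"] / speeds["F16"], 1))      # 3.5 (x faster)

# Q6_K vs Q8: size saved and speed gained
print(round(sizes_gb["Q8"] - sizes_gb["Q6_K"], 2))   # 1.05 GB
print(round(speeds["Q6_K"] - speeds["Q8"], 1))       # 3.7 tok/s
```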

## Usage

**llama.cpp CLI**

```bash
./llama-cli -m gemma-4-e2b-q6k.gguf \
  -p "Explain the difference between supervised and reinforcement learning." \
  -n 200
```

**llama-cpp-python**

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q6k.gguf", n_ctx=2048)
output = llm(
    "Explain the difference between supervised and reinforcement learning.",
    max_tokens=200,
)
print(output["choices"][0]["text"])
```

## Hardware

Tested on: CPU inference (llama.cpp)
Context: 2048 tokens | Greedy decoding
