Gemma 4 E2B it – Q2_K GGUF

2-bit quantized GGUF version of google/gemma-4-e2b-it.
Smallest and fastest variant in the series; use it only if RAM is the hard constraint.

Other quantizations in this series:
Q3_K_S · Q3_K_M · Q4_K_S · Q4_K_M · Q5_K_S · Q5_K_M · Q6_K · Q8


File Info

| Property | Value |
|---|---|
| Format | GGUF Q2_K |
| File size | 2.78 GB |
| Bits per weight | ~2 |
| Size vs F16 | 3.1× smaller |
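The "3.1× smaller" figure is just the F16 baseline size divided by this file's size (both sizes appear in the comparison table further down); a quick sanity check:

```python
# File sizes as reported in this card.
f16_gb = 8.67   # F16 baseline
q2k_gb = 2.78   # Q2_K (this file)

ratio = f16_gb / q2k_gb
print(f"{ratio:.1f}x smaller")  # -> 3.1x smaller
```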

Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science), 3 prompts each.
Greedy decoding, 200 max new tokens. Metrics compare logit distributions vs F16 baseline.
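The three quality metrics can be sketched as follows. This is a minimal NumPy illustration of the standard definitions (SQNR, Top-1 agreement, mean KL divergence over positions), not the exact evaluation harness used for this card; function names are illustrative:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sqnr_db(ref, quant):
    """Signal-to-quantization-noise ratio in dB: signal power over error power."""
    noise = ref - quant
    return 10.0 * np.log10(np.sum(ref**2) / np.sum(noise**2))

def top1_agreement(ref_logits, quant_logits):
    """Fraction of positions where both models pick the same argmax token."""
    return float(np.mean(ref_logits.argmax(-1) == quant_logits.argmax(-1)))

def kl_divergence(ref_logits, quant_logits):
    """Mean KL(P_f16 || P_quant) across positions, in nats."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

Identical logits give 1.0 Top-1 agreement and zero KL; the Q2_K numbers below show how far the quantized distributions drift from that ideal.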

Results by Category

| Category | Speed (tok/s) | SQNR | Top-1 Agreement | KL Divergence |
|---|---|---|---|---|
| 🔢 Math | 30.9 | 5.0 dB | 35.3% | 3.8922 |
| 🧠 Logic | 31.7 | 5.7 dB | 33.8% | 4.1991 |
| 💻 Code | 31.8 | 6.7 dB | 24.4% | 4.4969 |
| 🔬 Science | 32.0 | 6.4 dB | 34.3% | 3.8713 |
| Overall | 31.6 | 5.85 dB | 32.0% | 4.1149 |

Quantization Comparison

| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0× | baseline | baseline | baseline |
| Q2_K (this) | 2.78 GB | 31.6 | 5.6× | 5.85 dB | 32.0% | 4.1149 |
| Q3_K_S | 2.90 GB | 28.9 | 5.1× | 10.12 dB | 63.2% | 1.2605 |
| Q3_K_M | 2.98 GB | 27.4 | 4.8× | 13.93 dB | 63.2% | 1.6747 |
| Q4_K_M | 3.19 GB | 24.0 | 4.2× | 20.33 dB | 82.4% | 0.3356 |
| Q5_K_M | 3.38 GB | 22.0 | 3.9× | 23.25 dB | 86.9% | 0.1248 |
| Q6_K | 3.58 GB | 19.9 | 3.5× | 28.72 dB | 94.1% | 0.0743 |
| Q8 | 4.63 GB | 16.2 | 2.9× | 37.11 dB | 96.0% | 0.0171 |
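One way to use this table programmatically is to pick the smallest file that both fits a RAM budget and clears a quality floor. A hypothetical helper (data copied from the table above; `pick_quant` and its defaults are illustrative, and real RAM use also includes the KV cache and runtime overhead):

```python
# (size_gb, top1_pct) per quantization, from the comparison table above.
QUANTS = {
    "Q2_K":   (2.78, 32.0),
    "Q3_K_S": (2.90, 63.2),
    "Q3_K_M": (2.98, 63.2),
    "Q4_K_M": (3.19, 82.4),
    "Q5_K_M": (3.38, 86.9),
    "Q6_K":   (3.58, 94.1),
    "Q8":     (4.63, 96.0),
}

def pick_quant(ram_budget_gb, min_top1_pct=80.0):
    """Smallest quant that fits the budget and meets a Top-1 agreement floor.

    Leave headroom beyond the raw file size for the KV cache and runtime.
    Returns None if no variant satisfies both constraints.
    """
    candidates = [
        (size, name)
        for name, (size, top1) in QUANTS.items()
        if size <= ram_budget_gb and top1 >= min_top1_pct
    ]
    return min(candidates)[1] if candidates else None
```

For example, a 4 GB budget with the default 80% floor selects Q4_K_M rather than this Q2_K file, which matches the recommendation in the warning below.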

Key Findings

- Quality: Severe degradation. Only 32% Top-1 agreement with F16, and output can be incoherent (see sample below)
- Speed: 31.6 tok/s, the fastest in the series and 5.6× faster than F16
- Size: 2.78 GB, fits in under 4 GB of RAM
- Best for: Extremely RAM-constrained environments where severe output-quality loss is acceptable; not recommended for reasoning or code tasks

โš ๏ธ Warning: Q2_K produces visibly broken outputs on this model. Sample response to a math prompt repeated token garbage (skills skills skills...). Consider Q3_K_S or higher for usable results.


Usage

```bash
# llama.cpp CLI
./llama-cli -m gemma-4-e2b-q2k.gguf -p "Solve step by step: 2 + 2 = ?" -n 200
```

```python
# llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q2k.gguf", n_ctx=2048)
output = llm("Solve step by step: 2 + 2 = ?", max_tokens=200)
print(output["choices"][0]["text"])
```

Hardware

Tested on: CPU inference (llama.cpp)
Context: 2048 tokens | Greedy decoding

Model size: 5B params · Architecture: gemma4