# Gemma 4 E2B it – Q3_K_S GGUF
A 3-bit (small) K-quant GGUF version of google/gemma-4-e2b-it.
Slightly faster than Q3_K_M, with comparable output quality.
Other quantizations in this series:
Q2_K · Q3_K_M · Q4_K_S · Q4_K_M · Q5_K_S · Q5_K_M · Q6_K · Q8
## File Info
| Property | Value |
|---|---|
| Format | GGUF Q3_K_S |
| File size | 3.11 GB |
| Bits per weight | ~3 |
| Size vs F16 | 2.8× smaller |
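The size ratio follows directly from the file sizes reported on this card (F16 baseline: 8.67 GB). It is well below the naive 16/3 ≈ 5.3× you might expect from the bit widths alone, likely because GGUF k-quants keep some tensors (e.g. embeddings) at higher precision:

```python
# Size ratio of the F16 baseline to this Q3_K_S file,
# using the file sizes reported in this card.
f16_gb = 8.67   # F16 baseline, from the comparison table below
q3ks_gb = 3.11  # Q3_K_S file size

ratio = f16_gb / q3ks_gb
print(f"{ratio:.1f}x smaller")  # -> 2.8x smaller
```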
## Benchmark Results
Tested across 4 categories (Math, Logic, Code, Science), 3 prompts each.
Greedy decoding, 200 max new tokens. Metrics compare logit distributions vs F16 baseline.
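As a sketch of how such metrics are computed (the actual benchmark harness is not shown on this card; the helper names and the toy logits below are illustrative): top-1 agreement counts positions where both models pick the same argmax token, and KL divergence measures how far the quantized model's next-token distribution drifts from the F16 reference.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocab dimension.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), averaged over positions: how far the quantized
    # distribution q drifts from the F16 reference p.
    return float(np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)))

def top1_agreement(logits_ref, logits_quant):
    # Fraction of positions where both models pick the same argmax token.
    return float(np.mean(logits_ref.argmax(-1) == logits_quant.argmax(-1)))

# Toy example: 5 positions, vocab of 8, "quantized" = reference + noise.
rng = np.random.default_rng(0)
logits_f16 = rng.normal(size=(5, 8))
logits_q = logits_f16 + 0.1 * rng.normal(size=(5, 8))

p, q = softmax(logits_f16), softmax(logits_q)
print("KL:", kl_divergence(p, q))
print("Top-1:", top1_agreement(logits_f16, logits_q))
```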
### Results by Category
| Category | Speed (tok/s) | SQNR | Top-1 Agreement | KL Divergence |
|---|---|---|---|---|
| Math | 29.0 | 10.4 dB | 64.3% | 1.2991 |
| Logic | 28.9 | 10.9 dB | 61.9% | 1.4255 |
| Code | 28.9 | 9.5 dB | 64.4% | 1.0009 |
| Science | 28.8 | 9.4 dB | 62.0% | 1.3165 |
| Overall | 28.9 | 10.12 dB | 63.2% | 1.2605 |
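The Overall row appears to be the unweighted mean of the four category scores; checking the KL divergence column against the table values:

```python
# Per-category KL divergence values from the table above.
kl = {"Math": 1.2991, "Logic": 1.4255, "Code": 1.0009, "Science": 1.3165}

overall = sum(kl.values()) / len(kl)
print(round(overall, 4))  # -> 1.2605, matching the Overall row
```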
## Quantization Comparison
| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0× | baseline | baseline | baseline |
| Q2_K | 2.99 GB | 31.6 | 5.6× | 5.85 dB | 32.0% | 4.1149 |
| Q3_K_S (this) | 3.11 GB | 28.9 | 5.1× | 10.12 dB | 63.2% | 1.2605 |
| Q3_K_M | 2.98 GB | 27.4 | 4.8× | 13.93 dB | 63.2% | 1.6747 |
| Q4_K_S | 3.37 GB | 25.0 | 4.4× | 19.10 dB | 80.9% | 0.3456 |
| Q4_K_M | 3.43 GB | 24.0 | 4.2× | 20.33 dB | 82.4% | 0.3356 |
| Q5_K_S | 3.6 GB | 21.9 | 3.9× | 23.32 dB | 87.7% | 0.1547 |
| Q5_K_M | 3.63 GB | 22.0 | 3.9× | 23.25 dB | 86.9% | 0.1248 |
| Q8 | 4.97 GB | 16.2 | 2.9× | 37.11 dB | 96.0% | 0.0171 |
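SQNR (signal-to-quantization-noise ratio) reports, in dB, the energy of the original values relative to the quantization error. A minimal sketch with a naive symmetric uniform quantizer on synthetic weights (the actual llama.cpp k-quant schemes are block-wise and considerably more elaborate):

```python
import numpy as np

def sqnr_db(signal, reconstructed):
    # 10 * log10(signal power / quantization-noise power).
    noise = signal - reconstructed
    return float(10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2)))

def fake_quantize(w, bits=3):
    # Naive symmetric uniform quantizer: snap to the int grid, then rescale.
    levels = 2 ** (bits - 1) - 1          # e.g. 3 bits -> grid [-3, 3]
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=100_000).astype(np.float32)
print(f"{sqnr_db(w, fake_quantize(w)):.1f} dB")
```

As the table shows, more bits means higher SQNR: each extra bit roughly halves the quantization step size.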
## Key Findings
- Quality: 63.2% Top-1 agreement – same as Q3_K_M at the aggregate level, with lower KL divergence (1.26 vs 1.67)
- Speed: 28.9 tok/s – slightly faster than Q3_K_M (27.4 tok/s)
- Size: 3.11 GB – slightly larger than Q3_K_M (2.98 GB) per the comparison table
- vs Q3_K_M: Q3_K_S has lower SQNR but lower (better) KL divergence, meaning its output probability distributions are closer to F16 despite noisier logits – practically equivalent for most tasks
- Best for: the same use cases as Q3_K_M; prefer Q3_K_S if you want a small speed advantage
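One way to act on these trade-offs: filter the comparison table by a file-size budget and pick the variant with the lowest KL divergence. The numbers below are copied from the table above; the helper itself is just an illustration:

```python
# (size GB, KL divergence) per quantization, from the comparison table.
quants = {
    "Q2_K": (2.99, 4.1149), "Q3_K_S": (3.11, 1.2605), "Q3_K_M": (2.98, 1.6747),
    "Q4_K_S": (3.37, 0.3456), "Q4_K_M": (3.43, 0.3356),
    "Q5_K_S": (3.60, 0.1547), "Q5_K_M": (3.63, 0.1248), "Q8": (4.97, 0.0171),
}

def best_under(budget_gb):
    # Lowest-KL quant whose file fits in the budget; None if nothing fits.
    fits = [(kl, name) for name, (size, kl) in quants.items() if size <= budget_gb]
    return min(fits)[1] if fits else None

print(best_under(3.2))  # -> Q3_K_S
print(best_under(3.5))  # -> Q4_K_M
```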
## Usage

```bash
# llama.cpp CLI
./llama-cli -m gemma-4-e2b-q3ks.gguf -p "Explain the water cycle." -n 200
```

```python
# llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q3ks.gguf", n_ctx=2048)
output = llm("Explain the water cycle.", max_tokens=200)
print(output["choices"][0]["text"])
```
## Hardware
Tested on: CPU inference (llama.cpp)
Context: 2048 tokens | Greedy decoding