# Gemma 4 E2B – Q6_K GGUF

6-bit quantized GGUF version of google/gemma-4-e2b-it.

Near-lossless quantization: 94% Top-1 agreement with the F16 baseline at a fraction of the size.
Other quantizations in this series:

Q2_K · Q3_K_S · Q3_K_M · Q4_K_S · Q4_K_M · Q5_K_S · Q5_K_M · Q8
## File Info
| Property | Value |
|---|---|
| Format | GGUF Q6_K |
| File size | 3.58 GB |
| Bits per weight | ~6 |
| Size vs F16 | 2.4× smaller |
## Benchmark Results
Tested across 4 categories (Math, Logic, Code, Science), with 3 prompts per category. Greedy decoding, 200 max new tokens. Quality metrics compare per-token logit distributions against the F16 baseline.
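For reference, the two agreement metrics can be computed as sketched below. This is a minimal illustration, not the exact evaluation script; it assumes you have captured per-position logit matrices of shape `(positions, vocab)` from the F16 and quantized runs as NumPy arrays (all names here are hypothetical):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top1_agreement(f16_logits: np.ndarray, q_logits: np.ndarray) -> float:
    # Fraction of positions where both models choose the same argmax token.
    return float((f16_logits.argmax(-1) == q_logits.argmax(-1)).mean())

def mean_kl(f16_logits: np.ndarray, q_logits: np.ndarray) -> float:
    # KL(P_f16 || P_quant) per position, averaged over all positions.
    p, q = softmax(f16_logits), softmax(q_logits)
    return float((p * (np.log(p) - np.log(q))).sum(-1).mean())
```

Identical logits give 100% Top-1 agreement and zero KL divergence; quantization noise pushes KL up and agreement down, which is the trend visible across the quantization comparison table below.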
### Results by Category
| Category | Speed (tok/s) | SQNR | Top-1 Agreement | KL Divergence |
|---|---|---|---|---|
| Math | 19.7 | 28.7 dB | 94.3% | 0.0796 |
| Logic | 19.9 | 29.3 dB | 93.2% | 0.0891 |
| Code | 20.0 | 29.0 dB | 93.6% | 0.0502 |
| Science | 19.9 | 27.8 dB | 95.1% | 0.0784 |
| Overall | 19.9 | 28.72 dB | 94.1% | 0.0743 |
## Quantization Comparison
| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0× | baseline | baseline | baseline |
| Q4_K_M | 3.19 GB | 24.0 | 4.2× | 20.33 dB | 82.4% | 0.3356 |
| Q5_K_M | 3.38 GB | 22.0 | 3.9× | 23.25 dB | 86.9% | 0.1248 |
| Q6_K (this) | 3.58 GB | 19.9 | 3.5× | 28.72 dB | 94.1% | 0.0743 |
| Q8 | 4.63 GB | 16.2 | 2.9× | 37.11 dB | 96.0% | 0.0171 |
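SQNR in these tables is the signal-to-quantization-noise ratio of the model outputs relative to F16. As a rough illustration of how such a figure is derived (again a sketch, not the exact evaluation script), treat the F16 logits as the signal and the quantized model's deviation from them as noise:

```python
import numpy as np

def sqnr_db(ref: np.ndarray, approx: np.ndarray) -> float:
    # SQNR = 10 * log10(signal power / noise power), in decibels.
    ref64 = ref.astype(np.float64)
    signal = np.mean(ref64 ** 2)
    noise = np.mean((ref64 - approx) ** 2)
    return float(10.0 * np.log10(signal / noise))
```

A useful rule of thumb: each ~6 dB of SQNR corresponds to roughly one extra effective bit of precision, which is why the 5.5 dB jump from Q5_K_M (23.25 dB) to Q6_K (28.72 dB) translates into a visible quality gain.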
## Key Findings
- Quality: 94.1% Top-1 agreement crosses the "near-identical to F16" threshold; only Q8 scores higher
- SQNR: 28.72 dB, a substantial 5.5 dB jump over Q5_K_M; outputs are essentially indistinguishable from F16 in practice
- Speed: 19.9 tok/s, 3.5× faster than F16
- Size: 3.58 GB, only ~0.4 GB more than Q4_K_M for a major quality improvement
- vs Q8: Q6_K is 1.05 GB smaller and 3.7 tok/s faster, with only a small quality difference (94.1% vs 96.0% Top-1)
- Best for: quality-sensitive tasks where you want near-F16 fidelity but still need to fit under ~5 GB; scientific explanation, precise reasoning, complex multi-step code
## Usage
```bash
# llama.cpp CLI
./llama-cli -m gemma-4-e2b-q6k.gguf \
  -p "Explain the difference between supervised and reinforcement learning." \
  -n 200
```

```python
# llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q6k.gguf", n_ctx=2048)
output = llm(
    "Explain the difference between supervised and reinforcement learning.",
    max_tokens=200,
)
print(output["choices"][0]["text"])
```
## Hardware

- Tested on: CPU inference (llama.cpp)
- Context: 2048 tokens | Greedy decoding