Gemma 4 E2B it โ Q8 GGUF
8-bit quantized GGUF version of google/gemma-4-e2b-it.
Highest quality quantization โ 96% Top-1 agreement with F16, effectively lossless.
Other quantizations in this series:
Q2_K ยท Q3_K_S ยท Q3_K_M ยท Q4_K_S ยท Q4_K_M ยท Q5_K_S ยท Q5_K_M
File Info
| Property | Value |
|---|---|
| Format | GGUF Q8 |
| File size | 4.63 GB |
| Bits per weight | ~8 |
| Size vs F16 | 1.9ร smaller |
Benchmark Results
Tested across 4 categories (Math, Logic, Code, Science), 3 prompts each.
Greedy decoding, 200 max new tokens. Metrics compare logit distributions vs F16 baseline.
Results by Category
| Category | Speed (tok/s) | SQNR | Top-1 Agreement | KL Divergence |
|---|---|---|---|---|
| ๐ข Math | 16.2 | 37.1 dB | 95.7% | 0.0151 |
| ๐ง Logic | 16.2 | 36.8 dB | 96.6% | 0.0166 |
| ๐ป Code | 16.3 | 37.8 dB | 97.4% | 0.0155 |
| ๐ฌ Science | 16.3 | 36.7 dB | 94.2% | 0.0209 |
| Overall | 16.2 | 37.11 dB | 96.0% | 0.0171 |
Quantization Comparison
| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0ร | baseline | baseline | baseline |
| Q4_K_M | 3.19 GB | 24.0 | 4.2ร | 20.33 dB | 82.4% | 0.3356 |
| Q5_K_M | 3.38 GB | 22.0 | 3.9ร | 23.25 dB | 86.9% | 0.1248 |
| Q6_K | 3.58 GB | 19.9 | 3.5ร | 28.72 dB | 94.1% | 0.0743 |
| Q8 (this) | 4.63 GB | 16.2 | 2.9ร | 37.11 dB | 96.0% | 0.0171 |
Key Findings
- Quality: 37.11 dB SQNR and KL divergence of just 0.017 โ for all practical purposes, identical to F16
- Top-1 Agreement: 96.0% โ the model picks the same token as F16 96 times out of 100
- Speed: 16.2 tok/s โ still 2.9ร faster than F16, just slower than lower-bit quants
- Size: 4.63 GB โ fits in 6 GB RAM; half the size of F16
- vs Q6_K: +8.4 dB SQNR and +1.9% Top-1 for 1.05 GB extra; worth it if you have the RAM
- Best for: Maximum quality with reasonable size reduction; production deployments where output must match F16 as closely as possible
Usage
# llama.cpp CLI
./llama-cli -m gemma-4-e2b-q8.gguf -p "Explain how a transformer neural network works." -n 200
# llama-cpp-python
from llama_cpp import Llama
llm = Llama(model_path="gemma-4-e2b-q8.gguf", n_ctx=2048)
output = llm("Explain how a transformer neural network works.", max_tokens=200)
print(output["choices"][0]["text"])
Hardware
Tested on: CPU inference (llama.cpp)
Context: 2048 tokens | Greedy decoding
- Downloads last month
- 373
Hardware compatibility
Log In to add your hardware
8-bit