# Gemma 4 E2B it — INT8 Quantization (bitsandbytes)

Quantized version of `google/gemma-4-e2b-it` using bitsandbytes INT8. Tested on an RTX 5090 (Blackwell, sm_120).
## Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science) with 3 prompts each, using greedy decoding (`do_sample=False`) and 200 max new tokens.
| Metric | FP16 (baseline) | Q8 (this model) | Q4 |
|---|---|---|---|
| SQNR | — | 27.49 dB | 18.75 dB |
| Top-1 Agreement | — | 92.9% | 81.1% |
| KL Divergence | — | 0.0496 | 0.3334 |
| Speed (tok/s) | 56.9 | 14.5 | 40.2 |
| VRAM | 9.5 GB | 7.4 GB | 6.3 GB |
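The card doesn't show how these metrics were computed; below is a minimal sketch of the standard definitions, comparing FP16 ("ref") and quantized next-token logits position by position (function names are illustrative, not taken from the benchmark harness):

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the vocabulary axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def sqnr_db(ref, quant):
    # Signal-to-quantization-noise ratio: power of the reference logits
    # over power of the quantization error, in decibels.
    err = ref - quant
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))

def kl_divergence(ref_logits, quant_logits):
    # Mean KL(p_ref || p_quant) over token positions, from raw logits.
    lp, lq = log_softmax(ref_logits), log_softmax(quant_logits)
    return float(np.mean(np.sum(np.exp(lp) * (lp - lq), axis=-1)))

def top1_agreement(ref_logits, quant_logits):
    # Fraction of positions where the greedy (argmax) token matches.
    return float(np.mean(ref_logits.argmax(-1) == quant_logits.argmax(-1)))
```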
## Results by Category (Q8)

| Category | SQNR | Top-1 Agreement | KL Divergence | Speed (tok/s) |
|---|---|---|---|---|
| 🔢 Math | 27.09 dB | 92.4% | 0.0424 | 14.8 |
| 🧠 Logic | 27.18 dB | 92.8% | 0.0802 | 13.9 |
| 💻 Code | 29.49 dB | 94.5% | 0.0346 | 14.8 |
| 🔬 Science | 26.34 dB | 92.1% | 0.0410 | 14.7 |
## Key Findings

- Quality: Q8 retains 92.9% top-1 token agreement with FP16 — output is nearly identical
- VRAM: saves 2.1 GB vs FP16 (roughly 1.3× compression)
- Speed: 14.5 tok/s — slower than Q4 (40.2 tok/s) due to bfloat16→float16 cast overhead in bitsandbytes on Blackwell GPUs
- Best for: workloads that prioritize output quality over decoding speed
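If decode speed matters more than fidelity, the same loading pattern works for the table's Q4 configuration. A sketch assuming the original `google/gemma-4-e2b-it` checkpoint and NF4 quantization — these settings are an assumption, not necessarily the ones behind the Q4 numbers above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed 4-bit config: NF4 weights with bfloat16 compute.
bnb4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-e2b-it",
    quantization_config=bnb4,
    device_map="cuda",
)
```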
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load the checkpoint with bitsandbytes 8-bit weights
model = AutoModelForCausalLM.from_pretrained(
    "MichaelLowrance/gemma-4-e2b-q8",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("MichaelLowrance/gemma-4-e2b-q8")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
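The throughput numbers in the tables can be reproduced with a simple wall-clock measurement around `generate`. A minimal sketch — the `tokens_per_second` helper is illustrative, not part of the card's harness:

```python
import time

def tokens_per_second(generate_fn, prompt_len):
    # Time one generation call and return decode throughput in tok/s.
    # generate_fn must return the full token sequence (prompt + new tokens),
    # mirroring model.generate()'s output for a single sequence.
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start
    return (len(output_ids) - prompt_len) / elapsed

# With the model loaded as above:
# tps = tokens_per_second(
#     lambda: model.generate(**inputs, max_new_tokens=200, do_sample=False)[0],
#     inputs["input_ids"].shape[1],
# )
```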
## Hardware

Tested on: NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB GDDR7)

CUDA: 12.8 | Python: 3.12 | transformers: 5.6.0.dev