# Gemma 4 E2B it: NF4 Quantization (bitsandbytes)
Quantized version of google/gemma-4-e2b-it using bitsandbytes NF4 (4-bit). Tested on RTX 5090 (Blackwell, sm_120).
## Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science) with 3 prompts each, using greedy decoding (`do_sample=False`) and 200 max new tokens.
| Metric | FP16 (baseline) | Q8 | Q4 |
|---|---|---|---|
| SQNR | — | 27.49 dB | 18.75 dB |
| Top-1 Agreement | — | 92.9% | 81.1% |
| KL Divergence | — | 0.0496 | 0.3334 |
| Speed (tok/s) | 56.9 | 14.5 | 40.2 |
| VRAM | 9.5 GB | 7.4 GB | 6.3 GB |
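The aggregate metrics above (SQNR, top-1 agreement, KL divergence) could be computed along these lines. This is a minimal sketch: the card does not state the exact formulas, so the definitions below (per-token logit comparison between the FP16 baseline and the quantized model) are assumptions.

```python
import numpy as np

def sqnr_db(ref_logits, quant_logits):
    """Signal-to-quantization-noise ratio in dB between two logit arrays."""
    noise = ref_logits - quant_logits
    return 10.0 * np.log10(np.sum(ref_logits ** 2) / np.sum(noise ** 2))

def top1_agreement(ref_logits, quant_logits):
    """Fraction of positions where both models pick the same argmax token."""
    return float(np.mean(ref_logits.argmax(-1) == quant_logits.argmax(-1)))

def kl_divergence(ref_logits, quant_logits):
    """Mean KL(P_ref || P_quant) over positions, computed from raw logits."""
    # Stable softmax for both distributions.
    r = ref_logits - ref_logits.max(-1, keepdims=True)
    q = quant_logits - quant_logits.max(-1, keepdims=True)
    p = np.exp(r); p /= p.sum(-1, keepdims=True)
    s = np.exp(q); s /= s.sum(-1, keepdims=True)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(s)), axis=-1)))
```

In practice the two sets of logits would come from teacher-forcing the same token sequence through the FP16 and NF4 models and comparing positions one by one.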
## Results by Category (Q4)
| Category | SQNR | Top-1 Agreement | KL Divergence | Speed (tok/s) |
|---|---|---|---|---|
| 🔢 Math | 18.04 dB | 81.3% | 0.3133 | 40.2 |
| 🧠 Logic | 18.14 dB | 79.3% | 0.4146 | 39.8 |
| 💻 Code | 21.09 dB | 82.4% | 0.2495 | 40.3 |
| 🔬 Science | 18.30 dB | 81.4% | 0.3562 | 40.3 |
## Key Findings
- Quality: 81.1% top-1 token agreement with FP16; minor degradation, output quality remains acceptable on manual inspection
- VRAM: Saves 3.2 GB vs FP16 (1.5x compression)
- Speed: 40.2 tok/s — fastest among quantized versions, 70% of FP16 speed
- Best for: When VRAM is limited and speed matters more than perfect quality
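The arithmetic behind the VRAM and speed bullets above can be checked directly from the table numbers (nothing here is newly measured):

```python
# Numbers taken from the benchmark tables above.
fp16_vram, q4_vram = 9.5, 6.3   # GB, VRAM row
fp16_tps, q4_tps = 56.9, 40.2   # tok/s, Speed row

saved = fp16_vram - q4_vram     # memory saved by NF4, in GB (~3.2)
ratio = fp16_vram / q4_vram     # end-to-end compression (~1.51x)
rel_speed = q4_tps / fp16_tps   # fraction of FP16 throughput (~0.71)

print(f"{saved:.1f} GB saved, {ratio:.2f}x compression, {rel_speed:.0%} of FP16 speed")
```

Note the ~1.5x figure is end-to-end VRAM (weights plus activations and KV cache), not the raw 4x weight compression of FP16 to 4-bit.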
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model = AutoModelForCausalLM.from_pretrained(
    "MichaelLowrance/gemma-4-e2b-q4",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    ),
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("MichaelLowrance/gemma-4-e2b-q4")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Hardware
Tested on: NVIDIA RTX 5090 (Blackwell, sm_120, 32GB GDDR7)
CUDA: 12.8 | Python: 3.12 | transformers: 5.6.0.dev