Gemma 4 E2B it — INT8 Quantization (bitsandbytes)

Quantized version of google/gemma-4-e2b-it using bitsandbytes INT8. Tested on RTX 5090 (Blackwell, sm_120).

Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science) with 3 prompts each, using greedy decoding (do_sample=False) and 200 max new tokens.

| Metric | FP16 (baseline) | Q8 | Q4 |
|---|---|---|---|
| SQNR | – | 27.49 dB | 18.75 dB |
| Top-1 Agreement | – | 92.9% | 81.1% |
| KL Divergence | – | 0.0496 | 0.3334 |
| Speed (tok/s) | 56.9 | 14.5 | 40.2 |
| VRAM | 9.5 GB | 7.4 GB | 6.3 GB |

SQNR, Top-1 Agreement, and KL Divergence are measured relative to the FP16 baseline, so the FP16 column is not applicable for those rows.
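The SQNR, Top-1 Agreement, and KL Divergence figures compare the quantized model's logits against the FP16 baseline on the same token sequences. Below is a minimal sketch of one way such per-token metrics can be computed; the exact benchmark harness is not included in this repo, so the prompt, variable names, and single-prompt setup are illustrative assumptions.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

BASE = "google/gemma-4-e2b-it"  # FP16 reference checkpoint the Q8 weights are derived from

tokenizer = AutoTokenizer.from_pretrained(BASE)
fp16 = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, device_map="cuda")
int8 = AutoModelForCausalLM.from_pretrained(
    BASE,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="cuda",
)

# Illustrative single prompt; the reported numbers average over 12 prompts (4 categories x 3).
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 23?"}],
    add_generation_prompt=True, return_tensors="pt", return_dict=True,
).to("cuda")

with torch.no_grad():
    # Greedy-decode once with the FP16 baseline, then score the same sequence with both models.
    sequence = fp16.generate(**inputs, max_new_tokens=200, do_sample=False)
    logits_fp16 = fp16(sequence).logits.float()
    logits_int8 = int8(sequence).logits.float()

# SQNR (dB): ratio of logit signal power to quantization-error power.
err = logits_fp16 - logits_int8
sqnr_db = 10 * torch.log10(logits_fp16.pow(2).mean() / err.pow(2).mean())

# Top-1 agreement: fraction of positions where both models pick the same next token.
top1 = (logits_fp16.argmax(-1) == logits_int8.argmax(-1)).float().mean()

# KL divergence D_KL(FP16 || INT8) of the next-token distributions, averaged over positions.
log_p = F.log_softmax(logits_fp16, dim=-1)
log_q = F.log_softmax(logits_int8, dim=-1)
kl = (log_p.exp() * (log_p - log_q)).sum(-1).mean()

print(f"SQNR: {sqnr_db:.2f} dB | Top-1: {top1:.1%} | KL: {kl:.4f}")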

Results by Category (Q8 vs FP16 baseline)

| Category | SQNR | Top-1 Agreement | KL Divergence | Speed (tok/s) |
|---|---|---|---|---|
| 🔢 Math | 27.09 dB | 92.4% | 0.0424 | 14.8 |
| 🧠 Logic | 27.18 dB | 92.8% | 0.0802 | 13.9 |
| 💻 Code | 29.49 dB | 94.5% | 0.0346 | 14.8 |
| 🔬 Science | 26.34 dB | 92.1% | 0.0410 | 14.7 |

Key Findings

  • Quality: Q8 retains 92.9% token agreement with FP16 — nearly identical output
  • VRAM: Saves 2.1 GB vs FP16 (1.3x compression)
  • Speed: 14.5 tok/s, slower than Q4 due to bfloat16→float16 cast overhead in bitsandbytes on Blackwell GPUs (see the measurement sketch after this list)
  • Best for: workloads where output quality matters more than decoding speed
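The speed and VRAM figures can be reproduced by timing generate() and reading the CUDA allocator's peak statistics. The sketch below shows one way to do this; the prompt, warm-up size, and use of max_memory_allocated are assumptions, not the exact harness behind the tables above.

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL = "MichaelLowrance/gemma-4-e2b-q8"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="cuda",
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain how binary search works."}],  # illustrative prompt
    add_generation_prompt=True, return_tensors="pt", return_dict=True,
).to("cuda")

# Warm-up run so one-time CUDA kernel and cache setup is excluded from the timing.
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=16, do_sample=False)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"{new_tokens / elapsed:.1f} tok/s | peak allocated {peak_gib:.1f} GiB")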

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load the INT8-quantized checkpoint via bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(
    "MichaelLowrance/gemma-4-e2b-q8",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("MichaelLowrance/gemma-4-e2b-q8")

# Build a chat prompt; add_generation_prompt=True appends the assistant turn marker.
messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")

# Greedy decoding, matching the benchmark settings above.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens (slice off the prompt).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
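As a quick sanity check that the checkpoint really loaded in 8-bit, you can inspect the module types and the weight footprint. This short follow-up assumes the model object from the snippet above; bitsandbytes replaces linear layers with Linear8bitLt modules when load_in_8bit is active.

import bitsandbytes as bnb

# True if bitsandbytes swapped the linear layers in for 8-bit inference.
print(any(isinstance(m, bnb.nn.Linear8bitLt) for m in model.modules()))

# Weight footprint only (excludes KV cache and activations).
print(f"{model.get_memory_footprint() / 1024**3:.2f} GiB")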

Hardware

Tested on: NVIDIA RTX 5090 (Blackwell, sm_120, 32GB GDDR7)
CUDA: 12.8 | Python: 3.12 | transformers: 5.6.0.dev
