# Gemma 4 E2B it — NF4 Quantization (bitsandbytes)

Quantized version of google/gemma-4-e2b-it using bitsandbytes NF4 (4-bit). Tested on RTX 5090 (Blackwell, sm_120).

## Benchmark Results

Evaluated across 4 categories (Math, Logic, Code, Science) with 3 prompts each, using greedy decoding (`do_sample=False`) and a maximum of 200 new tokens.

| Metric | FP16 (baseline) | Q8 | Q4 |
|---|---|---|---|
| SQNR | – | 27.49 dB | 18.75 dB |
| Top-1 Agreement | – | 92.9% | 81.1% |
| KL Divergence | – | 0.0496 | 0.3334 |
| Speed (tok/s) | 56.9 | 14.5 | 40.2 |
| VRAM | 9.5 GB | 7.4 GB | 6.3 GB |
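The comparison metrics above can be reproduced from paired per-token logits of the FP16 reference and the quantized model. A minimal sketch (the function `compare_logits` is illustrative, not part of any library):

```python
import torch
import torch.nn.functional as F

def compare_logits(ref_logits: torch.Tensor, quant_logits: torch.Tensor):
    """Compare per-token logits of shape (seq_len, vocab_size).

    Returns (sqnr_db, top1_agreement, mean_kl).
    """
    # SQNR: signal power over quantization-noise power, in dB
    noise = ref_logits - quant_logits
    sqnr_db = 10 * torch.log10(ref_logits.pow(2).mean() / noise.pow(2).mean())

    # Top-1 agreement: fraction of positions where the argmax token matches
    top1 = (ref_logits.argmax(-1) == quant_logits.argmax(-1)).float().mean()

    # KL(ref || quant), averaged over sequence positions
    kl = F.kl_div(
        F.log_softmax(quant_logits, -1),   # input: log-probs of quantized model
        F.log_softmax(ref_logits, -1),     # target: log-probs of reference
        log_target=True,
        reduction="batchmean",
    )
    return sqnr_db.item(), top1.item(), kl.item()
```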

## Results by Category

| Category | SQNR | Top-1 Agreement | KL Divergence | Speed (tok/s) |
|---|---|---|---|---|
| 🔢 Math | 18.04 dB | 81.3% | 0.3133 | 40.2 |
| 🧠 Logic | 18.14 dB | 79.3% | 0.4146 | 39.8 |
| 💻 Code | 21.09 dB | 82.4% | 0.2495 | 40.3 |
| 🔬 Science | 18.30 dB | 81.4% | 0.3562 | 40.3 |

## Key Findings

- **Quality:** 81.1% top-1 token agreement with FP16; minor degradation, with visually acceptable output
- **VRAM:** saves 3.2 GB vs FP16 (1.5x compression)
- **Speed:** 40.2 tok/s, the fastest of the quantized versions and about 70% of FP16 speed
- **Best for:** setups where VRAM is limited and speed matters more than perfect quality
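As a rough sanity check on the VRAM figures: weight storage alone scales with bits per parameter. A back-of-envelope sketch (the helper `weight_gb` is hypothetical; it ignores the KV cache, activations, and tensors kept in higher precision, which is why the measured 6.3 GB exceeds the weight-only estimate):

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB: params * bits / 8 bits per byte."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

fp16_weights = weight_gb(5, 16)   # ~10 GB for ~5B params, near the 9.5 GB measured
nf4_weights = weight_gb(5, 4.5)   # NF4 plus double-quant overhead, roughly 4.5 bits/param
print(f"FP16 weights ~{fp16_weights:.1f} GB, NF4 weights ~{nf4_weights:.1f} GB")
```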

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model = AutoModelForCausalLM.from_pretrained(
    "MichaelLowrance/gemma-4-e2b-q4",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16, store weights in NF4
        bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
        bnb_4bit_quant_type="nf4",
    ),
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("MichaelLowrance/gemma-4-e2b-q4")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # open the assistant turn so the model replies
    return_tensors="pt",
    return_dict=True,
).to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Hardware

Tested on: NVIDIA RTX 5090 (Blackwell, sm_120, 32GB GDDR7)
CUDA: 12.8 | Python: 3.12 | transformers: 5.6.0.dev

Model size: 5B params (Safetensors) · Tensor types: F32, BF16, U8