Gemma 4 31B IT - 4-bit NF4 Quantized

A 4-bit NF4 quantized version of google/gemma-4-31b-it, produced with bitsandbytes using double quantization.

Quantization Details

Method: NF4 (4-bit NormalFloat)
Double Quantization: Enabled (quantizes the quantization constants)
Compute dtype: bfloat16
Skipped modules: lm_head (kept in full bfloat16)
Original size: ~62 GB (bf16)
Quantized size: ~20 GB
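
The settings listed above correspond to standard bitsandbytes options in transformers. The exact script used to produce this checkpoint is not given here, so the following is only a sketch of an equivalent configuration. The helper names `nf4_config_kwargs` and `build_quantization_config` are hypothetical and chosen for illustration.

```python
def nf4_config_kwargs():
    """Keyword arguments mirroring the quantization details listed above."""
    return {
        "load_in_4bit": True,                     # 4-bit weights
        "bnb_4bit_quant_type": "nf4",             # NormalFloat4 data type
        "bnb_4bit_use_double_quant": True,        # quantize the quantization constants
        "bnb_4bit_compute_dtype": "bfloat16",     # dtype used for matmuls at runtime
        "llm_int8_skip_modules": ["lm_head"],     # leave lm_head unquantized
    }


def build_quantization_config():
    # Imported lazily so the kwargs can be inspected without transformers installed.
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**nf4_config_kwargs())
```

A config like this would be passed as `quantization_config=` to `from_pretrained` when quantizing the original google/gemma-4-31b-it weights. It should not be needed when loading this repository, because a pre-quantized checkpoint carries its quantization settings with it.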

Usage

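import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "suitch/gemma-4-31B-it-4bit"

# The NF4 quantization settings are stored with the checkpoint,
# so no BitsAndBytesConfig has to be passed at load time.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the weights on the available GPU(s) automatically
    trust_remote_code=True,
)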

# Video inference
messages = [
    {"role": "user", "content": [
        {"type": "video", "video": "path/to/video.mp4"},
        {"type": "text", "text": "Describe what you see in this video."}
    ]}
]

# Render the chat template to a prompt string, then tokenize it together with the video.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, videos=["path/to/video.mp4"], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (everything after the prompt).
response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)
print(processor.parse_response(response))
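
The usage snippet above hard-codes the video path twice and slices the prompt tokens off the generated sequence inline. As a small sketch, both steps can be factored into helpers. The function names `build_video_messages` and `new_token_ids` are hypothetical and not part of transformers.

```python
def build_video_messages(video_path, prompt):
    """Build a single-turn chat message list with one video and one text part."""
    return [
        {"role": "user", "content": [
            {"type": "video", "video": video_path},
            {"type": "text", "text": prompt},
        ]}
    ]


def new_token_ids(sequence, prompt_len):
    """Drop the first prompt_len entries, keeping only the generated tokens."""
    return sequence[prompt_len:]


messages = build_video_messages("path/to/video.mp4", "Describe what you see in this video.")
print(new_token_ids([1, 2, 3, 4, 5], prompt_len=3))  # [4, 5]
```

With these helpers, the same `video_path` variable can be reused for both the message list and the `videos=` argument, and `new_token_ids(outputs[0], inputs["input_ids"].shape[-1])` performs the same slicing as the decode line above.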

Requirements

transformers >= 4.48
torch >= 2.4
bitsandbytes >= 0.46
torchcodec (for video input)
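
To check an environment against the minimum versions listed above, something like the following sketch could be used. It relies only on the standard library. The names `MINIMUMS`, `parse_version`, and `check_requirements` are illustrative, and the version parsing is deliberately simple (leading numeric components only).

```python
import re
from importlib import metadata

# Minimum versions from the requirements list; an empty tuple means "any version".
MINIMUMS = {
    "transformers": (4, 48),
    "torch": (2, 4),
    "bitsandbytes": (0, 46),
    "torchcodec": (),
}


def parse_version(text):
    """Return the leading numeric components, e.g. '2.4.1+cu121' -> (2, 4, 1)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return a dict mapping package name to a description of the problem."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if parse_version(installed) < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems[name] = f"{installed} is older than {wanted}"
    return problems


if __name__ == "__main__":
    for package, problem in check_requirements().items():
        print(f"{package}: {problem}")
```

An empty result means every listed package is installed at or above its minimum version.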

Hardware Requirements

~16 GB GPU VRAM for inference
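
For intuition about where a figure of this order comes from, here is a back-of-the-envelope estimate of weight memory. It is a sketch under stated assumptions, not a measurement: it assumes 31e9 parameters (taken from the model name), treats all of them as quantized, and uses the usual bitsandbytes 4-bit layout of one absmax constant per 64 weights, with double quantization storing those constants in 8 bits plus one 32-bit constant per 256 blocks.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, dq_block_size=256):
    """Effective bits per weight for NF4, including quantization constants."""
    if double_quant:
        # 8-bit absmax per block, plus a 32-bit constant per dq_block_size blocks.
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One 32-bit absmax per block.
        overhead = 32 / block_size
    return 4 + overhead


def estimate_weight_gb(n_params, bits_per_param):
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 31e9  # assumption: parameter count taken from the model name
print(f"bf16: {estimate_weight_gb(n_params, 16):.1f} GB")
print(f"NF4 + double quant: {estimate_weight_gb(n_params, nf4_bits_per_param()):.1f} GB")
```

Under these assumptions the estimate comes to roughly 62 GB in bf16 and about 16 GB in NF4. The real checkpoint is larger (~20 GB) because lm_head is kept in bfloat16 and bitsandbytes quantizes only linear layers, so embeddings and normalization weights stay unquantized. Activations and the KV cache add to memory use at inference time.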

Safetensors

Model size: 32B params
Tensor type: F32, BF16