Gemma 4 31B IT - 4-bit NF4 Quantized
4-bit NF4 quantized version of google/gemma-4-31b-it using bitsandbytes with double quantization.
Quantization Details
- Method: NF4 (4-bit NormalFloat)
- Double Quantization: Enabled (quantizes the quantization constants)
- Compute dtype: bfloat16
- Skipped modules: lm_head (kept in full bfloat16)
- Original size: ~62 GB (bf16)
- Quantized size: ~20 GB
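The sizes above can be sanity-checked with some back-of-envelope arithmetic. The sketch below assumes exactly 31e9 parameters, a 4-bit block size of 64 (the bitsandbytes default), and that every parameter is quantized. That last assumption does not hold here: lm_head stays in bfloat16, and bitsandbytes only replaces linear layers, so the result is a lower bound. That likely accounts for part of the gap to the ~20 GB figure.

```python
# Rough weight-size estimate. All inputs are assumptions, not measured values.
PARAMS = 31e9          # assumed from the "31B" in the model name
BLOCK_SIZE = 64        # bitsandbytes default block size for 4-bit quantization
SECOND_LEVEL_BLOCK = 256  # constants are grouped in blocks of 256 under double quantization

def gigabytes(params: float, bits_per_param: float) -> float:
    """Convert a parameter count and bits-per-parameter into decimal GB."""
    return params * bits_per_param / 8 / 1e9

# Plain NF4: one fp32 scaling constant per block of 64 weights.
plain_overhead_bits = 32 / BLOCK_SIZE

# Double quantization: 8-bit constants per block, plus one fp32
# second-level constant per 256 first-level constants.
dq_overhead_bits = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

bf16_gb = gigabytes(PARAMS, 16)
nf4_plain_gb = gigabytes(PARAMS, 4 + plain_overhead_bits)
nf4_dq_gb = gigabytes(PARAMS, 4 + dq_overhead_bits)

if __name__ == "__main__":
    print(f"bf16:                 {bf16_gb:.1f} GB")
    print(f"NF4, single quant:    {nf4_plain_gb:.1f} GB")
    print(f"NF4, double quant:    {nf4_dq_gb:.1f} GB (lower bound)")
```

Under these assumptions, double quantization lowers the per-parameter overhead from 0.5 bits to roughly 0.127 bits.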
Usage
import torch
from transformers import AutoProcessor, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "suitch/gemma-4-31B-it-4bit"

# The checkpoint is already quantized; the quantization settings are read
# from its config, so no quantization config is passed here.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Video inference
messages = [
    {"role": "user", "content": [
        {"type": "video", "video": "path/to/video.mp4"},
        {"type": "text", "text": "Describe what you see in this video."}
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, videos=["path/to/video.mp4"], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)
print(processor.parse_response(response))
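For readers who want to quantize the base model themselves, the settings listed under Quantization Details map onto a `BitsAndBytesConfig` roughly as follows. This is a sketch, not the script used to produce this checkpoint (which is not part of this card); the helper name `load_quantized_base` is hypothetical, and loading the base model requires downloading the full bf16 weights.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Base checkpoint named at the top of this card.
base_model_id = "google/gemma-4-31b-it"

# Assumed mapping of the listed settings onto BitsAndBytesConfig arguments.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Method: NF4
    bnb_4bit_use_double_quant=True,         # Double Quantization: Enabled
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute dtype: bfloat16
    llm_int8_skip_modules=["lm_head"],      # Skipped modules: lm_head
)

def load_quantized_base():
    """Hypothetical helper: quantize the base model on load."""
    return AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,  # dtype for modules left unquantized
        device_map="auto",
    )
```

Despite its name, `llm_int8_skip_modules` also applies to 4-bit loading; it lists modules that are left unquantized.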
Requirements
- transformers >= 4.48
- torch >= 2.4
- bitsandbytes >= 0.46
- torchcodec (for video input)
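A quick way to compare an environment against the minimums listed above is sketched below, using only the standard library. The names `MINIMUMS`, `release_tuple`, `meets_minimum`, and `check_requirements` are illustrative, and the version comparison is deliberately simple: it looks only at the leading numeric components and ignores pre-release or local suffixes.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above; torchcodec has no stated minimum.
MINIMUMS = {
    "transformers": "4.48",
    "torch": "2.4",
    "bitsandbytes": "0.46",
    "torchcodec": None,
}

def release_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '2.4.0+cu121' -> (2, 4, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples after zero-padding them to equal length."""
    have, need = release_tuple(installed), release_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need

def check_requirements() -> dict:
    """Return a short status string per package: version, 'missing', or 'below X'."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = minimum is None or meets_minimum(installed, minimum)
        report[name] = f"{installed} ({'ok' if ok else 'below ' + minimum})"
    return report

if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```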
Hardware Requirements
- ~16 GB GPU VRAM for inference