Gemma 4 31B IT - 4-bit NF4 Quantized

A 4-bit NF4 quantized version of google/gemma-4-31b-it, produced with bitsandbytes with double quantization enabled.

Quantization Details

  • Method: NF4 (4-bit NormalFloat)
  • Double Quantization: Enabled (quantizes the quantization constants)
  • Compute dtype: bfloat16
  • Skipped modules: lm_head (kept in full bfloat16)
  • Original size: ~62 GB (bf16)
  • Quantized size: ~20 GB
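
The settings listed above map fairly directly onto the bitsandbytes options exposed by transformers. The following is a minimal sketch of what such a configuration could look like; it is an assumption based on the list above, not the author's actual quantization script. The names NF4_KWARGS and build_quantization_config are hypothetical helpers introduced only for illustration.

```python
# Keyword arguments mirroring the "Quantization Details" list.
# Assumption: this reflects the listed settings, not the author's exact script.
NF4_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",          # 4-bit NormalFloat
    "bnb_4bit_use_double_quant": True,     # also quantize the quantization constants
    "bnb_4bit_compute_dtype": "bfloat16",  # string form is resolved to torch.bfloat16
    "llm_int8_skip_modules": ["lm_head"],  # despite the name, also applied in 4-bit mode
}


def build_quantization_config():
    """Build a BitsAndBytesConfig from NF4_KWARGS (hypothetical helper).

    transformers is imported lazily so the settings above can be inspected
    on a machine without transformers or bitsandbytes installed.
    """
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**NF4_KWARGS)
```

Passing quantization_config=build_quantization_config() to from_pretrained on the original google/gemma-4-31b-it weights would, under these assumptions, quantize them at load time. Loading this pre-quantized checkpoint should not need it, since the quantization settings are saved with the model.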

Usage

import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "suitch/gemma-4-31B-it-4bit"

# The checkpoint is already quantized and its quantization settings are saved
# with the model, so no BitsAndBytesConfig needs to be passed here.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Video inference
video_path = "path/to/video.mp4"
messages = [
    {"role": "user", "content": [
        {"type": "video", "video": video_path},
        {"type": "text", "text": "Describe what you see in this video."}
    ]}
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, videos=[video_path], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (everything after the prompt).
prompt_length = inputs["input_ids"].shape[-1]
response = processor.decode(outputs[0][prompt_length:], skip_special_tokens=False)
print(processor.parse_response(response))

Requirements

  • transformers >= 4.48
  • torch >= 2.4
  • bitsandbytes >= 0.46
  • torchcodec (for video input)

Hardware Requirements

  • ~16 GB GPU VRAM for inference
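
As a rough sanity check, the weight footprint can be estimated from the parameter count. The sketch below assumes 31 billion parameters (taken from the model name) and the block sizes commonly described for NF4 with double quantization (64 weights per block, 256 blocks per second-level group); these are assumptions, not values confirmed by this model card. It covers the quantized weights only: modules kept in bfloat16 such as lm_head, activations, and the KV cache all add to the total, so the result is a lower bound.

```python
PARAMS = 31e9   # assumed from the "31B" in the model name
BLOCK = 64      # assumed weights per quantization block
BLOCK2 = 256    # assumed blocks per second-level (double quantization) group


def bits_per_param_nf4(double_quant=True):
    """Approximate storage cost per weight, including quantization constants."""
    if double_quant:
        # 4-bit weight + 8-bit constant per block + 32-bit constant per group
        return 4 + 8 / BLOCK + 32 / (BLOCK * BLOCK2)
    # 4-bit weight + one 32-bit constant per block
    return 4 + 32 / BLOCK


def size_gb(params, bits):
    """Convert a parameter count and bits per parameter to decimal gigabytes."""
    return params * bits / 8 / 1e9


bf16_gb = size_gb(PARAMS, 16)
nf4_gb = size_gb(PARAMS, bits_per_param_nf4())
print(f"bf16: {bf16_gb:.1f} GB, NF4 + double quant: {nf4_gb:.1f} GB")
```

Under these assumptions the bf16 estimate comes to about 62 GB, matching the original size listed above, and the fully quantized estimate comes to about 16 GB before any unquantized modules or runtime overhead are counted.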