Gemma-3-1B-IT BitsAndBytesConfig NF4 Quantized

This model is a 4-bit NF4 quantization of google/gemma-3-1b-it-qat-int4-unquantized, produced with the Transformers BitsAndBytesConfig API.

Model Details

  • Base Model: google/gemma-3-1b-it-qat-int4-unquantized
  • Quantization: BitsAndBytesConfig NF4 (4-bit)
  • Quantization Type: NF4 with double quantization
  • Compute Dtype: bfloat16
  • Storage Dtype: uint8
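These settings are stored in the checkpoint's config.json, so they can be inspected without downloading the weights. A quick sketch:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4")
print(config.quantization_config)  # quant type, compute/storage dtypes, double-quant flag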

Quantization Configuration

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear layers to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bfloat16 for matmuls
    bnb_4bit_quant_storage=torch.uint8,     # pack 4-bit weights into uint8 tensors
)
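Applied at load time, this configuration can reproduce a checkpoint like this one from the base model. A minimal sketch, assuming standard Transformers serialization (the exact export script used for this repo is not published):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "google/gemma-3-1b-it-qat-int4-unquantized"

# Quantize the base model's linear layers to NF4 while loading
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Saving 4-bit bitsandbytes models requires a recent transformers/bitsandbytes
model.save_pretrained("gemma-3-1b-it-qat-int4-bnb-nf4")
tokenizer.save_pretrained("gemma-3-1b-it-qat-int4-bnb-nf4")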

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the quantized model (the NF4 settings are read from the checkpoint's config)
model = AutoModelForCausalLM.from_pretrained(
    "WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4")

# Generate text (move inputs to the model's device)
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
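Since this is an instruction-tuned checkpoint, chat-style prompts should go through the tokenizer's chat template. A minimal sketch, reusing the model and tokenizer loaded above:

# Format a conversation with the built-in Gemma chat template
messages = [
    {"role": "user", "content": "Explain NF4 quantization in one sentence."},
]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn header
    return_tensors="pt",
).to(model.device)

chat_outputs = model.generate(chat_inputs, max_new_tokens=100)
# Decode only the newly generated tokens
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))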

Benefits

  • Reduced Memory Usage: roughly 75% smaller weight footprint than 16-bit (FP16/BF16) weights (see the check below)
  • Lower Memory Traffic: 4-bit weight reads can help memory-bound decoding, though on-the-fly dequantization adds some compute overhead
  • Maintained Quality: NF4 targets normally distributed weights and typically loses little quality at 4-bit
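To verify the footprint on your own setup, Transformers exposes get_memory_footprint(). Reusing the model loaded in the Usage section (the exact number varies by environment):

# Weight memory as reported by Transformers, in GiB
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gb:.2f} GiB")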

Hardware Requirements

  • GPU Memory: roughly 1GB of VRAM for the weights (vs ~2GB for BF16 weights), plus KV cache and activation overhead
  • CUDA Compatible: the bitsandbytes 4-bit kernels target CUDA GPUs (see the device check below)
  • CPU Fallback: possible only with recent multi-backend bitsandbytes builds, at significantly reduced speed
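A quick device check before loading can save a failed download. A minimal sketch:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total VRAM")
else:
    print("No CUDA GPU detected; bitsandbytes 4-bit kernels may be unavailable.")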

Quantization Details

This model uses BitsAndBytesConfig for 4-bit quantization:

  • NF4 (NormalFloat4) quantization, designed for normally distributed weights, for a strong quality/size trade-off
  • Double quantization, which re-quantizes the per-block scaling constants for extra compression (see the back-of-envelope sketch below)
  • bfloat16 compute dtype: weights are dequantized to bfloat16 on the fly for matrix multiplications
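As a back-of-envelope illustration of what double quantization saves, using the QLoRA-default block sizes (an assumption; this checkpoint's exact block sizes may differ):

# Approximate bits per parameter for NF4 (QLoRA-default block sizes, assumed)
WEIGHT_BITS = 4   # NF4 payload per weight
BLOCK = 64        # weights per quantization block
CONST_BITS = 32   # one fp32 absmax constant per block

single = WEIGHT_BITS + CONST_BITS / BLOCK                      # 4.500 bits/param
# Double quantization: constants re-quantized to 8-bit in groups of 256,
# with one fp32 meta-constant per group
double = WEIGHT_BITS + 8 / BLOCK + CONST_BITS / (BLOCK * 256)  # ~4.127 bits/param
print(f"single: {single:.3f} bits/param, double: {double:.3f} bits/param")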

License

This model inherits the Gemma Terms of Use license from the base model.
