Quickstart

Welcome to bitsandbytes! This library enables accessible large language models via k-bit quantization for PyTorch, dramatically reducing memory consumption for inference and training.

Installation

pip install bitsandbytes

Requirements: Python 3.10+, PyTorch 2.3+

For detailed installation instructions, see the Installation Guide.
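
To confirm the install works, a quick sanity check (a minimal sketch; it only assumes a CUDA-capable PyTorch build):

import torch
import bitsandbytes as bnb

# Verify the package imports and that PyTorch sees a GPU
print("bitsandbytes:", bnb.__version__)
print("CUDA available:", torch.cuda.is_available())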

What is bitsandbytes?

bitsandbytes provides three main features:

  • LLM.int8(): 8-bit quantization for inference (50% memory reduction)
  • QLoRA: 4-bit quantization for training (75% memory reduction)
  • 8-bit Optimizers: Memory-efficient optimizers for training

Quick Examples

8-bit Inference

Load and run a model using 8-bit quantization:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
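
To see the savings directly, the loaded model's size can be checked with Transformers' get_memory_footprint() (a quick check on the model loaded above, not part of the original snippet):

# int8 weights take roughly half the memory of the fp16 checkpoint
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")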

Learn more: See the Integrations guide for more details on using bitsandbytes with Transformers.

4-bit Quantization

For even greater memory savings:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation on the dequantized weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the 4-bit data type introduced by QLoRA
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
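
The 4-bit model is used exactly like the 8-bit one above; a minimal generation sketch reusing the same prompt (assumes a CUDA device, as before):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))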

QLoRA Fine-tuning

Combine 4-bit quantization with LoRA for efficient training:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load 4-bit model
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Now train with your preferred trainer
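
Only the LoRA adapter weights remain trainable after this setup; as a quick check (a small addition, using PEFT's built-in helper):

# Prints the trainable (LoRA) parameter count vs. total parameters
model.print_trainable_parameters()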

Learn more: See the FSDP-QLoRA guide for advanced training techniques and the Integrations guide for using with PEFT.

8-bit Optimizers

Use 8-bit optimizers to cut the memory used by optimizer states by roughly 75%:

import bitsandbytes as bnb

model = YourModel()  # placeholder: any torch.nn.Module

# Replace the standard optimizer (e.g. torch.optim.Adam) with its 8-bit version
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

# Use it in the training loop as normal
for batch in dataloader:          # placeholder: your DataLoader
    loss = model(batch)           # forward pass that returns the loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
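
Other optimizers have 8-bit counterparts that work as drop-in replacements in the same way; for example (a sketch, assuming typical AdamW hyperparameters):

# AdamW with 8-bit optimizer states; arguments mirror torch.optim.AdamW
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-3, weight_decay=0.01)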

Learn more: See the 8-bit Optimizers guide for detailed usage and configuration options.

Custom Quantized Layers

Use quantized linear layers directly in your models:

import torch
import bitsandbytes as bnb

# 8-bit linear layer (has_fp16_weights=False keeps the weights in int8 for inference)
linear_8bit = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False)

# 4-bit linear layer (compute_dtype sets the precision used for the matmul)
linear_4bit = bnb.nn.Linear4bit(1024, 1024, compute_dtype=torch.bfloat16)
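
To load pretrained weights into a quantized layer, the usual pattern is to copy the fp16/fp32 state dict first and then move the module to the GPU, which is when the weights are actually quantized. A minimal sketch (the plain torch.nn.Linear source layer here is only for illustration):

import torch
import bitsandbytes as bnb

# Regular layer whose weights we want to quantize
fp16_linear = torch.nn.Linear(1024, 1024)

linear_8bit = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False)
linear_8bit.load_state_dict(fp16_linear.state_dict())
linear_8bit = linear_8bit.to("cuda")  # quantization to int8 happens on this move

x = torch.randn(1, 1024, dtype=torch.float16, device="cuda")
out = linear_8bit(x)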

Next Steps

Getting Help
