---
license: llama3.1
tags:
  - llama3.1
  - quantization
  - bitsandbytes
  - nlp
  - instruct
library_name: transformers
---

# 🚀 Quantized Llama-3.1-8B-Instruct Model

This is a 4-bit quantized version of the `meta-llama/Llama-3.1-8B-Instruct` model, optimized for efficient inference in resource-constrained environments such as Google Colab's NVIDIA T4 GPU.

## 🧠 Model Description

The model was quantized using the `bitsandbytes` library to reduce memory usage while maintaining performance for instruction-following tasks.

## 🧮 Quantization Details

- **Base Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Quantization Method**: 4-bit (NormalFloat4, NF4) with double quantization
- **Compute Dtype**: float16
- **Library**: `bitsandbytes==0.43.3`
- **Framework**: `transformers==4.45.1`
- **Hardware**: NVIDIA T4 GPU (16GB VRAM) in Google Colab
- **Date**: Quantized on June 20, 2025
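
As a quick sanity check, the memory footprint of a model loaded with this configuration can be inspected via the standard `get_memory_footprint` helper in `transformers` (a minimal sketch; the exact figure depends on your environment):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# The same 4-bit NF4 configuration described above
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)

# An 8B model in 4-bit NF4 typically lands around 5-6 GB,
# versus roughly 16 GB for the same weights in float16.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```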

## 📦 Files Included

- `README.md`: This file
- `config.json`, `pytorch_model.bin` (or sharded checkpoints): Model weights
- `special_tokens_map.json`, `tokenizer.json`, `tokenizer_config.json`: Tokenizer files

## Usage

To load and use the quantized model for inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Define quantization configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/quantized_Llama-3.1-8B-Instruct",  # Replace with your Hugging Face repo ID
    quantization_config=quant_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/quantized_Llama-3.1-8B-Instruct")

# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Perform inference
prompt = "Hello, how can I assist you today?"
output = generator(prompt, max_length=50, num_return_sequences=1)
print(output)
```
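
Because the base model is instruction-tuned, prompts generally work better when wrapped in the Llama 3.1 chat template instead of being passed as raw text. A minimal sketch, continuing from the snippet above (the message contents are purely illustrative):

```python
# Build a chat-formatted prompt with the tokenizer's built-in template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain 4-bit quantization in one sentence."},
]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate only the assistant's reply, not the echoed prompt
result = generator(chat_prompt, max_new_tokens=128, return_full_text=False)
print(result[0]["generated_text"])
```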

## Quantization Process

The model was quantized in Google Colab using the following script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from huggingface_hub import login

# Log in to Hugging Face
login()  # Requires a Hugging Face token

# Define quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load and quantize the model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# Llama tokenizers ship without a padding token, so fall back to EOS
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Save the quantized model
quant_path = "/content/quantized_Llama-3.1-8B-Instruct"
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
```

## Requirements

- **Hardware**: NVIDIA GPU with CUDA 11.4+ (e.g., T4, A100)
- **Python**: 3.10+
- **Dependencies**:
  - `transformers==4.45.1`
  - `bitsandbytes==0.43.3`
  - `accelerate==0.33.0`
  - `torch` (with CUDA support)
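
A quick environment check before loading the model (a minimal sketch; other version combinations may work but are untested here):

```python
import torch
import transformers
import bitsandbytes
import accelerate

# The 4-bit kernels in bitsandbytes require a CUDA-capable GPU
assert torch.cuda.is_available(), "No CUDA device found; bitsandbytes 4-bit needs a GPU"

print("GPU:", torch.cuda.get_device_name(0))
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("accelerate:", accelerate.__version__)
```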

## Notes

- The quantized model is stored in `/content/quantized_Llama-3.1-8B-Instruct` in the Colab environment.
- Due to Colab's ephemeral storage, consider pushing the model to the Hugging Face Hub or saving it to Google Drive for persistence (see the sketch after this list).
- Access to the base model requires a Hugging Face token and approval from Meta AI.
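
A minimal sketch of both persistence options, continuing from the quantization script above (the repo ID is a placeholder, and pushing requires a write-scoped Hugging Face token):

```python
# Option 1: push the quantized model and tokenizer to the Hugging Face Hub
model.push_to_hub("your-username/quantized_Llama-3.1-8B-Instruct")
tokenizer.push_to_hub("your-username/quantized_Llama-3.1-8B-Instruct")

# Option 2: copy the saved folder to Google Drive (Colab only)
from google.colab import drive
import shutil

drive.mount("/content/drive")
shutil.copytree(
    "/content/quantized_Llama-3.1-8B-Instruct",
    "/content/drive/MyDrive/quantized_Llama-3.1-8B-Instruct",
)
```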

## License

This model inherits the license of the base model `meta-llama/Llama-3.1-8B-Instruct`. Refer to the original model card: [Meta AI Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

## Acknowledgments

- Created using Hugging Face Transformers and `bitsandbytes` for quantization.
- Quantized in Google Colab with a T4 GPU on June 20, 2025.