---
license: llama3.1
tags:
- llama3.1
- quantization
- bitsandbytes
- nlp
- instruct
library_name: transformers
---

# 🚀 Quantized Llama-3.1-8B-Instruct Model

This is a 4-bit quantized version of the `meta-llama/Llama-3.1-8B-Instruct` model, optimized for efficient inference in resource-constrained environments such as Google Colab's NVIDIA T4 GPU.

## 🧠 Model Description

The model was quantized using the `bitsandbytes` library to reduce memory usage while maintaining performance on instruction-following tasks.

## 🧮 Quantization Details

- **Base Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Quantization Method**: 4-bit (NormalFloat4, NF4) with double quantization
- **Compute Dtype**: float16
- **Library**: `bitsandbytes==0.43.3`
- **Framework**: `transformers==4.45.1`
- **Hardware**: NVIDIA T4 GPU (16GB VRAM) in Google Colab
- **Date**: Quantized on June 20, 2025
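
To confirm the savings on your own hardware, one option is to load the checkpoint with these same settings and print its memory footprint. This is a minimal sketch: the repo ID is a placeholder, and `get_memory_footprint()` reports parameter/buffer memory only, not activation memory.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Recreate the 4-bit settings listed above (NF4, double quantization, fp16 compute).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-username/quantized_Llama-3.1-8B-Instruct",  # placeholder repo ID
    quantization_config=quant_config,
    device_map="auto",
)

# Parameter/buffer memory held by the quantized model, in GiB.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")
```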

## 📦 Files Included

- `README.md`: This file
- `config.json` and the model weights (`model.safetensors` or `pytorch_model.bin`, possibly sharded)
- `special_tokens_map.json`, `tokenizer.json`, `tokenizer_config.json`: Tokenizer files
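
If you just want a local copy of these files, `huggingface_hub` can fetch the whole repository in one call. A short sketch, with the repo ID again a placeholder:

```python
from huggingface_hub import snapshot_download

# Download every file in the repository into a local cache directory.
local_dir = snapshot_download(
    repo_id="your-username/quantized_Llama-3.1-8B-Instruct",  # placeholder repo ID
)
print("Files downloaded to:", local_dir)
```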

## Usage

To load and use the quantized model for inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Define the quantization configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/quantized_Llama-3.1-8B-Instruct",  # Replace with your Hugging Face repo ID
    quantization_config=quant_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/quantized_Llama-3.1-8B-Instruct")

# Create a text-generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Perform inference (max_new_tokens bounds the generated text, not the prompt)
prompt = "Hello, how can I assist you today?"
output = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(output)
```
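
Since this is an instruct-tuned checkpoint, prompts generally work better when wrapped in the Llama 3.1 chat template rather than passed as raw text. A minimal sketch, reusing the `model` and `tokenizer` loaded above:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain in one sentence what 4-bit NF4 quantization does."},
]

# apply_chat_template inserts Llama 3.1's special tokens around each turn.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```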

## Quantization Process

The model was quantized in Google Colab using the following script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from huggingface_hub import login

# Log in to Hugging Face (requires a token with access to the gated base model)
login()

# Define the quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the base model with on-the-fly 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Llama tokenizers ship without a pad token; fall back to EOS
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Save the quantized model
quant_path = "/content/quantized_Llama-3.1-8B-Instruct"
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
```
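
Because Colab's local disk does not persist, a common follow-up is to push the saved checkpoint straight to the Hub. A minimal sketch, assuming you are already logged in via `login()` above; the target repo ID is a placeholder:

```python
# Push the quantized weights and tokenizer to your own Hub repository.
repo_id = "your-username/quantized_Llama-3.1-8B-Instruct"  # placeholder, replace with your repo
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```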

## Requirements

- **Hardware**: NVIDIA GPU with CUDA 11.4+ (e.g., T4, A100)
- **Python**: 3.10+
- **Dependencies**:
  - `transformers==4.45.1`
  - `bitsandbytes==0.43.3`
  - `accelerate==0.33.0`
  - `torch` (with CUDA support)
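
A quick way to verify the environment before loading the model is to check for a CUDA device and print the installed versions. A minimal sketch; the pins above are the versions used for quantization, and newer releases may also work:

```python
import importlib.metadata as md
import torch

# Confirm a CUDA-capable GPU is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Print the installed versions of the pinned dependencies.
for pkg in ("transformers", "bitsandbytes", "accelerate", "torch"):
    print(pkg, md.version(pkg))
```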

## Notes

- The quantized model is saved to `/content/quantized_Llama-3.1-8B-Instruct` in the Colab environment.
- Colab storage is ephemeral, so push the checkpoint to the Hugging Face Hub or copy it to Google Drive for persistence (see the sketch below).
- Access to the base model requires a Hugging Face token and approval from Meta AI.
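
For the Google Drive option, one approach is to mount Drive and copy the saved directory across. A minimal sketch; the destination path under `MyDrive` is an arbitrary example:

```python
import shutil
from google.colab import drive

# Mount Google Drive inside the Colab VM.
drive.mount("/content/drive")

# Copy the quantized checkpoint to Drive (destination path is just an example).
shutil.copytree(
    "/content/quantized_Llama-3.1-8B-Instruct",
    "/content/drive/MyDrive/quantized_Llama-3.1-8B-Instruct",
)
```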

## License

This model inherits the license of the base model `meta-llama/Llama-3.1-8B-Instruct`. Refer to the original model card: [Meta AI Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

## Acknowledgments

- Created using Hugging Face Transformers and `bitsandbytes` for quantization.
- Quantized in Google Colab with a T4 GPU on June 20, 2025.