---
license: llama3.1
tags:
- llama3.1
- quantization
- bitsandbytes
- nlp
- instruct
library_name: transformers
---
# 🚀 Quantized Llama-3.1-8B-Instruct Model
This is a 4-bit quantized version of the `meta-llama/Llama-3.1-8B-Instruct` model, optimized for efficient inference in resource-constrained environments such as Google Colab's NVIDIA T4 GPU.
## 🧠 Model Description
The model was quantized with the `bitsandbytes` library, cutting weight memory roughly fourfold while keeping instruction-following quality close to the full-precision baseline.
## 🧮 Quantization Details
- **Base Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Quantization Method**: 4-bit (NormalFloat4, NF4) with double quantization
- **Compute Dtype**: float16
- **Library**: `bitsandbytes==0.43.3`
- **Framework**: `transformers==4.45.1`
- **Hardware**: NVIDIA T4 GPU (16GB VRAM) in Google Colab
- **Date**: Quantized on June 20, 2025
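As a rough sanity check (a back-of-envelope sketch, not a measured figure), 4-bit weights bring the ~16 GB float16 footprint down to roughly 4 GB, which is what lets the model fit on a 16GB T4 alongside activations and the KV cache:

```python
# Back-of-envelope weight-memory estimate (assumes ~8.03e9 parameters; ignores
# the small overhead of NF4 quantization constants, which double quantization
# further compresses).
n_params = 8.03e9
gb_fp16 = n_params * 2 / 1e9   # 2 bytes per weight in float16
gb_nf4 = n_params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight
print(f"float16: ~{gb_fp16:.1f} GB, NF4: ~{gb_nf4:.1f} GB")
```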
## 📦 Files Included
- `README.md`: This file
- `config.json`: Model configuration
- `model.safetensors` or `pytorch_model.bin` (possibly sharded): Model weights
- `special_tokens_map.json`, `tokenizer.json`, `tokenizer_config.json`: Tokenizer files
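To confirm what was actually uploaded (exact weight file names depend on the `transformers` version used to save), you can list the repo contents. A minimal sketch, reusing the placeholder repo ID from this card:

```python
from huggingface_hub import list_repo_files

# Replace with your Hugging Face repo ID
files = list_repo_files("your-username/quantized_Llama-3.1-8B-Instruct")
print("\n".join(files))
```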
## Usage
To load and use the quantized model for inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Define the 4-bit quantization configuration
# (must match the settings used at quantization time)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/quantized_Llama-3.1-8B-Instruct",  # Replace with your Hugging Face repo ID
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/quantized_Llama-3.1-8B-Instruct")

# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Perform inference; max_new_tokens bounds the generated continuation
# regardless of prompt length
prompt = "Hello, how can I assist you today?"
output = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(output)
```
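Since this is an instruction-tuned chat model, prompts formatted with the tokenizer's chat template generally work better than raw strings. A minimal sketch, assuming the `model` and `tokenizer` loaded above and that the repo ships the standard Llama 3.1 chat template:

```python
# Build a chat-formatted prompt with the tokenizer's built-in template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain 4-bit quantization in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts replying
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```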
## Quantization Process
The model was quantized in Google Colab using the following script:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from huggingface_hub import login

# Log in to Hugging Face (required for gated access to the base model)
login()

# Define quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the base model; bitsandbytes quantizes the weights to NF4 on the fly
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Llama tokenizers ship without a pad token; fall back to the EOS token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Save the quantized model and tokenizer
quant_path = "/content/quantized_Llama-3.1-8B-Instruct"
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
```
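To keep the checkpoint beyond the Colab session, you can push it straight to the Hub from the same runtime. A minimal sketch; the repo ID below is a placeholder for your own:

```python
# Upload the quantized checkpoint to your own Hub repo
# (uses the write token from login() above)
model.push_to_hub("your-username/quantized_Llama-3.1-8B-Instruct")
tokenizer.push_to_hub("your-username/quantized_Llama-3.1-8B-Instruct")
```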
## Requirements
- **Hardware**: NVIDIA GPU with CUDA 11.4+ (e.g., T4, A100)
- **Python**: 3.10+
- **Dependencies**:
- `transformers==4.45.1`
- `bitsandbytes==0.43.3`
- `accelerate==0.33.0`
- `torch` (with CUDA support)
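A quick way to verify the environment matches the pinned versions before loading the model:

```python
# Quick environment check (versions are pinned above)
import torch, transformers, bitsandbytes, accelerate

print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("accelerate:", accelerate.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected; bitsandbytes 4-bit inference requires one.")
```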
## Notes
- The quantized model is stored in `/content/quantized_Llama-3.1-8B-Instruct` in the Colab environment.
- Due to Colab's ephemeral storage, consider pushing to the Hugging Face Hub (as sketched above) or saving to Google Drive (see the sketch after this list) for persistence.
- Access to the base model requires a Hugging Face token and approval from Meta AI.
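For the Google Drive option, a minimal Colab sketch (assumes a standard Colab runtime and the default `MyDrive` layout):

```python
# Mount Google Drive and copy the saved checkpoint over (Colab only)
from google.colab import drive
import shutil

drive.mount("/content/drive")
shutil.copytree(
    "/content/quantized_Llama-3.1-8B-Instruct",
    "/content/drive/MyDrive/quantized_Llama-3.1-8B-Instruct",
)
```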
## License
This model inherits the license of the base model `meta-llama/Llama-3.1-8B-Instruct`. Refer to the original model card: [Meta AI Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
## Acknowledgments
- Created using Hugging Face Transformers and `bitsandbytes` for quantization.
- Quantized in Google Colab with a T4 GPU on June 20, 2025.