File size: 4,042 Bytes
2ae60f5 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 |
---
license: llama3.1
tags:
- llama3.1
- quantization
- bitsandbytes
- nlp
- instruct
library_name: transformers
---
# 🚀 Quantized Llama-3.1-8B-Instruct Model
This is a 4-bit quantized version of the `meta-llama/Llama-3.1-8B-Instruct` model, optimized for efficient inference on resource-constrained environments like Google Colab's NVIDIA T4 GPU.
## 🧠 Model Description
The model was quantized using the `bitsandbytes` library to reduce memory usage while maintaining performance for instruction-following tasks.
## 🧮 Quantization Details
- **Base Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Quantization Method**: 4-bit (NormalFloat4, NF4) with double quantization
- **Compute Dtype**: float16
- **Library**: `bitsandbytes==0.43.3`
- **Framework**: `transformers==4.45.1`
- **Hardware**: NVIDIA T4 GPU (16GB VRAM) in Google Colab
- **Date**: Quantized on June 20, 2025
## 📦 Files Included
- `README.md`: This file
- `config.json`, `pytorch_model.bin` (or sharded checkpoints): Model weights
- `special_tokens_map.json`, `tokenizer.json`, `tokenizer_config.json`: Tokenizer files
## Usage
To load and use the quantized model for inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch
# Define quantization configuration
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True
)
# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
"your-username/quantized_Llama-3.1-8B-Instruct", # Replace with your Hugging Face repo ID
quantization_config=quant_config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/quantized_Llama-3.1-8B-Instruct")
# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Perform inference
prompt = "Hello, how can I assist you today?"
output = generator(prompt, max_length=50, num_return_sequences=1)
print(output)
```
## Quantization Process
The model was quantized in Google Colab using the following script:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from huggingface_hub import login
# Log in to Hugging Face
login() # Requires a Hugging Face token
# Define quantization configuration
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True
)
# Load and quantize the model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
quantization_config=quantization_config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token if tokenizer.pad_token is None else tokenizer.pad_token
# Save the quantized model
quant_path = "/content/quantized_Llama-3.1-8B-Instruct"
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
```
## Requirements
- **Hardware**: NVIDIA GPU with CUDA 11.4+ (e.g., T4, A100)
- **Python**: 3.10+
- **Dependencies**:
- `transformers==4.45.1`
- `bitsandbytes==0.43.3`
- `accelerate==0.33.0`
- `torch` (with CUDA support)
## Notes
- The quantized model is stored in `/content/quantized_Llama-3.1-8B-Instruct` in the Colab environment.
- Due to Colab's ephemeral storage, consider pushing to Hugging Face Hub or saving to Google Drive for persistence.
- Access to the base model requires a Hugging Face token and approval from Meta AI.
## License
This model inherits the license of the base model `meta-llama/Llama-3.1-8B-Instruct`. Refer to the original model card: [Meta AI Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
## Acknowledgments
- Created using Hugging Face Transformers and `bitsandbytes` for quantization.
- Quantized in Google Colab with a T4 GPU on June 20, 2025.
|