---
license: llama3.1
tags:
- llama3.1
- quantization
- bitsandbytes
- nlp
- instruct
library_name: transformers
---
# 🚀 Quantized Llama-3.1-8B-Instruct Model
This is a 4-bit quantized version of the `meta-llama/Llama-3.1-8B-Instruct` model, optimized for efficient inference in resource-constrained environments such as Google Colab's NVIDIA T4 GPU.
## 🧠 Model Description
The model was quantized with the `bitsandbytes` library, cutting weight memory roughly fourfold while keeping instruction-following quality close to the full-precision baseline.
## 🧮 Quantization Details
- **Base Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Quantization Method**: 4-bit (NormalFloat4, NF4) with double quantization
- **Compute Dtype**: float16
- **Library**: `bitsandbytes==0.43.3`
- **Framework**: `transformers==4.45.1`
- **Hardware**: NVIDIA T4 GPU (16GB VRAM) in Google Colab
- **Date**: Quantized on June 20, 2025
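As a rough sanity check (a back-of-envelope sketch, not a measured figure), 4-bit weights bring the ~16 GB float16 footprint down to roughly 4 GB, which is what lets the model fit on a 16GB T4 alongside activations and the KV cache:

```python
# Back-of-envelope weight-memory estimate (assumes ~8.03e9 parameters; ignores
# the small overhead of NF4 quantization constants, which double quantization
# further compresses).
n_params = 8.03e9
gb_fp16 = n_params * 2 / 1e9   # 2 bytes per weight in float16
gb_nf4 = n_params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight
print(f"float16: ~{gb_fp16:.1f} GB, NF4: ~{gb_nf4:.1f} GB")
```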
## 📦 Files Included
- `README.md`: This file
- `config.json`: Model configuration
- `model.safetensors` or `pytorch_model.bin` (possibly sharded): Model weights
- `special_tokens_map.json`, `tokenizer.json`, `tokenizer_config.json`: Tokenizer files
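To confirm what was actually uploaded (exact weight file names depend on the `transformers` version used to save), you can list the repo contents. A minimal sketch, reusing the placeholder repo ID from this card:

```python
from huggingface_hub import list_repo_files

# Replace with your Hugging Face repo ID
files = list_repo_files("your-username/quantized_Llama-3.1-8B-Instruct")
print("\n".join(files))
```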
## Usage
To load and use the quantized model for inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Define the 4-bit quantization configuration
# (must match the settings used at quantization time)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/quantized_Llama-3.1-8B-Instruct",  # Replace with your Hugging Face repo ID
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/quantized_Llama-3.1-8B-Instruct")

# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Perform inference; max_new_tokens bounds the generated continuation
# regardless of prompt length
prompt = "Hello, how can I assist you today?"
output = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(output)
```
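Since this is an instruction-tuned chat model, prompts formatted with the tokenizer's chat template generally work better than raw strings. A minimal sketch, assuming the `model` and `tokenizer` loaded above and that the repo ships the standard Llama 3.1 chat template:

```python
# Build a chat-formatted prompt with the tokenizer's built-in template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain 4-bit quantization in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts replying
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```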
## Quantization Process
The model was quantized in Google Colab using the following script:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from huggingface_hub import login

# Log in to Hugging Face (required for gated access to the base model)
login()

# Define quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the base model; bitsandbytes quantizes the weights to NF4 on the fly
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Llama tokenizers ship without a pad token; fall back to the EOS token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Save the quantized model and tokenizer
quant_path = "/content/quantized_Llama-3.1-8B-Instruct"
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
```
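To keep the checkpoint beyond the Colab session, you can push it straight to the Hub from the same runtime. A minimal sketch; the repo ID below is a placeholder for your own:

```python
# Upload the quantized checkpoint to your own Hub repo
# (uses the write token from login() above)
model.push_to_hub("your-username/quantized_Llama-3.1-8B-Instruct")
tokenizer.push_to_hub("your-username/quantized_Llama-3.1-8B-Instruct")
```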
## Requirements
- **Hardware**: NVIDIA GPU with CUDA 11.4+ (e.g., T4, A100)
- **Python**: 3.10+
- **Dependencies**:
- `transformers==4.45.1`
- `bitsandbytes==0.43.3`
- `accelerate==0.33.0`
- `torch` (with CUDA support)
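A quick way to verify the environment matches the pinned versions before loading the model:

```python
# Quick environment check (versions are pinned above)
import torch, transformers, bitsandbytes, accelerate

print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("accelerate:", accelerate.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected; bitsandbytes 4-bit inference requires one.")
```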
## Notes
- The quantized model is stored in `/content/quantized_Llama-3.1-8B-Instruct` in the Colab environment.
- Due to Colab's ephemeral storage, consider pushing to the Hugging Face Hub (as sketched above) or saving to Google Drive (see the sketch after this list) for persistence.
- Access to the base model requires a Hugging Face token and approval from Meta AI.
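For the Google Drive option, a minimal Colab sketch (assumes a standard Colab runtime and the default `MyDrive` layout):

```python
# Mount Google Drive and copy the saved checkpoint over (Colab only)
from google.colab import drive
import shutil

drive.mount("/content/drive")
shutil.copytree(
    "/content/quantized_Llama-3.1-8B-Instruct",
    "/content/drive/MyDrive/quantized_Llama-3.1-8B-Instruct",
)
```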
## License
This model inherits the license of the base model `meta-llama/Llama-3.1-8B-Instruct`. Refer to the original model card: [Meta AI Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
## Acknowledgments
- Created using Hugging Face Transformers and `bitsandbytes` for quantization.
- Quantized in Google Colab with a T4 GPU on June 20, 2025.