---
license: llama3.1
tags:
- llama3.1
- quantization
- bitsandbytes
- nlp
- instruct
library_name: transformers
---

# 🚀 Quantized Llama-3.1-8B-Instruct Model

This is a 4-bit quantized version of the `meta-llama/Llama-3.1-8B-Instruct` model, optimized for efficient inference in resource-constrained environments such as Google Colab's NVIDIA T4 GPU.

## 🧠 Model Description

The model was quantized with the `bitsandbytes` library to reduce memory usage while maintaining performance on instruction-following tasks.

## 🧮 Quantization Details

- **Base Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Quantization Method**: 4-bit (NormalFloat4, NF4) with double quantization
- **Compute Dtype**: float16
- **Library**: `bitsandbytes==0.43.3`
- **Framework**: `transformers==4.45.1`
- **Hardware**: NVIDIA T4 GPU (16 GB VRAM) in Google Colab
- **Date**: Quantized on June 20, 2025
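
As a rough back-of-envelope check (an estimate, not a measured number): 8B parameters at 4 bits is about 8 × 10⁹ × 0.5 bytes ≈ 4 GB of weights, and double quantization additionally compresses the per-block quantization constants, so the model fits comfortably in a T4's 16 GB of VRAM alongside activations and the KV cache.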

## 📦 Files Included

- `README.md`: This file
- `config.json`, `model.safetensors` (or sharded checkpoints): Model configuration and weights (`transformers==4.45.1` saves safetensors checkpoints by default)
- `special_tokens_map.json`, `tokenizer.json`, `tokenizer_config.json`: Tokenizer files
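
To fetch these files without instantiating the model, you can use `huggingface_hub` directly; a minimal sketch (the repo ID is a placeholder to replace with the actual repository):

```python
from huggingface_hub import snapshot_download

# Download every file in the repo to the local cache (placeholder repo ID)
local_dir = snapshot_download(repo_id="your-username/quantized_Llama-3.1-8B-Instruct")
print(local_dir)  # Directory containing the files listed above
```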

## Usage

To load and use the quantized model for inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Define quantization configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/quantized_Llama-3.1-8B-Instruct",  # Replace with your Hugging Face repo ID
    quantization_config=quant_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/quantized_Llama-3.1-8B-Instruct")

# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Perform inference; max_new_tokens bounds the generated tokens
# (max_length would also count the prompt tokens)
prompt = "Hello, how can I assist you today?"
output = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(output)
```
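
Since this is an instruct-tuned model, prompts formatted with the tokenizer's chat template generally produce better responses than raw text. A minimal sketch, reusing `model` and `tokenizer` from above (the question is just an example):

```python
# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Explain 4-bit quantization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens
output_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```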

## Quantization Process

The model was quantized in Google Colab using the following script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from huggingface_hub import login

# Log in to Hugging Face (requires a token with access to the gated base model)
login()

# Define quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the base model; bitsandbytes quantizes the weights to NF4 on the fly
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Llama tokenizers ship without a pad token; fall back to EOS
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Save the quantized model
quant_path = "/content/quantized_Llama-3.1-8B-Instruct"
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
```
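
To verify the savings after loading, `transformers` models expose `get_memory_footprint()`; a quick check (the exact number varies with buffers and sharding):

```python
# Weight memory footprint of the loaded 4-bit model, in GB
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```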

## Requirements

- **Hardware**: NVIDIA GPU with CUDA 11.4+ (e.g., T4, A100)
- **Python**: 3.10+
- **Dependencies**:
  - `transformers==4.45.1`
  - `bitsandbytes==0.43.3`
  - `accelerate==0.33.0`
  - `torch` (with CUDA support)

## Notes

- The quantized model is stored in `/content/quantized_Llama-3.1-8B-Instruct` in the Colab environment.
- Because Colab storage is ephemeral, push the model to the Hugging Face Hub (see the sketch below) or save it to Google Drive for persistence.
- Access to the base model requires a Hugging Face token and approval from Meta AI.
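
A minimal sketch of the Hub upload mentioned above (the repo ID is a placeholder; `push_to_hub` reuses the write-scoped token from the `login()` call):

```python
# Create the repo if needed and upload the quantized weights and tokenizer (placeholder ID)
model.push_to_hub("your-username/quantized_Llama-3.1-8B-Instruct")
tokenizer.push_to_hub("your-username/quantized_Llama-3.1-8B-Instruct")
```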

## License

This model inherits the license of the base model `meta-llama/Llama-3.1-8B-Instruct`. Refer to the original model card: [Meta AI Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

## Acknowledgments

- Created using Hugging Face Transformers and `bitsandbytes` for quantization.
- Quantized in Google Colab with a T4 GPU on June 20, 2025.