---
license: apache-2.0
tags:
- awq
- quantization
- 4bit
- llm
- llama
library_name: transformers
---

# Llama-3.1-8B-Instruct – AWQ 4-bit

This repository contains a **4-bit AWQ-quantized version** of **Llama-3.1-8B-Instruct**.
The model is optimized for **lower memory usage and faster inference** with minimal quality loss.

---

## 🔹 Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Precision:** 4-bit
- **Framework:** PyTorch
- **Quantized Using:** LLM Compressor
- **Intended Use:** Text generation, chat, instruction following

---

## 🔹 Why AWQ?

AWQ reduces model size and VRAM usage by:
- Quantizing weights to 4-bit
- Preserving the weight channels that matter most for activations
- Maintaining better accuracy than naive round-to-nearest quantization

---

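The rough arithmetic behind these savings, as a back-of-the-envelope estimate (real checkpoints add overhead for per-group scales, embeddings, and any layers kept in higher precision):

```python
# Rough weight-memory estimate for an 8B-parameter model.
# This ignores quantization scales/zero-points and layers that
# AWQ typically leaves in higher precision, so treat it as a floor.
params = 8e9

fp16_gb = params * 2 / 1024**3    # FP16: 2 bytes per weight
awq4_gb = params * 0.5 / 1024**3  # 4-bit: 0.5 bytes per weight

print(f"FP16 weights: ~{fp16_gb:.1f} GB")
print(f"AWQ 4-bit weights: ~{awq4_gb:.1f} GB")
```

Weights alone drop from roughly 15 GB to under 4 GB, which is what leaves headroom for activations and the KV cache on an 8–10 GB GPU.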
## 🔹 Hardware Requirements

| Type | Requirement |
|------|-------------|
| GPU  | 8–10 GB VRAM (recommended) |
| CPU  | Supported (slower) |
| RAM  | 16 GB or more |

---

## 🔹 How to Load the Model

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "your-username/your-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # spread layers across available GPUs/CPU
    torch_dtype=torch.float16,  # half precision for non-quantized parts
)

prompt = "Explain transformers in simple words"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
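
Since Llama-3.1-8B-Instruct is a chat model, prompts generally work better wrapped in its chat template than passed as raw text; `tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")` does this for you. As an illustration only, a hand-built version of the prompt (assuming the standard Llama 3.1 header/`<|eot_id|>` layout) looks like this:

```python
# Hand-built Llama 3.1 chat prompt, for illustration only.
# In practice, prefer tokenizer.apply_chat_template(messages,
# add_generation_prompt=True, return_tensors="pt").
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain transformers in simple words"},
]

def llama3_prompt(messages):
    # Assumed Llama 3.1 layout: each turn is wrapped in header tokens
    # and terminated by <|eot_id|>; a trailing assistant header asks
    # the model to generate the reply.
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(llama3_prompt(messages))
```

Feeding the templated string (or the tensors returned by `apply_chat_template`) to `model.generate` keeps the model in the conversational format it was instruction-tuned on.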