---
license: apache-2.0
tags:
- awq
- quantization
- 4bit
- llm
- llama
library_name: transformers
---

# Llama-3.1-8B-Instruct – AWQ 4-bit

This repository contains a **4-bit AWQ quantized version** of **Llama-3.1-8B-Instruct**.
The model is optimized for **lower memory usage and faster inference** with minimal quality loss.

---

## 🔹 Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Precision:** 4-bit
- **Framework:** PyTorch
- **Quantized Using:** LLM Compressor
- **Intended Use:** Text generation, chat, instruction following

---

## 🔹 Why AWQ?

AWQ reduces model size and VRAM usage by:
- Quantizing weights to 4-bit
- Preserving important activation ranges
- Maintaining better accuracy compared to naive quantization

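For intuition, the toy PyTorch sketch below shows plain group-wise 4-bit weight quantization with one scale per group, which is the storage trick behind the memory savings. It is an illustration only (the function names and group size are made up for the example); AWQ additionally chooses its scales from activation statistics, and the actual quantization of this checkpoint was done with LLM Compressor, as noted above.

```python
import torch

def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 128):
    """Toy symmetric 4-bit group-wise quantization (illustration, not AWQ itself)."""
    w_groups = w.reshape(-1, group_size)                    # split the weights into groups
    scales = (w_groups.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)  # one scale per group (int4 range: -8..7)
    q = torch.clamp(torch.round(w_groups / scales), -8, 7)  # 4-bit integer codes
    return q.to(torch.int8), scales                         # int8 only holds the codes; real kernels pack two per byte

def dequantize(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.float() * scales).reshape(shape)              # approximate reconstruction of the weights

w = torch.randn(4096, 4096)                                 # a stand-in weight matrix
q, s = quantize_4bit_groupwise(w)
print("mean reconstruction error:", (w - dequantize(q, s, w.shape)).abs().mean().item())
```
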
---

## 🔹 Hardware Requirements

| Type | Requirement |
|------|-------------|
| GPU  | 8–10 GB VRAM (recommended) |
| CPU  | Supported (slower) |
| RAM  | 16 GB or more |

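As a rough sanity check on these numbers (back-of-the-envelope only; the overhead figures below are assumptions, and real usage also depends on context length and batch size): 8 billion parameters at 4 bits per weight is about 4 GB of weight storage, with the remaining budget going to quantization scales, activations, and the KV cache.

```python
# Illustrative VRAM estimate for an 8B-parameter model quantized to 4-bit.
# All overhead figures are assumptions for the sake of the example.
params = 8e9
weights_gb = params * 4 / 8 / 1e9    # 4 bits per weight -> ~4 GB
scales_gb = 0.1 * weights_gb         # per-group scales/zero-points (assumed ~10%)
runtime_gb = 2.0                     # activations + KV cache headroom (assumed)
print(f"~{weights_gb + scales_gb + runtime_gb:.1f} GB")  # ~6.4 GB, hence the 8–10 GB recommendation
```
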
---

## 🔹 How to Load the Model

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "your-username/your-model"  # replace with this repository's id on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # place layers on the available GPU(s)/CPU automatically
    torch_dtype=torch.float16
)

prompt = "Explain transformers in simple words"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

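### Chat-style prompting

Since this is an instruct model, prompts generally behave better when wrapped in the tokenizer's chat template. A minimal sketch, reusing the `model` and `tokenizer` loaded above (the system prompt and sampling settings are only examples):

```python
# Format a conversation with the tokenizer's built-in chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain transformers in simple words."},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant header so the model answers next
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```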