---
license: apache-2.0
tags:
- awq
- quantization
- 4bit
- llm
- llama
library_name: transformers
---
# Llama-3.1-8B-Instruct – AWQ 4-bit
This repository contains a **4-bit AWQ quantized version** of **Llama-3.1-8B-Instruct**.
The model is optimized for **lower memory usage and faster inference** with minimal quality loss.
---
## 🔹 Model Details
- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Precision:** 4-bit
- **Framework:** PyTorch
- **Quantized Using:** LLM Compressor
- **Intended Use:** Text generation, chat, instruction following
---
## 🔹 Why AWQ?
AWQ reduces model size and VRAM usage by:
- Quantizing weights to 4-bit
- Preserving important activation ranges
- Maintaining better accuracy compared to naive quantization
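The memory savings above follow directly from the bit width. A rough back-of-the-envelope estimate for an 8B-parameter model (parameter count and per-weight cost are approximations; real checkpoints also carry embeddings, quantization scales, and activation overhead):

```python
def approx_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

n_params = 8e9  # Llama-3.1-8B, approximate

# fp16 stores 2 bytes per weight; AWQ 4-bit stores half a byte.
fp16_gib = approx_weight_gib(n_params, 16)  # ~14.9 GiB
awq4_gib = approx_weight_gib(n_params, 4)   # ~3.7 GiB

print(f"fp16: {fp16_gib:.1f} GiB, AWQ 4-bit: {awq4_gib:.1f} GiB")
```

This is why the quantized model fits comfortably in the 8–10 GB VRAM budget listed below, while the fp16 original does not.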
---
## 🔹 Hardware Requirements
| Type | Requirement |
|-----|------------|
| GPU | 8–10 GB VRAM (recommended) |
| CPU | Supported (slower) |
| RAM | 16 GB or more |
---
## 🔹 How to Load the Model
### Using Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Replace with this repository's model id.
# Loading AWQ checkpoints through Transformers requires the
# `autoawq` package to be installed.
model_id = "your-username/your-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # spread layers across available devices
    torch_dtype=torch.float16,  # compute dtype for the dequantized ops
)

prompt = "Explain transformers in simple words"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
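### Using vLLM

AWQ checkpoints can typically also be served with vLLM, which reads the quantization config from the checkpoint. The snippet below is a minimal sketch, not tested against this specific repository: the model id is a placeholder, and a CUDA-capable GPU is assumed.

```python
from vllm import LLM, SamplingParams

# Placeholder id; vLLM usually detects AWQ automatically, but it can
# also be requested explicitly with quantization="awq".
llm = LLM(model="your-username/your-model", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain transformers in simple words"], params)
print(outputs[0].outputs[0].text)
```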