---
license: apache-2.0
tags:
- awq
- quantization
- 4bit
- llm
- llama
library_name: transformers
---

# Llama-3.1-8B-Instruct – AWQ 4-bit

This repository contains a **4-bit AWQ quantized version** of **Llama-3.1-8B-Instruct**. The model is optimized for **lower memory usage and faster inference** with minimal quality loss.

---

## 🔹 Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Precision:** 4-bit
- **Framework:** PyTorch
- **Quantized Using:** LLM Compressor
- **Intended Use:** Text generation, chat, instruction following

---

## 🔹 Why AWQ?

AWQ reduces model size and VRAM usage by:

- Quantizing weights to 4-bit
- Protecting the weight channels that align with the most important activation ranges
- Maintaining better accuracy than naive round-to-nearest quantization

---

## 🔹 Hardware Requirements

| Type | Requirement |
|------|-------------|
| GPU  | 8–10 GB VRAM (recommended) |
| CPU  | Supported (slower) |
| RAM  | 16 GB or more |

---

## 🔹 How to Load the Model

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "your-username/your-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = "Explain transformers in simple words"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
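
---

## 🔹 How 4-bit Group Quantization Works (Sketch)

To make the idea behind the quantization concrete, here is a minimal, illustrative sketch of group-wise 4-bit weight quantization in pure Python. This is **not** the full AWQ algorithm (AWQ additionally rescales channels based on activation statistics before quantizing); it only shows the basic round-trip of mapping each group of weights onto a signed 4-bit grid with a per-group scale. The group size of 128 is a common default, but an assumption here.

```python
import random

def quant_dequant_4bit(weights, group_size=128):
    """Illustrative group-wise 4-bit quantize/dequantize round-trip.

    Not the real AWQ algorithm: AWQ also applies activation-aware
    per-channel scaling before this step.
    """
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Per-group scale maps the largest magnitude onto the
        # signed 4-bit integer range [-8, 7].
        scale = max(abs(w) for w in group) / 7.0
        for w in group:
            q = max(-8, min(7, round(w / scale)))  # quantize to 4-bit int
            out.append(q * scale)                  # dequantize back to float
    return out

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]
recon = quant_dequant_4bit(weights)
err = sum(abs(a - b) for a, b in zip(weights, recon)) / len(weights)
print(f"mean abs reconstruction error: {err:.4f}")
```

The per-group scale is what keeps the error small: a single outlier weight only degrades resolution within its own group of 128 values, not across the whole tensor. AWQ's activation-aware scaling further reduces error on exactly the channels that matter most for model outputs.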
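
---

## 🔹 Estimating the Memory Savings

The VRAM figures in the hardware table follow from simple arithmetic on the weight storage. The sketch below estimates weight memory for fp16 versus 4-bit; the parameter count is an approximation for Llama-3.1-8B, and the calculation deliberately ignores activation memory, the KV cache, and the small overhead of quantization scales/zero-points, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params, bits):
    # bits per parameter -> bytes -> GiB
    return n_params * bits / 8 / 1024**3

params = 8_030_000_000  # approximate parameter count of Llama-3.1-8B
fp16_gb = weight_memory_gb(params, 16)
awq_gb = weight_memory_gb(params, 4)
print(f"fp16 weights: {fp16_gb:.1f} GiB, AWQ 4-bit weights: {awq_gb:.1f} GiB")
```

Weights alone drop from roughly 15 GiB to under 4 GiB, which is why the quantized model fits comfortably in the 8–10 GB VRAM recommended above while leaving headroom for activations and the KV cache.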