---
license: apache-2.0
tags:
- awq
- quantization
- 4bit
- llm
- llama
library_name: transformers
---

# Llama-3.1-8B-Instruct – AWQ 4-bit

This repository contains a **4-bit AWQ quantized version** of **Llama-3.1-8B-Instruct**. The model is optimized for **lower memory usage and faster inference** with minimal quality loss.

---

## 🔹 Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Precision:** 4-bit
- **Framework:** PyTorch
- **Quantized Using:** LLM Compressor
- **Intended Use:** Text generation, chat, instruction following

---

## 🔹 Why AWQ?

AWQ reduces model size and VRAM usage by:

- Quantizing weights to 4-bit
- Protecting the weight channels that align with the most important activation ranges
- Maintaining better accuracy than naive round-to-nearest quantization

---

## 🔹 Hardware Requirements

| Type | Requirement |
|------|-------------|
| GPU  | 8–10 GB VRAM (recommended) |
| CPU  | Supported (slower) |
| RAM  | 16 GB or more |

---

## 🔹 How to Load the Model

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "your-username/your-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = "Explain transformers in simple words"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
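
---

## 🔹 How 4-bit Group Quantization Works (Sketch)

To make the idea behind the quantization concrete, here is a minimal, illustrative sketch of group-wise 4-bit weight quantization in pure Python. This is **not** the full AWQ algorithm (AWQ additionally rescales channels based on activation statistics before quantizing); it only shows the basic round-trip of mapping each group of weights onto a signed 4-bit grid with a per-group scale. The group size of 128 is a common default, but an assumption here.

```python
import random

def quant_dequant_4bit(weights, group_size=128):
    """Illustrative group-wise 4-bit quantize/dequantize round-trip.

    Not the real AWQ algorithm: AWQ also applies activation-aware
    per-channel scaling before this step.
    """
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Per-group scale maps the largest magnitude onto the
        # signed 4-bit integer range [-8, 7].
        scale = max(abs(w) for w in group) / 7.0
        for w in group:
            q = max(-8, min(7, round(w / scale)))  # quantize to 4-bit int
            out.append(q * scale)                  # dequantize back to float
    return out

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]
recon = quant_dequant_4bit(weights)
err = sum(abs(a - b) for a, b in zip(weights, recon)) / len(weights)
print(f"mean abs reconstruction error: {err:.4f}")
```

The per-group scale is what keeps the error small: a single outlier weight only degrades resolution within its own group of 128 values, not across the whole tensor. AWQ's activation-aware scaling further reduces error on exactly the channels that matter most for model outputs.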
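
---

## 🔹 Estimating the Memory Savings

The VRAM figures in the hardware table follow from simple arithmetic on the weight storage. The sketch below estimates weight memory for fp16 versus 4-bit; the parameter count is an approximation for Llama-3.1-8B, and the calculation deliberately ignores activation memory, the KV cache, and the small overhead of quantization scales/zero-points, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params, bits):
    # bits per parameter -> bytes -> GiB
    return n_params * bits / 8 / 1024**3

params = 8_030_000_000  # approximate parameter count of Llama-3.1-8B
fp16_gb = weight_memory_gb(params, 16)
awq_gb = weight_memory_gb(params, 4)
print(f"fp16 weights: {fp16_gb:.1f} GiB, AWQ 4-bit weights: {awq_gb:.1f} GiB")
```

Weights alone drop from roughly 15 GiB to under 4 GiB, which is why the quantized model fits comfortably in the 8–10 GB VRAM recommended above while leaving headroom for activations and the KV cache.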