---
license: apache-2.0
tags:
- awq
- quantization
- 4bit
- llm
- llama
library_name: transformers
---

# Llama-3.1-8B-Instruct – AWQ 4-bit

This repository contains a **4-bit AWQ-quantized version** of **Llama-3.1-8B-Instruct**.
The model is optimized for **lower memory usage and faster inference** with minimal quality loss.

---

## 🔹 Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Precision:** 4-bit
- **Framework:** PyTorch
- **Quantized Using:** LLM Compressor
- **Intended Use:** Text generation, chat, instruction following

---

## 🔹 Why AWQ?

AWQ reduces model size and VRAM usage while limiting accuracy loss:
- Weights are quantized to 4-bit
- Weight channels that see the largest activations are protected through per-channel scaling
- Accuracy holds up better than with naive round-to-nearest quantization
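
The idea behind these points can be illustrated with a small NumPy sketch. It is a toy example, not the code used to produce this checkpoint: the layer shapes, the group size of 32, the random data, and the helper names (`quantize_dequantize`, `search_activation_aware_scale`) are all assumptions made for illustration. It compares plain round-to-nearest 4-bit quantization with a simplified activation-aware variant that scales each input channel by its mean activation magnitude raised to a power `alpha`, chosen by grid search. Because `alpha = 0` reproduces plain round-to-nearest, the searched result can never be worse than the baseline on the calibration data.

```python
import numpy as np

def quantize_dequantize(w, n_bits=4, group_size=32):
    """Round-to-nearest asymmetric quantization, one step/offset per group."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2 ** n_bits - 1                      # 15 steps for 4-bit
    step = np.maximum(w_max - w_min, 1e-8) / levels
    codes = np.clip(np.round((groups - w_min) / step), 0, levels)
    return (codes * step + w_min).reshape(w.shape)

def output_mse(w, w_hat, x):
    """Mean squared error of the layer output x @ W^T after quantization."""
    return float(np.mean((x @ w.T - x @ w_hat.T) ** 2))

def search_activation_aware_scale(w, x, n_grid=11):
    """Grid-search alpha in [0, 1]; alpha = 0 is plain round-to-nearest."""
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-8)  # per input channel
    best_err, best_alpha = float("inf"), 0.0
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())        # keep scales centred around 1
        # Scale weights up before quantizing, then fold 1/s back in.
        w_hat = quantize_dequantize(w * s) / s
        err = output_mse(w, w_hat, x)
        if err < best_err:
            best_err, best_alpha = err, float(alpha)
    return best_err, best_alpha

# Toy calibration data (assumption): a few input channels carry large activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :4] *= 20.0
w = rng.normal(size=(128, 64))

rtn_err = output_mse(w, quantize_dequantize(w), x)
awq_err, best_alpha = search_activation_aware_scale(w, x)
print(f"round-to-nearest MSE: {rtn_err:.4f}")
print(f"activation-aware MSE: {awq_err:.4f} (alpha={best_alpha:.1f})")
```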

---

## 🔹 Hardware Requirements

| Type | Requirement |
|------|-------------|
| GPU | 8–10 GB VRAM (recommended) |
| CPU | Supported (slower) |
| RAM | 16 GB or more |
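
A rough back-of-the-envelope estimate helps put the GPU figure in context. The sketch below rests on assumptions that are not stated in this card: a rounded count of 8 billion parameters, a quantization group size of 128 with one 16-bit scale and one 4-bit zero point per group, and the attention layout of the base model as published in its config (32 layers, 8 key/value heads, head dimension 128; check `config.json` before relying on these). Treat the weight figure as a lower bound, since layers such as the embeddings are often kept in 16-bit, and the framework adds its own runtime overhead.

```python
GIB = 1024 ** 3

def weight_bytes(n_params, bits_per_weight):
    """Storage needed for the weights alone."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """KV cache for one sequence; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

n_params = 8e9  # assumption: rounded parameter count

fp16_weights = weight_bytes(n_params, 16)
# Assumption: group size 128, fp16 scale + 4-bit zero point per group.
int4_weights = weight_bytes(n_params, 4 + (16 + 4) / 128)
kv_8k = kv_cache_bytes(8192)

print(f"fp16 weights : {fp16_weights / GIB:.1f} GiB")
print(f"4-bit weights: {int4_weights / GIB:.1f} GiB")
print(f"KV cache, 8192 tokens (fp16): {kv_8k / GIB:.1f} GiB")
```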

---

## 🔹 How to Load the Model

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Placeholder: replace with this repository's ID on the Hub
model_id = "your-username/your-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place layers on the available GPU(s) or CPU
    torch_dtype=torch.float16,  # use half precision where applicable
)

prompt = "Explain transformers in simple words"
# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 200 new tokens and print the decoded text
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))