logiya-vidhyapathi/llama_quantization_4_bit

	---
	license: apache-2.0
	tags:
	- awq
	- quantization
	- 4bit
	- llm
	- llama
	library_name: transformers
	---

	# Llama-3.1-8B-Instruct – AWQ 4-bit

	This repository contains a 4-bit AWQ-quantized version of Llama-3.1-8B-Instruct.
	Quantization reduces memory usage and is intended to speed up inference while keeping quality loss small.

	---

	## 🔹 Model Details

	- Base Model: meta-llama/Llama-3.1-8B-Instruct
	- Quantization Method: AWQ (Activation-aware Weight Quantization)
	- Precision: 4-bit weights (AWQ is weight-only; activations stay in 16-bit)
	- Framework: PyTorch
	- Quantized Using: LLM Compressor
	- Intended Use: Text generation, chat, instruction following

	---

	## 🔹 Why AWQ?

	AWQ reduces model size and VRAM usage by:
	- Quantizing weights to 4-bit
	- Using activation statistics to find the most important weight channels and scaling them so they lose less precision
	- Maintaining better accuracy than naive round-to-nearest quantization
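
The idea in the list above can be sketched in a few lines of plain Python. This is only an illustration, not the code LLM Compressor runs: the function names, the example numbers, and the fixed `alpha=0.5` are assumptions made here, and real AWQ searches for the scaling exponent on calibration data and operates on whole weight matrices.

```python
def quantize_group_4bit(values):
    """Asymmetric round-to-nearest 4-bit quantization of one weight group.

    Returns the dequantized values, i.e. what the model effectively uses.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 4 bits give 16 levels (codes 0..15)
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]


def awq_style_quantize(weights, act_magnitude, alpha=0.5):
    """Scale each input channel by its activation magnitude before quantizing.

    weights:       one weight per input channel (a single output row)
    act_magnitude: average |activation| per input channel (hypothetical
                   calibration statistics)
    alpha:         scaling exponent; fixed here, searched in real AWQ
    """
    s = [max(m, 1e-8) ** alpha for m in act_magnitude]
    scaled = [w * si for w, si in zip(weights, s)]
    dequantized = quantize_group_4bit(scaled)
    # Undo the scaling; at inference the 1/s factor is applied to the activations.
    return [d / si for d, si in zip(dequantized, s)]


weights = [0.12, -0.40, 0.05, 0.33]   # illustrative values
act_magnitude = [8.0, 0.5, 0.5, 0.5]  # channel 0 sees much larger activations

plain = quantize_group_4bit(weights)
awq = awq_style_quantize(weights, act_magnitude)
print("plain:", plain)
print("awq  :", awq)
```

With `alpha=0.0` every scale is 1, so `awq_style_quantize` reduces to plain round-to-nearest quantization.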

	---

	## 🔹 Hardware Requirements

	| Type | Requirement |
	|------|-------------|
	| GPU | 8–10 GB VRAM (recommended) |
	| CPU | Supported (slower) |
	| RAM | 16 GB or more |
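
As a rough sanity check on the GPU figure, the weight storage can be estimated with simple arithmetic. The sketch below rests on assumptions that are not stated in this card: about 8 billion parameters, a group size of 128, one 16-bit scale per group, and every parameter quantized. The helper name `estimate_weight_gib` is made up for this example.

```python
def estimate_weight_gib(n_params, weight_bits, group_size=None, scale_bits=16):
    """Approximate weight storage in GiB.

    With group quantization, each group of `group_size` weights shares one
    scale of `scale_bits` bits (zero-points are ignored for simplicity).
    """
    bits_per_weight = weight_bits
    if group_size:
        bits_per_weight += scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1024**3


N_PARAMS = 8.0e9  # assumption: roughly 8 billion parameters

fp16 = estimate_weight_gib(N_PARAMS, 16)
int4 = estimate_weight_gib(N_PARAMS, 4, group_size=128)
print(f"FP16 weights:  ~{fp16:.1f} GiB")
print(f"4-bit weights: ~{int4:.1f} GiB")
```

Under these assumptions the weights shrink from about 14.9 GiB to about 3.8 GiB. Real usage is higher, because some layers (for example the embeddings and output head) are often left in 16-bit, and the KV cache, activations, and runtime overhead also need memory, which is consistent with the 8–10 GB recommendation.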

	---

	## 🔹 How to Load the Model

	### Using Transformers

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	# Repository ID of this model on the Hugging Face Hub
	model_id = "logiya-vidhyapathi/llama_quantization_4_bit"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    device_map="auto",          # needs the accelerate package; places layers on available devices
	    torch_dtype=torch.float16,  # dtype for the non-quantized tensors
	)

	prompt = "Explain transformers in simple words"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	# Generate up to 200 new tokens and print the decoded text
	outputs = model.generate(**inputs, max_new_tokens=200)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```