---
license: mit
language:
- en
base_model:
- microsoft/phi-4
tags:
- 4bit
- transformers
- autoawq
- vllm
- 12gb-vram
---

# Microsoft Phi-4 4-bit AWQ Quantized Model (GEMM)

This is a **4-bit AutoAWQ quantized version** of [Microsoft's Phi-4](https://huggingface.co/microsoft/phi-4).
It is optimized for **fast inference** using **vLLM** with minimal loss in accuracy.

---

## Model Details

- **Base Model:** [microsoft/phi-4](https://huggingface.co/microsoft/phi-4)
- **Quantization:** **4-bit AWQ**
- **Quantization Method:** **AutoAWQ (Activation-aware Weight Quantization)**
- **Group Size:** 128
- **AWQ Version:** GEMM Optimized
- **Intended Use:** **Low-VRAM inference on consumer GPUs**
- **VRAM Requirements:** **8GB+ (recommended)**
- **Compatibility:** **vLLM, Hugging Face Transformers (with AWQ support)**; see the loading sketch after this list
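
Recent Transformers releases can load AWQ checkpoints through the standard `AutoModelForCausalLM` API once the `autoawq` package is installed. A minimal sketch, assuming a recent Transformers version and a CUDA GPU:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Transformers reads the AWQ quantization_config stored in the checkpoint
# and dispatches to the AWQ kernels (requires `pip install autoawq`).
model_id = "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```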

---

## How to Use in vLLM

You can load this model directly in **vLLM** for efficient inference:

```bash
vllm serve "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"
```

Then, test it with `curl`. Note that `vllm serve` exposes an OpenAI-compatible API, so completion requests go to `/v1/completions` and must name the model:

```bash
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM", "prompt": "Explain quantum mechanics in simple terms.", "max_tokens": 100}'
```
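
Because the server speaks the OpenAI API, the official `openai` Python client also works against it. A minimal sketch, assuming the server from the command above is running locally on the default port:

```python
from openai import OpenAI

# vLLM does not check the API key by default, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM",
    prompt="Explain quantum mechanics in simple terms.",
    max_tokens=100,
)
print(completion.choices[0].text)
```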

---

## How to Use in Python (`transformers` + AWQ)

To use this model with **AutoAWQ** and **Hugging Face Transformers**:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"

# Use from_quantized to load an already-quantized AWQ checkpoint;
# from_pretrained expects unquantized weights.
model = AutoAWQForCausalLM.from_quantized(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Move inputs to the GPU the quantized model was loaded on.
inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
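
Phi-4 is an instruction-tuned chat model, so prompts generally behave better when passed through its chat template. A short sketch building on the snippet above, assuming the model is already loaded on a CUDA device:

```python
messages = [{"role": "user", "content": "What is the meaning of life?"}]

# apply_chat_template wraps the message in Phi-4's chat format and appends
# the assistant header so generation starts in the right place.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```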

---

## Quantization Details

This model was quantized using **AutoAWQ** with the following parameters:

- **Bits:** 4 (`w_bit=4`)
- **Zero-Point Quantization:** Enabled (`zero_point=True`)
- **Group Size:** 128 (`q_group_size=128`)
- **Quantization Version:** `GEMM` (`version="GEMM"`)
- **Method Used:** [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
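
These parameters map directly onto the AutoAWQ API. A sketch of how such a quantization run might look (the calibration dataset and other settings are not recorded here, so AutoAWQ defaults are assumed, and the output path is a placeholder):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision base model, quantize it, and save the result.
model = AutoAWQForCausalLM.from_pretrained("microsoft/phi-4")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("phi-4-awq-4bit-gemm")  # placeholder output directory
tokenizer.save_pretrained("phi-4-awq-4bit-gemm")
```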

---

## VRAM Requirements

| Model Size | **FP16 (No Quant)** | **AWQ 4-bit Quantized** |
|------------|---------------------|-------------------------|
| **Phi-4 14B** | Requires **>20GB VRAM** | **8GB-12GB VRAM** |

AWQ significantly **reduces VRAM requirements**, making it **possible to run 14B models on consumer GPUs**.
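
These figures follow from simple arithmetic on weight storage alone; KV cache, activations, and the AWQ scales/zeros add overhead on top, which is why the table's range runs higher:

```python
# Back-of-envelope weight memory for a 14B-parameter model.
params = 14e9
fp16_gb = params * 2 / 1024**3    # 2 bytes per parameter -> ~26 GB
awq4_gb = params * 0.5 / 1024**3  # 4 bits per parameter  -> ~6.5 GB
print(f"FP16 weights: ~{fp16_gb:.1f} GB, AWQ 4-bit weights: ~{awq4_gb:.1f} GB")
```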

---

## License & Credits

- **Base Model:** [Microsoft Phi-4](https://huggingface.co/microsoft/phi-4)
- **Quantized by:** [curiousmind147](https://huggingface.co/curiousmind147)
- **License:** MIT (same as the base model)
- **Credits:** This model is based on Microsoft's Phi-4 and was optimized using AutoAWQ.

---

## Acknowledgments

Special thanks to:

- **Microsoft** for creating [Phi-4](https://huggingface.co/microsoft/phi-4).
- **Casper Hansen** for developing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
- **The vLLM team** for making fast inference possible.

---

## Enjoy Efficient Phi-4 Inference!

If you find this useful, **give it a ⭐ on Hugging Face!**