# Microsoft Phi-4 4-bit AWQ Quantized Model (GEMM)

This is a **4-bit AutoAWQ quantized version** of [Microsoft's Phi-4](https://huggingface.co/microsoft/phi-4).
It is optimized for **fast inference** with **vLLM**, with minimal loss in accuracy.

---

## 🚀 Model Details

- **Base Model:** [microsoft/phi-4](https://huggingface.co/microsoft/phi-4)
- **Quantization:** **4-bit AWQ**
- **Quantization Method:** **AutoAWQ (Activation-aware Weight Quantization)**
- **Group Size:** 128
- **AWQ Version:** GEMM optimized
- **Intended Use:** **low-VRAM inference on consumer GPUs**
- **VRAM Requirements:** ✅ **8 GB+ recommended**
- **Compatibility:** ✅ **vLLM, Hugging Face Transformers (with AWQ support)**

---

## 📌 How to Use in vLLM

You can load this model directly in **vLLM** for efficient inference:

```bash
vllm serve "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"
```

Then test it with `curl`. Note that `vllm serve` exposes an OpenAI-compatible API, so requests go to `/v1/completions` and must include the model name:

```bash
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM", "prompt": "Explain quantum mechanics in simple terms.", "max_tokens": 100}'
```
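The same endpoint can be called from Python. A minimal standard-library sketch, assuming the `vllm serve` command above is running locally (vLLM's OpenAI-compatible server listens on port 8000 by default); the helper names here are illustrative:

```python
import json
import urllib.request

MODEL = "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"

def build_completion_request(prompt: str, max_tokens: int = 100) -> dict:
    """Assemble the JSON payload for the OpenAI-compatible /v1/completions route."""
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt: str, url: str = "http://localhost:8000/v1/completions") -> str:
    """POST the payload and return the first completion's text."""
    data = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["text"]

# complete("Explain quantum mechanics in simple terms.")  # needs the server running
```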

---

## 📌 How to Use in Python (`transformers` + AWQ)

To use this model with **Hugging Face Transformers** and the AutoAWQ runtime:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "curiousmind147/microsoft-phi-4-AWQ-4bit-GEMM"

# Load with from_quantized: from_pretrained expects an unquantized checkpoint
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Move the input ids to the GPU the quantized model runs on
tokens = tokenizer("What is the meaning of life?", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

---

## 📌 Quantization Details

This model was quantized using **AutoAWQ** with the following parameters:

- **Bits:** 4-bit quantization (`w_bit=4`)
- **Zero-Point Quantization:** enabled (`zero_point=True`)
- **Group Size:** 128 (`q_group_size=128`)
- **Quantization Version:** `GEMM`
- **Method Used:** [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)

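For reference, the sketch below shows how a checkpoint with these parameters is produced. The exact script used for this repo is not published; this follows AutoAWQ's standard quantization flow, and the output path is illustrative:

```python
# Parameters matching the list above
quant_config = {
    "w_bit": 4,           # 4-bit weights
    "q_group_size": 128,  # one scale/zero-point per 128 weights
    "zero_point": True,   # asymmetric (zero-point) quantization
    "version": "GEMM",    # GEMM kernel variant
}

def quantize_phi4(base_path: str = "microsoft/phi-4",
                  out_path: str = "phi-4-awq-4bit-gemm") -> None:
    """Run AutoAWQ on the base model (needs a GPU and the autoawq package)."""
    from awq import AutoAWQForCausalLM      # imported lazily: heavy dependencies
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(base_path)
    tokenizer = AutoTokenizer.from_pretrained(base_path)
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(out_path)
    tokenizer.save_pretrained(out_path)

# quantize_phi4()  # uncomment to reproduce; downloads the full FP16 checkpoint
```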
---

## 📌 VRAM Requirements

| Model | FP16 (no quantization) | AWQ 4-bit quantized |
|---------------|------------------------------------|---------------------|
| **Phi-4 14B** | ❌ **~28 GB VRAM** (weights alone) | ✅ **8-12 GB VRAM** |

AWQ significantly **reduces VRAM requirements**, making it possible to run 14B models on consumer GPUs. 🚀
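The table's numbers follow from simple arithmetic: 14B parameters at 16 bits each is ~28 GB for weights alone, and 4-bit packing cuts that to roughly a quarter, before adding quantization metadata, KV cache, and runtime overhead. A back-of-the-envelope estimator:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB; ignores KV cache, activations,
    and quantization metadata such as scales and zero-points."""
    return n_params * bits_per_weight / 8 / 1e9

PHI4_PARAMS = 14e9  # Phi-4 has ~14B parameters

print(f"FP16:      ~{weight_memory_gb(PHI4_PARAMS, 16):.0f} GB")  # ~28 GB
print(f"AWQ 4-bit: ~{weight_memory_gb(PHI4_PARAMS, 4):.0f} GB")   # ~7 GB
```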

---

## 📌 License & Credits

- **Base Model:** [Microsoft Phi-4](https://huggingface.co/microsoft/phi-4)
- **Quantized by:** [curiousmind147](https://huggingface.co/curiousmind147)
- **License:** same as the base model (Microsoft)
- **Credits:** this model is based on Microsoft's Phi-4 and was quantized with AutoAWQ.

---

## 📌 Acknowledgments

Special thanks to:

- **Microsoft** for creating [Phi-4](https://huggingface.co/microsoft/phi-4).
- **Casper Hansen** for developing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
- **The vLLM team** for making fast inference possible.

---

## 🚀 Enjoy Efficient Phi-4 Inference!

If you find this model useful, **give it a ⭐ on Hugging Face!** 🎯