---
library_name: transformers
license: mit
base_model:
- microsoft/phi-2
pipeline_tag: text-generation
---

# 🧠 Phi-2 (4-bit Quantized with AutoRound)

This is a 4-bit quantized version of the [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) model using Intel's [AutoRound](https://github.com/intel/auto-round) for weight-only post-training quantization (W4G128). It achieves significant compression while preserving model performance, making it ideal for resource-constrained inference.

---

## 🧾 Model Details

* **Base model:** microsoft/phi-2
* **Quantization method:** AutoRound (W4G128 - 4-bit, group size 128)
* **Framework:** 🤗 Transformers
* **Precision:** 4-bit weights
* **Quantized size:** ~1.85 GB (original: ~5.5 GB)
* **Compression ratio:** ~66% size reduction (roughly 3× smaller)
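
The compression figure follows directly from the two checkpoint sizes listed above; a quick sanity check (sizes are the approximate values from this card):

```python
original_gb = 5.5    # approximate FP16 checkpoint size from this card
quantized_gb = 1.85  # approximate 4-bit checkpoint size from this card

reduction = 1 - quantized_gb / original_gb
print(f"size reduction: {reduction:.0%}")  # roughly 66%
```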

---

## 🚀 How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Note: loading a 4-bit AutoRound checkpoint may require the `auto-round`
# (or `auto-gptq`) package alongside transformers, depending on the export format.
model = AutoModelForCausalLM.from_pretrained(
    "itachi023/phi-2-4-bit-quantized",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("itachi023/phi-2-4-bit-quantized")

prompt = "Write an essay on deep learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 📦 Intended Uses

* Fast inference with low memory footprint
* Deployment on consumer GPUs or edge devices
* Offline assistants, document generation, or chatbots

---

## ⚠️ Limitations

* This model has not been fine-tuned post-quantization.
* A slight accuracy drop may occur relative to the full-precision model, especially on precision-sensitive tasks such as math and code generation.
* Phi-2 is a pretrained model without alignment or safety tuning.

---

## 📈 Performance Notes

* **Quantization config:** W4G128 (4-bit, symmetric), 512 calibration samples, 1000 iterations
* **AutoRound version:** Latest (as of May 2025)
* **Target device:** GPU (A100/L4), float16 scale
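
The configuration above can be expressed as an AutoRound recipe. This is a sketch of how such a checkpoint might be produced, not the exact export script: the parameter values come from this card, but the precise AutoRound call signature is an assumption based on its public API and may differ between versions. The heavy imports are kept inside the function so the recipe can be read without the libraries installed.

```python
# Quantization recipe as stated on this card: W4G128, symmetric,
# 512 calibration samples, 1000 tuning iterations.
QUANT_CONFIG = {
    "bits": 4,
    "group_size": 128,
    "sym": True,
    "iters": 1000,
    "nsamples": 512,
}


def quantize_phi2(output_dir: str = "phi-2-4-bit-quantized") -> None:
    """Quantize microsoft/phi-2 to 4-bit with AutoRound and save the result.

    Assumed AutoRound API; check the intel/auto-round documentation for the
    exact arguments supported by your installed version.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

    autoround = AutoRound(model, tokenizer, **QUANT_CONFIG)
    autoround.quantize()
    autoround.save_quantized(output_dir)
```

Running `quantize_phi2()` requires a GPU with enough memory to hold the FP16 model during calibration.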

---