---
library_name: transformers
license: mit
base_model:
- microsoft/phi-2
pipeline_tag: text-generation
---

# 🧠 Phi-2 (4-bit Quantized with AutoRound)

This is a 4-bit quantized version of the [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) model, produced with Intel's [AutoRound](https://github.com/intel/auto-round) library for weight-only post-training quantization (W4G128). It achieves significant compression while largely preserving model quality, making it well suited to resource-constrained inference.

---

## 🧾 Model Details

* **Base model:** microsoft/phi-2
* **Quantization method:** AutoRound (W4G128: 4-bit weights, group size 128)
* **Framework:** 🤗 Transformers
* **Precision:** 4-bit weights
* **Quantized size:** ~1.85 GB (original: ~5.5 GB)
* **Compression:** ~67% size reduction (≈3× smaller)

---

## 🚀 How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the quantized checkpoint; device_map="auto" places it on the available GPU.
model = AutoModelForCausalLM.from_pretrained(
    "itachi023/phi-2-4-bit-quantized",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("itachi023/phi-2-4-bit-quantized")

prompt = "Write an essay on deep learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generation needs no gradients; disabling them saves memory.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
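
For interactive use (e.g., the chatbot scenarios listed under Intended Uses), you can stream tokens as they are generated instead of waiting for the full completion. Below is a minimal sketch using 🤗 Transformers' built-in `TextStreamer`; the repo id and generation settings simply mirror the example above.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "itachi023/phi-2-4-bit-quantized", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("itachi023/phi-2-4-bit-quantized")

# TextStreamer prints tokens to stdout as soon as they are decoded.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("Explain quantization in one paragraph.", return_tensors="pt").to(model.device)
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=100, streamer=streamer)
```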

---

## 📦 Intended Uses

* Fast inference with a low memory footprint (see the memory-check sketch after this list)
* Deployment on consumer GPUs or edge devices
* Offline assistants, document generation, or chatbots
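
To verify the memory-footprint claim on your own hardware, a quick check with Transformers' `get_memory_footprint()` (repo id as above) is sketched below; reported numbers will vary with your setup.

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "itachi023/phi-2-4-bit-quantized", torch_dtype=torch.float16, device_map="auto"
)

# get_memory_footprint() reports the bytes used by parameters and buffers.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```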

---

## ⚠️ Limitations

* This model has not been fine-tuned after quantization.
* A slight accuracy drop relative to the full-precision model is possible, especially on tasks that are sensitive to reduced weight precision.
* Phi-2 is a pretrained base model without alignment or safety tuning.

---

## 📊 Performance Notes

* **Quantization config:** W4G128 (4-bit, symmetric), 512 calibration samples, 1000 tuning iterations (a reproduction sketch follows this list)
* **AutoRound version:** latest release as of May 2025
* **Target device:** GPU (A100/L4), float16 scales
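
For reference, here is a sketch of how a quantization run with the configuration above might look. Argument names follow the auto-round Python API at the time of writing and may differ across versions, so treat this as an approximation rather than the exact command used to produce this checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# W4G128, symmetric: 4-bit weights, group size 128,
# 512 calibration samples, 1000 tuning iterations.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    nsamples=512,
    iters=1000,
)
autoround.quantize()
autoround.save_quantized("./phi-2-4-bit-quantized")
```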

---