---
library_name: transformers
license: mit
base_model:
- microsoft/phi-2
pipeline_tag: text-generation
---

# 🧠 Phi-2 (4-bit Quantized with AutoRound)

This is a 4-bit quantized version of the [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) model, produced with Intel's [AutoRound](https://github.com/intel/auto-round) weight-only post-training quantization (W4G128). It achieves significant compression while preserving model quality, making it well suited to resource-constrained inference.

---

## 🧾 Model Details

* **Base model:** microsoft/phi-2
* **Quantization method:** AutoRound (W4G128: 4-bit weights, group size 128)
* **Framework:** 🤗 Transformers
* **Precision:** 4-bit weights
* **Quantized size:** ~1.85 GB (original: ~5.5 GB)
* **Compression ratio:** ~66% size reduction

---

## 🚀 How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the quantized model; device_map="auto" places it on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    "itachi023/phi-2-4-bit-quantized",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("itachi023/phi-2-4-bit-quantized")

prompt = "Write an essay on deep learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 📦 Intended Uses

* Fast inference with a low memory footprint
* Deployment on consumer GPUs or edge devices
* Offline assistants, document generation, and chatbots

---

## ⚠️ Limitations

* The model has not been fine-tuned after quantization.
* A slight accuracy drop may occur compared with the full-precision model, especially on accuracy-sensitive NLP tasks.
* Phi-2 is a pretrained base model without alignment or safety tuning.

---

## 📈 Performance Notes

* **Quantization config:** W4G128 (4-bit, symmetric), 512 calibration samples, 1000 tuning iterations
* **AutoRound version:** latest as of May 2025
* **Target device:** GPU (A100/L4), float16 scales

---
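
## 🔁 Reproducing the Quantization (Sketch)

The snippet below is a minimal, illustrative sketch of how a quantization with the settings listed above (4-bit symmetric weights, group size 128, 512 calibration samples, 1000 iterations) could be produced with AutoRound. It is not the exact script used for this checkpoint; argument names and defaults may vary across AutoRound versions, so treat it as a starting point rather than a definitive recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# W4G128: 4-bit weights, group size 128, symmetric quantization.
# nsamples/iters are assumed to mirror the calibration settings reported above.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    nsamples=512,
    iters=1000,
)
autoround.quantize()

# Save in a format loadable through 🤗 Transformers
autoround.save_quantized("./phi-2-4-bit-quantized", format="auto_round")
```

---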