# Phi-2 – GPTQ 4-bit
Self-quantized GPTQ 4-bit checkpoint of microsoft/phi-2 with fully documented calibration provenance.
Created as part of the Banterhearts research program investigating quality-safety correlation under quantization for consumer LLM deployment.
| Property | Value |
|---|---|
| Base model | microsoft/phi-2 |
| Parameters | 2.78B |
| Architecture | MHA, parallel attn+MLP, 32 layers |
| Quantization | GPTQ 4-bit, group_size=128 |
| Model size | 1.8 GB |
| VRAM required | ~2.3 GB (inference) |
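The size figure above can be sanity-checked with a back-of-envelope estimate. The sketch below makes simplifying assumptions not read from the checkpoint: all transformer linear weights are quantized to 4 bits with one fp16 scale plus a packed 4-bit zero point per 128-weight group, while the embedding and LM head (roughly 0.26B parameters, assumed untied) stay in fp16.

```python
# Back-of-envelope size estimate for the 4-bit GPTQ phi-2 checkpoint.
# Assumptions (not from the checkpoint): ~2.78B total parameters,
# ~0.26B of them in fp16 embedding + LM head, group_size=128.
TOTAL_PARAMS = 2.78e9
FP16_PARAMS = 0.26e9                 # embedding + lm_head kept in fp16 (assumed)
QUANT_PARAMS = TOTAL_PARAMS - FP16_PARAMS
GROUP_SIZE = 128

packed_weights = QUANT_PARAMS * 0.5               # 4 bits = 0.5 bytes per weight
group_overhead = QUANT_PARAMS / GROUP_SIZE * 2.5  # fp16 scale + 4-bit zero per group
fp16_weights = FP16_PARAMS * 2                    # 2 bytes per fp16 weight

total_gb = (packed_weights + group_overhead + fp16_weights) / 1e9
print(f"estimated size: {total_gb:.2f} GB")
```

The estimate lands close to the 1.8 GB observed on disk, which suggests the split between quantized and fp16 tensors is roughly as assumed.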
## Quantization Details
| Parameter | Value |
|---|---|
| Method | GPTQ |
| Tool | gptqmodel 5.8.0 |
| Bits | 4 |
| Group size | 128 |
| Scheme | Symmetric (4-bit, INT32 packing) |
| Calibration dataset | allenai/c4 (en, shard 1 of 1024) |
| Calibration samples | 128 |
| Seed | 42 |
| Quantization time | 691s |
| Hardware | NVIDIA RTX 4080 Laptop (12 GB) via Docker |
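The "INT32 packing" in the scheme row means eight 4-bit quantized weights are stored in each 32-bit word. A minimal sketch of the idea (illustrative only; gptqmodel's actual memory layout and nibble order may differ):

```python
def pack_int4(vals):
    """Pack eight 4-bit values (0..15) into one 32-bit word, low nibble first."""
    assert len(vals) == 8 and all(0 <= v <= 15 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (4 * i)
    return word

def unpack_int4(word):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

nibbles = [3, 15, 0, 7, 8, 1, 12, 5]
packed = pack_int4(nibbles)
assert unpack_int4(packed) == nibbles
assert packed < 2**32  # fits in a single INT32 lane
```

This 8:1 packing is why the 4-bit checkpoint occupies roughly a quarter of the fp16 weight storage, plus the per-group scale/zero overhead.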
## Why Self-Quantized?
Pre-quantized checkpoints on HuggingFace typically have unknown calibration provenance: the dataset, sample count, seed, and group size are rarely documented. This checkpoint was self-quantized with controlled, documented settings to enable rigorous cross-method comparison (GGUF k-quant vs AWQ vs GPTQ) in a NeurIPS 2026 submission on quality-safety correlation under quantization.
## Evaluation Results
Evaluated on 735 quality samples across 7 tasks and 468 safety samples judged by gemma3:12b.
### Quality Metrics (generation tasks)
| Metric | Score |
|---|---|
| BERTScore (F1) | 0.747 |
| ROUGE-L | 0.543 |
| Coherence | 0.708 |
### Accuracy (capability tasks)
| Task | Accuracy |
|---|---|
| MMLU | 54.4% |
| ARC Challenge | 71.0% |
| Classification | 80.0% |
### Safety Metrics (gemma3:12b judge)
| Metric | Score |
|---|---|
| Refusal Rate (AdvBench) | 32.0% |
| Truthfulness (TruthfulQA) | 26.0% |
| Unbiased Rate (BBQ) | 22.7% |
## Other Quantization Formats
| Format | Repository |
|---|---|
| Original FP16 | microsoft/phi-2 |
## Why No AWQ Variant?
AWQ quantization fails on phi-2 because of its parallel attention+MLP architecture, which produces NaNs in the smoothing grid search. AWQ assumes a sequential attention → layernorm → MLP data flow; phi-2 runs attention and the MLP in parallel from the same layernorm output. GPTQ works because it quantizes each layer independently via Hessian-based error minimization, with no cross-layer smoothing assumptions.
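The structural difference can be shown with a toy scalar model. This is purely illustrative: identity "layernorms" and linear stand-ins for attention and MLP, nothing like phi-2's real math.

```python
def ln(x): return x          # stand-in layernorm (identity for the toy)
def attn(h): return 2 * h    # stand-in attention
def mlp(h): return 3 * h     # stand-in MLP

def parallel_block(x):
    # phi-2 style: attention and MLP both read the SAME layernorm output
    h = ln(x)
    return x + attn(h) + mlp(h)

def sequential_block(x):
    # conventional style assumed by AWQ: attn feeds a second norm, then the MLP
    h = x + attn(ln(x))
    return h + mlp(ln(h))

print(parallel_block(1.0), sequential_block(1.0))
```

One reading of the failure mode: AWQ folds a smoothing scale into the layernorm feeding each linear, but in the parallel block a single `ln` feeds both `attn` and `mlp`, so one scale must serve two consumers at once.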
## Prompt Template

```
Instruct: {prompt}
Output:
```
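In code, the template is a one-line format. The helper name below is hypothetical, not part of the model's tooling:

```python
def build_prompt(user_request: str) -> str:
    # Phi-2 instruct template from the model card: "Instruct: {prompt}\nOutput:"
    return f"Instruct: {user_request}\nOutput:"

prompt = build_prompt("What is the capital of France?")
print(prompt)
```

The model completes text after `Output:`, so generation should start immediately after that token sequence with no trailing newline.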
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Crusadersk/phi-2-gptq-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Crusadersk/phi-2-gptq-4bit")

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Inference requirements: `pip install gptqmodel` (Linux only) or `optimum` + `auto-gptq`.

Windows users: GPTQ inference requires `gptqmodel`, which only builds on Linux. Use Docker or WSL2. See the reproduction instructions below.
## Compatibility
| Framework | Supported |
|---|---|
| Transformers | Yes |
| vLLM | Yes (GPTQ backend) |
| llama.cpp | No (use GGUF format instead) |
| Ollama | No (use GGUF format instead) |
| Windows (native) | No (requires Linux/Docker) |
## Reproduction
The full quantization pipeline (Dockerfiles, quantization scripts, and a 766-line engineering log documenting every platform failure and its solution) is available at `research/tr142/expansion/` in the Banterhearts repository. Key files:
| File | Purpose |
|---|---|
| `QUANTIZATION_LOG.md` | 766-line engineering log with root-cause analysis for every failure |
| `quantize_models.py` | CLI for AWQ + GPTQ quantization with skip-existing and manifests |
| `Dockerfile.gptq` / `Dockerfile.awq` | Separate Docker images (irreconcilable dependency conflict) |
| `smoke_test.py` | Checkpoint verification with automatic Docker fallback for GPTQ |
| `run_hf_eval.py` | HuggingFace `.generate()` evaluation backend |
## Citation

```bibtex
@misc{banterhearts2026phi2gptq,
  title  = {Self-Quantized Phi-2 (GPTQ 4-bit) for Quality-Safety Correlation Research},
  author = {Kadadekar, Sahil},
  year   = {2026},
  url    = {https://huggingface.co/Crusadersk/phi-2-gptq-4bit},
  note   = {Part of the Banterhearts research program. NeurIPS 2026 submission.}
}
```
## Acknowledgments
This work is part of a 40-TR research program on consumer LLM deployment safety, conducted independently as pre-doctoral research. Full program details at github.com/Sahil170595/Banterhearts.