Phi-2 (GPTQ 4-bit)

Self-quantized GPTQ 4-bit checkpoint of microsoft/phi-2 with fully documented calibration provenance.

Created as part of the Banterhearts research program investigating quality-safety correlation under quantization for consumer LLM deployment.

| Property | Value |
|---|---|
| Base model | microsoft/phi-2 |
| Parameters | 2.78B |
| Architecture | MHA, parallel attn+MLP, 32 layers |
| Quantization | GPTQ 4-bit, group_size=128 |
| Model size | 1.8 GB |
| VRAM required | ~2.3 GB (inference) |
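As a sanity check on the figures above, a back-of-envelope estimate recovers the ~1.8 GB checkpoint size from 4-bit packing. This sketch assumes the embedding and lm_head matrices stay in FP16 (a common GPTQ convention, not stated in this card) and uses phi-2's vocabulary (51200) and hidden size (2560):

```python
# Back-of-envelope GPTQ 4-bit checkpoint size (illustrative assumptions).
n_params = 2.78e9             # total parameters (from the table above)
vocab, hidden = 51200, 2560   # phi-2 vocabulary and hidden size
group_size = 128

# Assumption: embed_tokens and lm_head are kept in FP16, not quantized.
fp16_params = 2 * vocab * hidden
quant_params = n_params - fp16_params

packed_bytes = quant_params * 4 / 8              # 4 bits/weight, INT32-packed
scale_bytes = (quant_params / group_size) * 2    # one FP16 scale per group
fp16_bytes = fp16_params * 2

total_gb = (packed_bytes + scale_bytes + fp16_bytes) / 1e9
print(f"estimated size: {total_gb:.2f} GB")  # lands near the reported 1.8 GB
```

Zero-points and metadata add a little more, so the estimate is a floor, not an exact figure.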

Quantization Details

| Parameter | Value |
|---|---|
| Method | GPTQ |
| Tool | gptqmodel 5.8.0 |
| Bits | 4 |
| Group size | 128 |
| Scheme | Symmetric (4-bit, INT32 packing) |
| Calibration dataset | allenai/c4 (en, shard 1 of 1024) |
| Calibration samples | 128 |
| Seed | 42 |
| Quantization time | 691s |
| Hardware | NVIDIA RTX 4080 Laptop (12 GB) via Docker |
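The "INT32 packing" row means eight 4-bit quantized weights are stored in each 32-bit integer. A minimal sketch of the idea (the exact bit ordering gptqmodel uses may differ; this is illustrative only):

```python
# Pack eight 4-bit values into one 32-bit word and recover them.
# Illustrative: gptqmodel's actual bit layout may differ.
def pack_int4(values):
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)  # value i occupies bits [4i, 4i+4)
    return word

def unpack_int4(word):
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 4]
assert unpack_int4(pack_int4(vals)) == vals  # round-trips losslessly
```

This is why a 4-bit checkpoint ships I32 tensors rather than a native 4-bit dtype.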

Why Self-Quantized?

Pre-quantized checkpoints on HuggingFace typically have unknown calibration provenance: the dataset, sample count, seed, and group size are rarely documented. This checkpoint was self-quantized with controlled, documented settings to enable rigorous cross-method comparison (GGUF k-quant vs. AWQ vs. GPTQ) in a NeurIPS 2026 submission on quality-safety correlation under quantization.

Evaluation Results

Evaluated on 735 quality samples across 7 tasks and 468 safety samples judged by gemma3:12b.

Quality Metrics (generation tasks)

| Metric | Score |
|---|---|
| BERTScore (F1) | 0.747 |
| ROUGE-L | 0.543 |
| Coherence | 0.708 |

Accuracy (capability tasks)

| Task | Accuracy |
|---|---|
| MMLU | 54.4% |
| ARC Challenge | 71.0% |
| Classification | 80.0% |

Safety Metrics (gemma3:12b judge)

| Metric | Score |
|---|---|
| Refusal Rate (AdvBench) | 32.0% |
| Truthfulness (TruthfulQA) | 26.0% |
| Unbiased Rate (BBQ) | 22.7% |

Other Quantization Formats

| Format | Repository |
|---|---|
| Original FP16 | microsoft/phi-2 |

Why No AWQ Variant?

AWQ quantization fails on phi-2 because of its parallel attention+MLP architecture, which produces NaNs in the smoothing grid search. AWQ assumes a sequential attention -> layernorm -> MLP data flow; phi-2 instead feeds attention and MLP in parallel from the same layernorm output. GPTQ works because it quantizes each layer independently via Hessian-based error minimization, with no cross-layer smoothing assumptions.
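The data-flow difference can be sketched with toy scalar stand-ins for the sublayers. The functions below are my own illustrative placeholders, not phi-2's actual code; the point is only where the layernorm output goes:

```python
# Toy scalar stand-ins for layernorm/attention/MLP (illustrative only).
def ln(x):   return x * 0.9   # pretend layernorm
def attn(x): return x + 1.0   # pretend attention
def mlp(x):  return x * 2.0   # pretend MLP

def sequential_block(x):
    # The flow AWQ's smoothing assumes: attention output is re-normed,
    # then fed to the MLP.
    h = x + attn(ln(x))
    return h + mlp(ln(h))

def parallel_block(x):
    # phi-2: attention and MLP both read the SAME layernorm output,
    # and their outputs are summed into one residual.
    h = ln(x)
    return x + attn(h) + mlp(h)
```

In the parallel block there is no norm sitting between attention and MLP for AWQ to fold its per-channel scales into, which is where the smoothing search breaks down.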

Prompt Template

```
Instruct: {prompt}
Output:
```
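Applied in code (the helper name is mine, not part of any library):

```python
def format_phi2_prompt(prompt: str) -> str:
    # Phi-2's instruct-style template from this card.
    return f"Instruct: {prompt}\nOutput:"

text = format_phi2_prompt("What is the capital of France?")
# The model's answer follows the trailing "Output:" marker.
```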

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading a GPTQ checkpoint requires a GPTQ backend (gptqmodel, or
# optimum + auto-gptq) to be installed; see the inference requirements below.
model = AutoModelForCausalLM.from_pretrained(
    "Crusadersk/phi-2-gptq-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Crusadersk/phi-2-gptq-4bit")

# Use the prompt template documented above.
prompt = "Instruct: What is the capital of France?\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Inference requirements: `pip install gptqmodel` (Linux only) or `pip install optimum auto-gptq`.

Windows users: GPTQ inference requires gptqmodel, which builds only on Linux. Use Docker or WSL2; see the reproduction instructions below.

Compatibility

| Framework | Supported |
|---|---|
| Transformers | Yes |
| vLLM | Yes (GPTQ backend) |
| llama.cpp | No (use GGUF format instead) |
| Ollama | No (use GGUF format instead) |
| Windows (native) | No (requires Linux/Docker) |

Reproduction

The full quantization pipeline (Dockerfiles, quantization scripts, and a 766-line engineering log documenting every platform failure and its solution) is available at:

research/tr142/expansion/

in the Banterhearts repository. Key files:

| File | Purpose |
|---|---|
| QUANTIZATION_LOG.md | 766-line engineering log with root-cause analysis for every failure |
| quantize_models.py | CLI for AWQ + GPTQ quantization with skip-existing and manifests |
| Dockerfile.gptq / Dockerfile.awq | Separate Docker images (irreconcilable dependency conflict) |
| smoke_test.py | Checkpoint verification with automatic Docker fallback for GPTQ |
| run_hf_eval.py | HuggingFace .generate() evaluation backend |

Citation

```bibtex
@misc{banterhearts2026phi2gptq,
  title = {Self-Quantized Phi-2 (GPTQ 4-bit) for Quality-Safety Correlation Research},
  author = {Kadadekar, Sahil},
  year = {2026},
  url = {https://huggingface.co/Crusadersk/phi-2-gptq-4bit},
  note = {Part of the Banterhearts research program. NeurIPS 2026 submission.}
}
```

Acknowledgments

This work is part of a 40-TR research program on consumer LLM deployment safety, conducted independently as pre-doctoral research. Full program details at github.com/Sahil170595/Banterhearts.

