Phi-2 (GPTQ 4-bit)

Self-quantized GPTQ 4-bit checkpoint of microsoft/phi-2 with fully documented calibration provenance.

Created as part of the Banterhearts research program investigating quality-safety correlation under quantization for consumer LLM deployment.

| Property | Value |
|---|---|
| Base model | microsoft/phi-2 |
| Parameters | 2.78B |
| Architecture | MHA, parallel attn+MLP, 32 layers |
| Quantization | GPTQ 4-bit, group_size=128 |
| Model size | 1.8 GB |
| VRAM required | ~2.3 GB (inference) |
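As a sanity check on the figures above, a back-of-envelope estimate recovers the ~1.8 GB checkpoint size from 4-bit packing. This sketch assumes the embedding and lm_head matrices stay in FP16 (a common GPTQ convention, not stated in this card) and uses phi-2's vocabulary (51200) and hidden size (2560):

```python
# Back-of-envelope GPTQ 4-bit checkpoint size (illustrative assumptions).
n_params = 2.78e9             # total parameters (from the table above)
vocab, hidden = 51200, 2560   # phi-2 vocabulary and hidden size
group_size = 128

# Assumption: embed_tokens and lm_head are kept in FP16, not quantized.
fp16_params = 2 * vocab * hidden
quant_params = n_params - fp16_params

packed_bytes = quant_params * 4 / 8              # 4 bits/weight, INT32-packed
scale_bytes = (quant_params / group_size) * 2    # one FP16 scale per group
fp16_bytes = fp16_params * 2

total_gb = (packed_bytes + scale_bytes + fp16_bytes) / 1e9
print(f"estimated size: {total_gb:.2f} GB")  # lands near the reported 1.8 GB
```

Zero-points and metadata add a little more, so the estimate is a floor, not an exact figure.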

Quantization Details

| Parameter | Value |
|---|---|
| Method | GPTQ |
| Tool | gptqmodel 5.8.0 |
| Bits | 4 |
| Group size | 128 |
| Scheme | Symmetric (4-bit, INT32 packing) |
| Calibration dataset | allenai/c4 (en, shard 1 of 1024) |
| Calibration samples | 128 |
| Seed | 42 |
| Quantization time | 691s |
| Hardware | NVIDIA RTX 4080 Laptop (12 GB) via Docker |
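The "INT32 packing" row means eight 4-bit quantized weights are stored in each 32-bit integer. A minimal sketch of the idea (the exact bit ordering gptqmodel uses may differ; this is illustrative only):

```python
# Pack eight 4-bit values into one 32-bit word and recover them.
# Illustrative: gptqmodel's actual bit layout may differ.
def pack_int4(values):
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)  # value i occupies bits [4i, 4i+4)
    return word

def unpack_int4(word):
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 4]
assert unpack_int4(pack_int4(vals)) == vals  # round-trips losslessly
```

This is why a 4-bit checkpoint ships I32 tensors rather than a native 4-bit dtype.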

Why Self-Quantized?

Pre-quantized checkpoints on HuggingFace typically have unknown calibration provenance: the dataset, sample count, seed, and group size are rarely documented. This checkpoint was self-quantized with controlled, documented settings to enable rigorous cross-method comparison (GGUF k-quant vs. AWQ vs. GPTQ) in a NeurIPS 2026 submission on quality-safety correlation under quantization.

Evaluation Results

Evaluated on 735 quality samples across 7 tasks and 468 safety samples judged by gemma3:12b.

Quality Metrics (generation tasks)

| Metric | Score |
|---|---|
| BERTScore (F1) | 0.747 |
| ROUGE-L | 0.543 |
| Coherence | 0.708 |

Accuracy (capability tasks)

| Task | Accuracy |
|---|---|
| MMLU | 54.4% |
| ARC Challenge | 71.0% |
| Classification | 80.0% |

Safety Metrics (gemma3:12b judge)

| Metric | Score |
|---|---|
| Refusal Rate (AdvBench) | 32.0% |
| Truthfulness (TruthfulQA) | 26.0% |
| Unbiased Rate (BBQ) | 22.7% |

Other Quantization Formats

| Format | Repository |
|---|---|
| Original FP16 | microsoft/phi-2 |

Why No AWQ Variant?

AWQ quantization fails on phi-2 because of its parallel attention+MLP architecture, which produces NaNs in the smoothing grid search. AWQ assumes a sequential attention -> layernorm -> MLP data flow; phi-2 instead feeds attention and MLP in parallel from the same layernorm output. GPTQ works because it quantizes each layer independently via Hessian-based error minimization, with no cross-layer smoothing assumptions.
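The data-flow difference can be sketched with toy scalar stand-ins for the sublayers. The functions below are my own illustrative placeholders, not phi-2's actual code; the point is only where the layernorm output goes:

```python
# Toy scalar stand-ins for layernorm/attention/MLP (illustrative only).
def ln(x):   return x * 0.9   # pretend layernorm
def attn(x): return x + 1.0   # pretend attention
def mlp(x):  return x * 2.0   # pretend MLP

def sequential_block(x):
    # The flow AWQ's smoothing assumes: attention output is re-normed,
    # then fed to the MLP.
    h = x + attn(ln(x))
    return h + mlp(ln(h))

def parallel_block(x):
    # phi-2: attention and MLP both read the SAME layernorm output,
    # and their outputs are summed into one residual.
    h = ln(x)
    return x + attn(h) + mlp(h)
```

In the parallel block there is no norm sitting between attention and MLP for AWQ to fold its per-channel scales into, which is where the smoothing search breaks down.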

Prompt Template

```
Instruct: {prompt}
Output:
```
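Applied in code (the helper name is mine, not part of any library):

```python
def format_phi2_prompt(prompt: str) -> str:
    # Phi-2's instruct-style template from this card.
    return f"Instruct: {prompt}\nOutput:"

text = format_phi2_prompt("What is the capital of France?")
# The model's answer follows the trailing "Output:" marker.
```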

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading a GPTQ checkpoint requires a GPTQ backend (gptqmodel, or
# optimum + auto-gptq) to be installed; see the inference requirements below.
model = AutoModelForCausalLM.from_pretrained(
    "Crusadersk/phi-2-gptq-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Crusadersk/phi-2-gptq-4bit")

# Use the prompt template documented above.
prompt = "Instruct: What is the capital of France?\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Inference requirements: `pip install gptqmodel` (Linux only) or `pip install optimum auto-gptq`.

Windows users: GPTQ inference requires gptqmodel, which builds only on Linux. Use Docker or WSL2; see the reproduction instructions below.

Compatibility

| Framework | Supported |
|---|---|
| Transformers | Yes |
| vLLM | Yes (GPTQ backend) |
| llama.cpp | No (use GGUF format instead) |
| Ollama | No (use GGUF format instead) |
| Windows (native) | No (requires Linux/Docker) |

Reproduction

The full quantization pipeline (Dockerfiles, quantization scripts, and a 766-line engineering log documenting every platform failure and its solution) is available at:

research/tr142/expansion/

in the Banterhearts repository. Key files:

| File | Purpose |
|---|---|
| QUANTIZATION_LOG.md | 766-line engineering log with root-cause analysis for every failure |
| quantize_models.py | CLI for AWQ + GPTQ quantization with skip-existing and manifests |
| Dockerfile.gptq / Dockerfile.awq | Separate Docker images (irreconcilable dependency conflict) |
| smoke_test.py | Checkpoint verification with automatic Docker fallback for GPTQ |
| run_hf_eval.py | HuggingFace .generate() evaluation backend |

Citation

```bibtex
@misc{banterhearts2026phi2gptq,
  title = {Self-Quantized Phi-2 (GPTQ 4-bit) for Quality-Safety Correlation Research},
  author = {Kadadekar, Sahil},
  year = {2026},
  url = {https://huggingface.co/Crusadersk/phi-2-gptq-4bit},
  note = {Part of the Banterhearts research program. NeurIPS 2026 submission.}
}
```

Acknowledgments

This work is part of a 40-TR research program on consumer LLM deployment safety, conducted independently as pre-doctoral research. Full program details at github.com/Sahil170595/Banterhearts.

