Mistral 7B Instruct v0.3 — GPTQ 4-bit

Self-quantized GPTQ 4-bit checkpoint of mistralai/Mistral-7B-Instruct-v0.3 with fully documented calibration provenance.

Created as part of the Banterhearts research program investigating quality-safety correlation under quantization for consumer LLM deployment.

| | |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.3 |
| Parameters | 7.24B |
| Architecture | GQA, 32 layers, 32 heads, 8 KV heads |
| Quantization | GPTQ 4-bit, group_size=128 |
| Model size | 3.9 GB |
| VRAM required | ~5 GB (inference) |
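The 3.9 GB figure is consistent with a back-of-envelope estimate: 4-bit packed weights plus one FP16 scale per group of 128 for the quantized matrices, with the embedding and output projections kept in 16-bit. The sketch below is an estimate, not a measurement of the checkpoint; the vocabulary and hidden sizes are Mistral 7B v0.3's published config values, and norm weights and other small tensors are ignored.

```python
# Rough size estimate for a GPTQ 4-bit Mistral 7B checkpoint.
# Assumptions (not measured from the checkpoint): embed_tokens and lm_head
# stay in 16-bit; each group of 128 weights carries one FP16 scale.

total_params = 7.24e9          # from the model card
vocab, hidden = 32768, 4096    # Mistral 7B v0.3 config
embed_params = 2 * vocab * hidden          # embed_tokens + lm_head (untied)
quant_params = total_params - embed_params

bytes_per_weight = 4 / 8 + 2 / 128         # packed 4-bit weight + FP16 group scale
est_bytes = quant_params * bytes_per_weight + embed_params * 2
est_gib = est_bytes / 2**30
print(f"~{est_gib:.1f} GiB")               # prints ~3.8 GiB, near the reported 3.9 GB
```

The ~5 GB inference figure then follows from adding KV cache and activation overhead on top of the weights.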

Quantization Details

| Parameter | Value |
|---|---|
| Method | GPTQ |
| Tool | gptqmodel |
| Bits | 4 |
| Group size | 128 |
| Scheme | Symmetric (4-bit weights, INT32 packing) |
| Calibration dataset | allenai/c4 (en, shard 1 of 1024) |
| Calibration samples | 128 |
| Seed | 42 |
| Quantization time | 542 s |
| Hardware | RunPod RTX 6000 Ada (48 GB) |
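Wired together, the settings in the table correspond roughly to a run like the following. This is a hedged sketch, not the exact script used: the gptqmodel calls follow that library's documented `GPTQModel`/`QuantizeConfig` API but are untested here (they need a GPU and the FP16 checkpoint), and the way the 128 calibration documents were drawn from the C4 shard is an assumption.

```python
# Sketch of the quantization run with the settings documented above.
# The sampling of calibration documents is an assumption, not the exact
# procedure used for this checkpoint.

SETTINGS = {
    "bits": 4,
    "group_size": 128,
    "sym": True,           # symmetric scheme, per the table
    "seed": 42,
    "num_calibration_samples": 128,
}

def quantize(out_dir: str = "mistral-7b-gptq-4bit") -> None:
    import random
    from datasets import load_dataset
    from gptqmodel import GPTQModel, QuantizeConfig

    # Calibration set: 128 documents from shard 1 of 1024 of C4 (en).
    random.seed(SETTINGS["seed"])
    ds = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00000-of-01024.json.gz",
        split="train",
    )
    idx = random.sample(range(len(ds)), SETTINGS["num_calibration_samples"])
    calibration = [ds[i]["text"] for i in idx]

    cfg = QuantizeConfig(
        bits=SETTINGS["bits"],
        group_size=SETTINGS["group_size"],
        sym=SETTINGS["sym"],
    )
    model = GPTQModel.load("mistralai/Mistral-7B-Instruct-v0.3", cfg)
    model.quantize(calibration)
    model.save(out_dir)
```

See the Reproduction section for the actual pipeline and engineering log.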

Why Self-Quantized?

Pre-quantized checkpoints on HuggingFace typically have unknown calibration provenance — the dataset, sample count, seed, and group size are rarely documented. This checkpoint was self-quantized with controlled, documented settings to enable rigorous cross-method comparison (GGUF k-quant vs AWQ vs GPTQ) in a NeurIPS 2026 submission on quality-safety correlation under quantization.

Evaluation Results

Evaluation pending — quality and safety benchmarks will be run on this checkpoint and results updated here.

Other Quantization Formats

| Format | Repository |
|---|---|
| Original FP16 | mistralai/Mistral-7B-Instruct-v0.3 |
| AWQ 4-bit | Crusadersk/mistral-7b-awq-4bit |

Prompt Template

```
[INST] {prompt} [/INST]
```
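When a backend takes raw strings rather than chat messages (for example, a plain completion endpoint), the template can be applied by hand. A minimal helper for the single-turn case shown above; the BOS token is normally prepended by the tokenizer, so it is omitted here:

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in Mistral's [INST] ... [/INST] template."""
    return f"[INST] {user_message} [/INST]"

print(build_prompt("What is the capital of France?"))
# [INST] What is the capital of France? [/INST]
```

For multi-turn conversations, prefer `tokenizer.apply_chat_template`, which handles BOS/EOS placement across turns.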

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Crusadersk/mistral-7b-gptq-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Crusadersk/mistral-7b-gptq-4bit")

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Inference requirements: `pip install gptqmodel` (Linux only) or `pip install optimum auto-gptq`.

Windows users: GPTQ inference depends on gptqmodel, whose CUDA kernels only build on Linux. Use Docker or WSL2.

Compatibility

| Framework | Supported |
|---|---|
| Transformers | Yes |
| vLLM | Yes (GPTQ backend) |
| llama.cpp | No (use the GGUF format instead) |
| Ollama | No (use the GGUF format instead) |
| Windows (native) | No (requires Linux/Docker) |
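Since vLLM's GPTQ backend is supported, serving this checkpoint should look roughly like the sketch below. It is untested here (a CUDA GPU and `pip install vllm` are required), and the demo prompt is illustrative; the engine arguments follow vLLM's documented `LLM` API.

```python
# Hypothetical vLLM loading sketch for this checkpoint (untested, needs a GPU).
ENGINE_ARGS = {
    "model": "Crusadersk/mistral-7b-gptq-4bit",
    "quantization": "gptq",   # select vLLM's GPTQ weight loader
}

def run_demo() -> str:
    from vllm import LLM, SamplingParams

    llm = LLM(**ENGINE_ARGS)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    # vLLM's generate() takes raw prompt strings, so apply the
    # [INST] ... [/INST] template from the Prompt Template section by hand.
    outputs = llm.generate(
        ["[INST] What is the capital of France? [/INST]"], params
    )
    return outputs[0].outputs[0].text
```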

Reproduction

The full quantization pipeline — Dockerfiles, quantization scripts, and a 766-line engineering log documenting every platform failure and solution — is available at:

research/tr142/expansion/

in the Banterhearts repository.

Citation

```bibtex
@misc{banterhearts2026mistral7bgptq,
  title = {Self-Quantized Mistral 7B Instruct v0.3 (GPTQ 4-bit) for Quality-Safety Correlation Research},
  author = {Kadadekar, Sahil},
  year = {2026},
  url = {https://huggingface.co/Crusadersk/mistral-7b-gptq-4bit},
  note = {Part of the Banterhearts research program. NeurIPS 2026 submission.}
}
```

Acknowledgments

This work is part of a 40-TR research program on consumer LLM deployment safety, conducted independently as pre-doctoral research. Full program details at github.com/Sahil170595/Banterhearts.
