# Mistral 7B Instruct v0.3 — AWQ 4-bit
Self-quantized AWQ 4-bit checkpoint of mistralai/Mistral-7B-Instruct-v0.3 with fully documented calibration provenance.
Created as part of the Banterhearts research program investigating quality-safety correlation under quantization for consumer LLM deployment.
| Property | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.3 |
| Parameters | 7.24B |
| Architecture | GQA, 32 layers, 32 heads, 8 KV heads |
| Quantization | AWQ W4A16 asymmetric, group_size=128 |
| Model size | 3.9 GB |
| VRAM required | ~5 GB (inference) |
## Quantization Details
| Parameter | Value |
|---|---|
| Method | AWQ |
| Tool | llmcompressor |
| Bits | 4 |
| Group size | 128 |
| Scheme | W4A16_ASYM (4-bit weights, 16-bit activations, asymmetric) |
| Calibration dataset | Salesforce/wikitext (wikitext-103-raw-v1) |
| Calibration samples | 128 |
| Seed | 42 |
| Quantization time | 1003s |
| Hardware | RunPod RTX 6000 Ada (48 GB) |
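The checkpoint size in the table can be sanity-checked with back-of-envelope arithmetic. This is a rough sketch, assuming fp16 scales and 4-bit zero points per group of 128 weights (a typical layout for W4A16_ASYM); unquantized embeddings, the lm_head, and file metadata plausibly account for the remaining gap up to 3.9 GB:

```python
# Back-of-envelope size estimate for 4-bit weights with group_size=128.
params = 7.24e9          # parameter count from the table above
bits_per_weight = 4      # W4A16: 4-bit weights
group_size = 128

# Assumption: each group of 128 weights stores one fp16 scale (16 bits)
# and one 4-bit zero point (asymmetric scheme), i.e. ~20/128 extra bits
# per weight of quantization metadata.
overhead_bits = (16 + 4) / group_size

total_gb = params * (bits_per_weight + overhead_bits) / 8 / 1e9
print(f"{total_gb:.2f} GB")  # → 3.76 GB
```

The estimate lands just under the reported 3.9 GB, consistent with a few layers (embeddings, lm_head) being kept in higher precision.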
## Why Self-Quantized?
Pre-quantized checkpoints on HuggingFace typically have unknown calibration provenance — the dataset, sample count, seed, and group size are rarely documented. This checkpoint was self-quantized with controlled, documented settings to enable rigorous cross-method comparison (GGUF k-quant vs AWQ vs GPTQ) in a NeurIPS 2026 submission on quality-safety correlation under quantization.
## Evaluation Results
Evaluation pending — quality and safety benchmarks will be run on this checkpoint and results updated here.
## Other Quantization Formats
| Format | Repository |
|---|---|
| Original FP16 | mistralai/Mistral-7B-Instruct-v0.3 |
| GPTQ 4-bit | Crusadersk/mistral-7b-gptq-4bit |
## Prompt Template

```
[INST] {prompt} [/INST]
```
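For illustration, the single-turn rendering above can be reproduced with a small formatter. `format_mistral_prompt` is a hypothetical helper, not part of this repository; in practice, prefer `tokenizer.apply_chat_template`, which applies the model's canonical template including special tokens:

```python
def format_mistral_prompt(messages):
    """Render {role, content} messages into the Mistral [INST] template.

    Illustration only: prefer tokenizer.apply_chat_template in real code.
    """
    out = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            out += f"[INST] {msg['content']} [/INST]"
        elif msg["role"] == "assistant":
            out += f" {msg['content']}</s>"
    return out

print(format_mistral_prompt(
    [{"role": "user", "content": "What is the capital of France?"}]
))
# → <s>[INST] What is the capital of France? [/INST]
```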
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Crusadersk/mistral-7b-awq-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Crusadersk/mistral-7b-awq-4bit")

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Compatibility
| Framework | Supported |
|---|---|
| Transformers | Yes |
| vLLM | Yes (AWQ backend) |
| llama.cpp | No (use GGUF format instead) |
| Ollama | No (use GGUF format instead) |
| Windows (native) | No — requires Linux/Docker |
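As a rough sketch of vLLM deployment (flags are illustrative and version-dependent; recent vLLM releases auto-detect AWQ from the checkpoint's quantization config):

```shell
# Serve the AWQ checkpoint via vLLM's OpenAI-compatible server.
# --quantization awq is usually optional (auto-detected from config).
vllm serve Crusadersk/mistral-7b-awq-4bit \
  --quantization awq \
  --max-model-len 8192
```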
## Reproduction
The full quantization pipeline — Dockerfiles, quantization scripts, and a 766-line engineering log documenting every platform failure and solution — is available at `research/tr142/expansion/` in the Banterhearts repository.
## Citation

```bibtex
@misc{banterhearts2026mistral7bawq,
  title  = {Self-Quantized Mistral 7B Instruct v0.3 (AWQ 4-bit) for Quality-Safety Correlation Research},
  author = {Kadadekar, Sahil},
  year   = {2026},
  url    = {https://huggingface.co/Crusadersk/mistral-7b-awq-4bit},
  note   = {Part of the Banterhearts research program. NeurIPS 2026 submission.}
}
```
## Acknowledgments
This work is part of a 40-TR research program on consumer LLM deployment safety, conducted independently as pre-doctoral research. Full program details at github.com/Sahil170595/Banterhearts.