# Prometheus 8x7B v2.0 — W4A16 (GPTQ)
Weight-only 4-bit GPTQ quantization of prometheus-eval/prometheus-8x7b-v2.0, produced with llm-compressor for efficient inference in vLLM.
The base model is an LLM-as-a-judge specialized for rubric-based evaluation of language model outputs. This quantized build shrinks the ~93 GB FP16 weight footprint to roughly 26 GB, so the model fits comfortably on a single 48 GB GPU.
## Quick Facts
| Property | Value |
|---|---|
| Base model | prometheus-eval/prometheus-8x7b-v2.0 (Mixtral 8x7B MoE) |
| Quantization scheme | W4A16 (INT4 weights, FP16 activations) |
| Algorithm | GPTQ |
| Tool | llm-compressor 0.10.0.1 |
| Calibration samples | 128 |
| Size on disk | ~26 GB |
| Minimum VRAM for inference | ~28 GB (fits on a single L40S or 40 GB A100) |
| License | Apache 2.0 (inherited from base model) |
## Usage

### vLLM (recommended)
```shell
vllm serve wjacksonrd/prometheus-8x7b-v2.0-W4A16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --dtype bfloat16
```
Then query via the OpenAI-compatible API:
```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wjacksonrd/prometheus-8x7b-v2.0-W4A16",
    "prompt": "[INST] ... your judge prompt ... [/INST]",
    "max_tokens": 1024,
    "temperature": 0.0
  }'
```
Follow the Prometheus 2 prompt format — the model expects rubric-formatted evaluation prompts.
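As a sketch, an absolute-grading judge prompt can be assembled like this. The template below is paraphrased from the Prometheus 2 documentation, not copied verbatim — consult the base model card for the exact wording before relying on it:

```python
# Paraphrased absolute-grading template (verify against the Prometheus 2
# repository for the exact text). Placeholders are filled per evaluation.
ABS_TEMPLATE = """###Task Description:
An instruction, a response to evaluate, a reference answer that gets a
score of 5, and a score rubric are given. Write detailed feedback, then
a score from 1 to 5 in the format: "Feedback: ... [RESULT] (1-5)"

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference}

###Score Rubrics:
{rubric}

###Feedback:"""

def build_judge_prompt(instruction: str, response: str,
                       reference: str, rubric: str) -> str:
    """Fill the rubric template and wrap it in Mixtral-style [INST] tags."""
    body = ABS_TEMPLATE.format(
        instruction=instruction, response=response,
        reference=reference, rubric=rubric,
    )
    return f"[INST] {body} [/INST]"
```

The resulting string goes in the `prompt` field of the completions request above.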
### Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("wjacksonrd/prometheus-8x7b-v2.0-W4A16")
model = AutoModelForCausalLM.from_pretrained(
    "wjacksonrd/prometheus-8x7b-v2.0-W4A16",
    device_map="auto",
)
```
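Prometheus 2 completions end with a `[RESULT] n` score after the feedback text. A small helper (illustrative, not part of the base repository) can split the two:

```python
import re

def parse_judge_output(text: str):
    """Split a Prometheus-style completion into (feedback, score).

    Expects the "Feedback: ... [RESULT] n" convention described in the
    Prometheus 2 paper; returns score=None if no [RESULT] marker is found.
    """
    match = re.search(r"\[RESULT\]\s*([1-5])", text)
    if match is None:
        return text.strip(), None
    feedback = text[: match.start()].strip()
    return feedback, int(match.group(1))
```

Returning `None` rather than raising keeps batch evaluation loops robust to the occasional malformed completion.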
## Quantization Recipe

Produced with llm-compressor 0.10.0.1:
```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
)

# calibration_dataset: 128 judge-style rubric prompts (see Limitations)
oneshot(
    model="prometheus-eval/prometheus-8x7b-v2.0",
    dataset=calibration_dataset,
    recipe=recipe,
    num_calibration_samples=128,
)
```
The MoE expert layers and attention projections are quantized; lm_head is left in full precision.
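The disk-size figure in the table is consistent with back-of-envelope arithmetic, assuming Mixtral 8x7B's ~46.7B total parameters (the exact number on disk also depends on group-size scales, zero-points, and the unquantized lm_head):

```python
# Rough weight-size estimate for W4A16 vs FP16 (parameter count assumed).
total_params = 46.7e9

fp16_gb = total_params * 2 / 1e9       # 2 bytes per FP16 weight
int4_gb = total_params * 4 / 8 / 1e9   # 4 bits per packed INT4 weight

print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"INT4 weights: ~{int4_gb:.0f} GB (+ scales and FP16 lm_head -> ~26 GB on disk)")
```

This matches the ~93 GB FP16 baseline and leaves a few GB of headroom for quantization metadata in the ~26 GB checkpoint.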
## Limitations
Inherits all limitations of the base model — see the base model card for details.
Additional quantization-specific notes:
- 4-bit weight quantization introduces small accuracy variance. Validate against your own held-out set before using in production.
- Calibration was performed on a narrow domain of judge-style rubric prompts; performance in substantially different evaluation domains may differ from the FP16 baseline. If that matters for your use case, calibrate your own build from the FP16 source.
## Citation
Please cite the original Prometheus 2 work:
```bibtex
@article{kim2024prometheus2,
  title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
  author={Kim, Seungone and Suk, Juyoung and Longpre, Shayne and Lin, Bill Yuchen and Shin, Jamin and Welleck, Sean and Neubig, Graham and Lee, Moontae and Lee, Kyungjae and Seo, Minjoon},
  journal={arXiv preprint arXiv:2405.01535},
  year={2024}
}
```
## Acknowledgements
- Prometheus Eval team for the base model
- LLM Compressor for the quantization toolchain
- vLLM for efficient serving of quantized MoE models