Prometheus 8x7B v2.0 — W4A16 (GPTQ)

Weight-only 4-bit GPTQ quantization of prometheus-eval/prometheus-8x7b-v2.0, produced with llm-compressor for efficient inference in vLLM.

The base model is an LLM-as-a-judge specialized for rubric-based evaluation of language-model outputs. This quantized build shrinks the ~93 GB FP16 weight footprint to ~26 GB, so the model fits comfortably on a single 48 GB GPU.

Quick Facts

| Property | Value |
|----------|-------|
| Base model | prometheus-eval/prometheus-8x7b-v2.0 (Mixtral 8x7B MoE) |
| Quantization scheme | W4A16 (INT4 weights, FP16 activations) |
| Algorithm | GPTQ |
| Tool | llm-compressor 0.10.0.1 |
| Calibration samples | 128 |
| Size on disk | ~26 GB |
| Minimum VRAM for inference | ~28 GB (fits a single L40S or A100 40GB) |
| License | Apache 2.0 (inherited from base model) |
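The size figures above line up with a quick back-of-envelope check. The sketch below assumes a ~46.7 B total parameter count for Mixtral 8x7B and a GPTQ group size of 128 with FP16 scales; neither number is read from the checkpoint itself.

```python
# Rough weight-size estimate for W4A16 GPTQ on a ~46.7B-parameter Mixtral 8x7B.
# Assumptions: group size 128 with one FP16 scale per group.
GB = 1e9

params = 46.7e9                      # total parameters (approximate)
fp16_gb = params * 2 / GB            # FP16 baseline: 2 bytes per weight -> ~93 GB

packed_gb = params * 0.5 / GB        # INT4: 4 bits = 0.5 bytes per weight
scales_gb = params / 128 * 2 / GB    # FP16 scale per group of 128 weights
int4_gb = packed_gb + scales_gb      # ~24 GB before zero-points and FP16 layers

print(f"FP16: ~{fp16_gb:.0f} GB, W4A16 core: ~{int4_gb:.0f} GB")
```

Zero-points, the unquantized lm_head, and packing overhead account for the gap up to the ~26 GB observed on disk.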

Usage

vLLM (recommended)

```shell
vllm serve wjacksonrd/prometheus-8x7b-v2.0-W4A16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --dtype bfloat16
```

Then query via the OpenAI-compatible API:

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wjacksonrd/prometheus-8x7b-v2.0-W4A16",
    "prompt": "[INST] ... your judge prompt ... [/INST]",
    "max_tokens": 1024,
    "temperature": 0.0
  }'
```

Follow the Prometheus 2 prompt format — the model expects rubric-formatted evaluation prompts.
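As an illustrative sketch of that layout, a judge prompt for absolute grading can be assembled like this. The section headings are an approximation of the Prometheus 2 template; for production use, take the exact templates from the prometheus-eval repository.

```python
# Sketch of a Prometheus-2-style absolute-grading prompt. The section
# headings approximate the official template from the prometheus-eval repo.
def build_judge_prompt(instruction, response, reference, rubric):
    """Assemble an absolute-grading judge prompt in the Mixtral [INST] wrapper."""
    body = (
        "###Task Description:\n"
        "An instruction, a response to evaluate, a reference answer that gets "
        "a score of 5, and a score rubric are given. Write detailed feedback "
        "assessing the response strictly by the rubric, then give a score of "
        "1 to 5. End your answer with: [RESULT] <score>\n\n"
        f"###The instruction to evaluate:\n{instruction}\n\n"
        f"###Response to evaluate:\n{response}\n\n"
        f"###Reference Answer (Score 5):\n{reference}\n\n"
        f"###Score Rubrics:\n{rubric}\n\n"
        "###Feedback:"
    )
    return f"[INST] {body} [/INST]"
```

The resulting string goes into the `prompt` field of the completion request above.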

Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# The compressed-tensors package (installed with llm-compressor) is required
# to deserialize the INT4 weights.
tokenizer = AutoTokenizer.from_pretrained("wjacksonrd/prometheus-8x7b-v2.0-W4A16")
model = AutoModelForCausalLM.from_pretrained(
    "wjacksonrd/prometheus-8x7b-v2.0-W4A16",
    device_map="auto",
    torch_dtype="auto",
)
```

Quantization Recipe

Produced with llm-compressor 0.10.0.1:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",     # quantize every Linear layer...
    scheme="W4A16",       # ...to INT4 weights, FP16 activations
    ignore=["lm_head"],   # except the output head, kept in full precision
)

oneshot(
    model="prometheus-eval/prometheus-8x7b-v2.0",
    dataset=calibration_dataset,  # 128 judge-style rubric prompts (see Limitations)
    recipe=recipe,
    num_calibration_samples=128,
)
```

The MoE expert layers and attention projections are quantized; lm_head is left in full precision.
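The recipe's `targets`/`ignore` selection can be mirrored in a few lines. This is a sketch of the selection logic only; the example module names follow the Hugging Face Mixtral implementation and are assumptions about this checkpoint's layout.

```python
# Mimics GPTQModifier(targets="Linear", ignore=["lm_head"]): every Linear
# module is quantized unless its name matches an ignore pattern.
def is_quantized(name: str, ignore=("lm_head",)) -> bool:
    return not any(pattern in name for pattern in ignore)

# Example module names from the HF Mixtral implementation (assumed layout):
modules = [
    "model.layers.0.self_attn.q_proj",               # attention projection
    "model.layers.0.block_sparse_moe.experts.3.w1",  # MoE expert MLP
    "lm_head",                                       # output head
]
for name in modules:
    print(name, "-> INT4" if is_quantized(name) else "-> FP16")
```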

Limitations

Inherits all limitations of the base model — see the base model card for details.

Additional quantization-specific notes:

  • 4-bit weight quantization introduces small accuracy degradation relative to the FP16 baseline. Validate against your own held-out set before using in production.
  • Calibration was performed on a narrow domain of judge-style rubric prompts; performance in substantially different evaluation domains may differ from the FP16 baseline. If that matters for your use case, calibrate your own build from the FP16 source.

Citation

Please cite the original Prometheus 2 work:

```bibtex
@article{kim2024prometheus2,
  title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
  author={Kim, Seungone and Suk, Juyoung and Longpre, Shayne and Lin, Bill Yuchen and Shin, Jamin and Welleck, Sean and Neubig, Graham and Lee, Moontae and Lee, Kyungjae and Seo, Minjoon},
  journal={arXiv preprint arXiv:2405.01535},
  year={2024}
}
```
