Prometheus 8x7B v2.0 — W4A16 (GPTQ)

Weight-only 4-bit GPTQ quantization of prometheus-eval/prometheus-8x7b-v2.0, produced with llm-compressor for efficient inference in vLLM.

The base model is an LLM-as-a-judge specialized for rubric-based evaluation of language-model outputs. This quantized build shrinks the ~93 GB FP16 weight footprint to ~26 GB, so the model fits comfortably on a single 48 GB GPU.

Quick Facts

| Property | Value |
|----------|-------|
| Base model | prometheus-eval/prometheus-8x7b-v2.0 (Mixtral 8x7B MoE) |
| Quantization scheme | W4A16 (INT4 weights, FP16 activations) |
| Algorithm | GPTQ |
| Tool | llm-compressor 0.10.0.1 |
| Calibration samples | 128 |
| Size on disk | ~26 GB |
| Minimum VRAM for inference | ~28 GB (fits a single L40S or A100 40GB) |
| License | Apache 2.0 (inherited from base model) |
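The size figures above line up with a quick back-of-envelope check. The sketch below assumes a ~46.7 B total parameter count for Mixtral 8x7B and a GPTQ group size of 128 with FP16 scales; neither number is read from the checkpoint itself.

```python
# Rough weight-size estimate for W4A16 GPTQ on a ~46.7B-parameter Mixtral 8x7B.
# Assumptions: group size 128 with one FP16 scale per group.
GB = 1e9

params = 46.7e9                      # total parameters (approximate)
fp16_gb = params * 2 / GB            # FP16 baseline: 2 bytes per weight -> ~93 GB

packed_gb = params * 0.5 / GB        # INT4: 4 bits = 0.5 bytes per weight
scales_gb = params / 128 * 2 / GB    # FP16 scale per group of 128 weights
int4_gb = packed_gb + scales_gb      # ~24 GB before zero-points and FP16 layers

print(f"FP16: ~{fp16_gb:.0f} GB, W4A16 core: ~{int4_gb:.0f} GB")
```

Zero-points, the unquantized lm_head, and packing overhead account for the gap up to the ~26 GB observed on disk.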

Usage

vLLM (recommended)

```shell
vllm serve wjacksonrd/prometheus-8x7b-v2.0-W4A16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --dtype bfloat16
```

Then query via the OpenAI-compatible API:

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wjacksonrd/prometheus-8x7b-v2.0-W4A16",
    "prompt": "[INST] ... your judge prompt ... [/INST]",
    "max_tokens": 1024,
    "temperature": 0.0
  }'
```

Follow the Prometheus 2 prompt format — the model expects rubric-formatted evaluation prompts.
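As an illustrative sketch of that layout, a judge prompt for absolute grading can be assembled like this. The section headings are an approximation of the Prometheus 2 template; for production use, take the exact templates from the prometheus-eval repository.

```python
# Sketch of a Prometheus-2-style absolute-grading prompt. The section
# headings approximate the official template from the prometheus-eval repo.
def build_judge_prompt(instruction, response, reference, rubric):
    """Assemble an absolute-grading judge prompt in the Mixtral [INST] wrapper."""
    body = (
        "###Task Description:\n"
        "An instruction, a response to evaluate, a reference answer that gets "
        "a score of 5, and a score rubric are given. Write detailed feedback "
        "assessing the response strictly by the rubric, then give a score of "
        "1 to 5. End your answer with: [RESULT] <score>\n\n"
        f"###The instruction to evaluate:\n{instruction}\n\n"
        f"###Response to evaluate:\n{response}\n\n"
        f"###Reference Answer (Score 5):\n{reference}\n\n"
        f"###Score Rubrics:\n{rubric}\n\n"
        "###Feedback:"
    )
    return f"[INST] {body} [/INST]"
```

The resulting string goes into the `prompt` field of the completion request above.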

Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# The compressed-tensors package (installed with llm-compressor) is required
# to deserialize the INT4 weights.
tokenizer = AutoTokenizer.from_pretrained("wjacksonrd/prometheus-8x7b-v2.0-W4A16")
model = AutoModelForCausalLM.from_pretrained(
    "wjacksonrd/prometheus-8x7b-v2.0-W4A16",
    device_map="auto",
    torch_dtype="auto",
)
```

Quantization Recipe

Produced with llm-compressor 0.10.0.1:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",     # quantize every Linear layer...
    scheme="W4A16",       # ...to INT4 weights, FP16 activations
    ignore=["lm_head"],   # except the output head, kept in full precision
)

oneshot(
    model="prometheus-eval/prometheus-8x7b-v2.0",
    dataset=calibration_dataset,  # 128 judge-style rubric prompts (see Limitations)
    recipe=recipe,
    num_calibration_samples=128,
)
```

The MoE expert layers and attention projections are quantized; lm_head is left in full precision.
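The recipe's `targets`/`ignore` selection can be mirrored in a few lines. This is a sketch of the selection logic only; the example module names follow the Hugging Face Mixtral implementation and are assumptions about this checkpoint's layout.

```python
# Mimics GPTQModifier(targets="Linear", ignore=["lm_head"]): every Linear
# module is quantized unless its name matches an ignore pattern.
def is_quantized(name: str, ignore=("lm_head",)) -> bool:
    return not any(pattern in name for pattern in ignore)

# Example module names from the HF Mixtral implementation (assumed layout):
modules = [
    "model.layers.0.self_attn.q_proj",               # attention projection
    "model.layers.0.block_sparse_moe.experts.3.w1",  # MoE expert MLP
    "lm_head",                                       # output head
]
for name in modules:
    print(name, "-> INT4" if is_quantized(name) else "-> FP16")
```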

Limitations

Inherits all limitations of the base model — see the base model card for details.

Additional quantization-specific notes:

  • 4-bit weight quantization introduces small accuracy degradation relative to the FP16 baseline. Validate against your own held-out set before using in production.
  • Calibration was performed on a narrow domain of judge-style rubric prompts; performance in substantially different evaluation domains may differ from the FP16 baseline. If that matters for your use case, calibrate your own build from the FP16 source.

Citation

Please cite the original Prometheus 2 work:

```bibtex
@article{kim2024prometheus2,
  title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
  author={Kim, Seungone and Suk, Juyoung and Longpre, Shayne and Lin, Bill Yuchen and Shin, Jamin and Welleck, Sean and Neubig, Graham and Lee, Moontae and Lee, Kyungjae and Seo, Minjoon},
  journal={arXiv preprint arXiv:2405.01535},
  year={2024}
}
```
