BERTJudge-Formatted-QCR

BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.

Model Summary

Intended Use

BERTJudge models are sequence classifiers that output a sigmoid score in [0, 1] reflecting the correctness of a candidate answer against a reference. For inference, we suggest using the BERT-as-a-Judge package.
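To make the scoring concrete, here is a minimal sketch of how a raw classifier logit maps to the [0, 1] score via a sigmoid. The `correctness_score` helper is purely illustrative and not part of the package; the actual model applies this internally.

```python
import math

def correctness_score(logit: float) -> float:
    """Map a raw sequence-classifier logit to a [0, 1] correctness score.

    Illustrative only: the BERTJudge model applies this sigmoid internally.
    """
    return 1.0 / (1.0 + math.exp(-logit))

# A positive logit means the judge leans "correct"; 0.0 is the decision boundary.
boundary = correctness_score(0.0)   # 0.5
confident = correctness_score(4.0)  # close to 1.0
```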

Installation

git clone https://github.com/artefactory/BERT-as-a-Judge.git
cd BERT-as-a-Judge
pip install -e .

Usage

Example:

from bert_judge.judges import BERTJudge

# 1) Initialize the judge
judge = BERTJudge(
    model_path="artefactory/BERTJudge",
    trust_remote_code=True,
    dtype="bfloat16",
)

# 2) Define one question, one reference, and several candidate answers
question = "What is the capital of France?"
reference = "Paris"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I'm hesitating between Paris and London. I would say Paris.",
    "London.",
    "The capital of France is London.",
    "I'm hesitating between Paris and London. I would say London.",
]

# 3) Predict scores (one score per candidate)
scores = judge.predict(
    questions=[question] * len(candidates),
    references=[reference] * len(candidates),
    candidates=candidates,
    batch_size=1,
)

print(scores)
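Downstream, the scores are typically binarized into correct/incorrect verdicts. The sketch below assumes `predict` returns one float in [0, 1] per candidate (the score values and the 0.5 threshold are illustrative assumptions, not package defaults):

```python
# Hypothetical scores, shaped like BERTJudge.predict output: one float per candidate.
scores = [0.97, 0.95, 0.81, 0.03, 0.02, 0.12]

THRESHOLD = 0.5  # assumed decision boundary; tune on a validation set if needed
verdicts = ["correct" if s >= THRESHOLD else "incorrect" for s in scores]
accuracy = sum(s >= THRESHOLD for s in scores) / len(scores)
```

With the six candidates from the example above (three naming Paris, three naming London), this yields three "correct" and three "incorrect" verdicts.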

Naming Convention Breakdown

Models follow a standardized naming structure: BERTJudge-<Candidate_Format>-<Input_Structure>-<Additional_Info>.

  • Candidate Format:
    • Free: Trained on unconstrained model generations.
    • Formatted: Trained on outputs that adhere to specific structural constraints. For optimal evaluation under the formatted setup, candidate outputs should ideally conclude with "Final answer: <final_answer>" (see the paper for details).
  • Input Structure:
    • QCR: The input sequence consists of [Question, Candidate, Reference].
    • CR: The input sequence consists only of [Candidate, Reference].
  • Additional Info:
    • OOD: Trained with specific generative models withheld, for evaluating out-of-distribution performance.
    • 100k/200k/500k: Total number of training steps (the default regime is 1 million steps).
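The QCR/CR distinction and the formatted-candidate convention can be sketched as follows. The `" [SEP] "` separator and the concatenation order are assumptions for illustration; the package handles the real tokenization and special tokens.

```python
from typing import Optional

def build_input(candidate: str, reference: str,
                question: Optional[str] = None,
                sep: str = " [SEP] ") -> str:
    """Assemble a judge input: QCR when a question is given, CR otherwise.

    Illustrative sketch only; the separator token is an assumption.
    """
    parts = ([question] if question is not None else []) + [candidate, reference]
    return sep.join(parts)

# Formatted setup: the candidate concludes with "Final answer: <final_answer>".
qcr = build_input("Final answer: Paris", "Paris",
                  question="What is the capital of France?")
cr = build_input("Final answer: Paris", "Paris")
```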

Note: For optimal evaluation performance, we recommend using BERTJudge-Free-QCR, available as artefactory/BERTJudge.

Citation

If you find this model useful for your research, please consider citing:

@article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
  title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
  author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\'e}line and Colombo, Pierre},
  year={2026},
  eprint={2604.09497},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.09497}
}
Model size: 0.2B parameters (Safetensors, BF16).