---
base_model:
- EuroBERT/EuroBERT-210m
datasets:
- hgissbkh/BERTJudge-Dataset
language:
- en
library_name: transformers
pipeline_tag: text-classification
---
# BERTJudge
BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.
## Model Summary
- **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://huggingface.co/papers/2604.09497)
- **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
- **Model Type:** Encoder-based Judge (EuroBERT-210m backbone)
- **Language:** English
## Intended Use
BERTJudge models are sequence classifiers that output a sigmoid score in [0, 1] reflecting the correctness of a candidate answer with respect to a reference. For inference, we recommend the [BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge) package.
### Installation
```zsh
git clone https://github.com/artefactory/BERT-as-a-Judge.git
cd BERT-as-a-Judge
pip install -e .
```
### Usage
Example:
```python
from bert_judge.judges import BERTJudge

# 1) Initialize the judge
judge = BERTJudge(
    model_path="artefactory/BERTJudge",
    trust_remote_code=True,
    dtype="bfloat16",
)

# 2) Define one question, one reference, and several candidate answers
question = "What is the capital of France?"
reference = "Paris"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I'm hesitating between Paris and London. I would say Paris.",
    "London.",
    "The capital of France is London.",
    "I'm hesitating between Paris and London. I would say London.",
]

# 3) Predict scores (one score per candidate)
scores = judge.predict(
    questions=[question] * len(candidates),
    references=[reference] * len(candidates),
    candidates=candidates,
    batch_size=1,
)
print(scores)
```
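Since each score is a sigmoid output in [0, 1], one simple way to turn scores into binary correct/incorrect verdicts is to threshold them. The 0.5 cutoff and the helper below are illustrative assumptions, not part of the package; in practice you may want to tune the threshold on a validation set:

```python
def to_verdicts(scores, threshold=0.5):
    """Map sigmoid correctness scores to binary verdicts.

    The 0.5 threshold is an illustrative default; tune it if your
    use case has asymmetric precision/recall requirements.
    """
    return [score >= threshold for score in scores]

# Hypothetical scores for six candidates: high for correct answers,
# low for incorrect ones.
example_scores = [0.98, 0.97, 0.85, 0.02, 0.03, 0.10]
print(to_verdicts(example_scores))
```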
## Naming Convention Breakdown
Models follow a standardized naming structure: `BERTJudge-<Candidate_Format>-<Input_Structure>-<Additional_Info>`.
* **Candidate Format:**
* `Free`: Trained on unconstrained model generations.
* `Formatted`: Trained on outputs that adhere to specific structural constraints. For optimized evaluation under the formatted setup, candidate outputs should ideally conclude with `"Final answer: <final_answer>"` (see the paper for details).
* **Input Structure:**
* `QCR`: The input sequence consists of [Question, Candidate, Reference].
* `CR`: The input sequence consists only of [Candidate, Reference].
* **Additional Info:**
* `OOD`: Indicates evaluation of Out-of-Distribution performance (where specific generative models were withheld during training).
* `100k/200k/500k`: Denotes the total training steps (default regime being 1 million).
**Note: For optimal evaluation performance, we recommend using `BERTJudge-Free-QCR`, available as `artefactory/BERTJudge`.**
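For the `Formatted` variants, candidates are expected to end with the `"Final answer: <final_answer>"` marker described above. A minimal sketch of wrapping a model generation accordingly (the helper name is ours, not part of the package):

```python
def format_candidate(reasoning: str, final_answer: str) -> str:
    """Append the final-answer marker expected by Formatted judges."""
    return f"{reasoning}\nFinal answer: {final_answer}"

candidate = format_candidate(
    "The capital of France is Paris.",
    "Paris",
)
print(candidate)
```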
## Citation
If you find this model useful for your research, please consider citing:
```bibtex
@article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\'e}line and Colombo, Pierre},
year={2026},
eprint={2604.09497},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.09497}
}
```