---
base_model:
- EuroBERT/EuroBERT-210m
datasets:
- hgissbkh/BERTJudge-Dataset
language:
- en
library_name: transformers
pipeline_tag: text-classification
---
# BERTJudge

BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.

## Model Summary
- **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://huggingface.co/papers/2604.09497)
- **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
- **Model Type:** Encoder-based Judge (EuroBERT-210m backbone)
- **Language:** English

## Intended Use

BERTJudge models are sequence classifiers that output a sigmoid score reflecting answer correctness. For inference, we recommend the [BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge) package.
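As a minimal sketch of how these sigmoid scores might be consumed downstream (the 0.5 threshold and the helper name are illustrative assumptions, not values prescribed by the model):

```python
# Illustrative sketch: converting BERTJudge sigmoid scores into binary
# correctness verdicts. The 0.5 threshold is an assumption for illustration,
# not a value prescribed by the model card or paper.
def scores_to_verdicts(scores: list[float], threshold: float = 0.5) -> list[bool]:
    """Mark a candidate as correct when its score clears the threshold."""
    return [score >= threshold for score in scores]

# Example scores for three candidates
print(scores_to_verdicts([0.93, 0.12, 0.78]))
```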

### Installation

```zsh
git clone https://github.com/artefactory/BERT-as-a-Judge.git
cd BERT-as-a-Judge
pip install -e .
```

### Usage

Example:

```python
from bert_judge.judges import BERTJudge

# 1) Initialize the judge
judge = BERTJudge(
    model_path="artefactory/BERTJudge",
    trust_remote_code=True,
    dtype="bfloat16",
)

# 2) Define one question, one reference, and several candidate answers
question = "What is the capital of France?"
reference = "Paris"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I'm hesitating between Paris and London. I would say Paris.",
    "London.",
    "The capital of France is London.",
    "I'm hesitating between Paris and London. I would say London.",
]

# 3) Predict scores (one score per candidate)
scores = judge.predict(
    questions=[question] * len(candidates),
    references=[reference] * len(candidates),
    candidates=candidates,
    batch_size=1,
)

print(scores)
```

## Naming Convention Breakdown

Models follow a standardized naming structure: `BERTJudge-<Candidate_Format>-<Input_Structure>-<Additional_Info>`.

* **Candidate Format:**
  * `Free`: Trained on unconstrained model generations.
  * `Formatted`: Trained on outputs that adhere to specific structural constraints. For optimized evaluation under the formatted setup, candidate outputs should ideally conclude with `"Final answer: <final_answer>"` (see the paper for details).
* **Input Structure:**
  * `QCR`: The input sequence consists of [Question, Candidate, Reference].
  * `CR`: The input sequence consists only of [Candidate, Reference].
* **Additional Info:**
  * `OOD`: Indicates an out-of-distribution variant, trained with specific generative models withheld so that their outputs can be evaluated as unseen data.
  * `100k/200k/500k`: Denotes the number of training steps (the default regime is 1 million).
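
The `Formatted` candidate convention above can be sketched as a small helper (the helper name is hypothetical and not part of the BERT-as-a-Judge package; only the `"Final answer: <final_answer>"` suffix comes from the paper):

```python
# Illustrative sketch of the "Formatted" candidate convention: under this
# setup, candidate outputs should conclude with "Final answer: <final_answer>".
# The helper name is hypothetical, not a BERT-as-a-Judge API.
def to_formatted_candidate(reasoning: str, final_answer: str) -> str:
    """Append the final-answer marker expected by Formatted BERTJudge variants."""
    return f"{reasoning.rstrip()}\nFinal answer: {final_answer}"

candidate = to_formatted_candidate(
    "The capital of France has been Paris for centuries.", "Paris"
)
print(candidate)
```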

**Note: For optimal evaluation performance, we recommend using `BERTJudge-Free-QCR`, available as `artefactory/BERTJudge`.**

## Citation

If you find this model useful for your research, please consider citing:

```bibtex
@article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
  title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
  author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\'e}line and Colombo, Pierre},
  year={2026},
  eprint={2604.09497},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.09497}
}
```