---
base_model:
- EuroBERT/EuroBERT-210m
datasets:
- hgissbkh/BERTJudge-Dataset
language:
- en
library_name: transformers
pipeline_tag: text-classification
---
# BERTJudge


BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.


## Model Summary
- **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://huggingface.co/papers/2604.09497)
- **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
- **Model Type:** Encoder-based judge (EuroBERT-210m backbone)
- **Language:** English


## Intended Use


BERTJudge models are sequence classifiers that output a sigmoid score between 0 and 1 reflecting the correctness of a candidate answer. For inference, we recommend the [BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge) package.


### Installation


```zsh
git clone https://github.com/artefactory/BERT-as-a-Judge.git
cd BERT-as-a-Judge
pip install -e .
```


### Usage


Example:


```python
from bert_judge.judges import BERTJudge

# 1) Initialize the judge
judge = BERTJudge(
    model_path="artefactory/BERTJudge",
    trust_remote_code=True,
    dtype="bfloat16",
)

# 2) Define one question, one reference, and several candidate answers
question = "What is the capital of France?"
reference = "Paris"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I'm hesitating between Paris and London. I would say Paris.",
    "London.",
    "The capital of France is London.",
    "I'm hesitating between Paris and London. I would say London.",
]

# 3) Predict scores (one score per candidate)
scores = judge.predict(
    questions=[question] * len(candidates),
    references=[reference] * len(candidates),
    candidates=candidates,
    batch_size=1,
)

print(scores)
```
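Because each score comes from a sigmoid, it lies in [0, 1] and can be binarized with a simple cutoff when a hard correct/incorrect verdict is needed. A minimal sketch, assuming illustrative scores and a 0.5 threshold (our choice for the example, not prescribed by the package):

```python
# Hypothetical scores, as judge.predict would return (one per candidate)
scores = [0.98, 0.97, 0.91, 0.03, 0.02, 0.08]

# Binarize: treat a candidate as correct when its score exceeds the cutoff
THRESHOLD = 0.5
verdicts = [score > THRESHOLD for score in scores]

for score, verdict in zip(scores, verdicts):
    print(f"{score:.2f} -> {'correct' if verdict else 'incorrect'}")
```

The threshold can be tuned on a validation set if a different precision/recall trade-off is desired.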


## Naming Convention Breakdown


Models follow a standardized naming structure: `BERTJudge-<Candidate_Format>-<Input_Structure>-<Additional_Info>`.


* **Candidate Format:**
    * `Free`: Trained on unconstrained model generations.
    * `Formatted`: Trained on outputs that adhere to specific structural constraints. For best results under the formatted setup, candidate outputs should end with `"Final answer: <final_answer>"` (see the paper for details).
* **Input Structure:**
    * `QCR`: The input sequence consists of [Question, Candidate, Reference].
    * `CR`: The input sequence consists only of [Candidate, Reference].
* **Additional Info:**
    * `OOD`: A variant used to evaluate out-of-distribution performance, trained with specific generative models withheld.
    * `100k`/`200k`/`500k`: The number of training steps (the default regime is 1 million).
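When using a `Formatted` variant, candidates can be normalized before scoring by appending the expected terminator. A minimal sketch; the `format_candidate` helper is our own illustration, not part of the package:

```python
def format_candidate(reasoning: str, final_answer: str) -> str:
    """Append the terminator expected by the Formatted variants
    (this helper is illustrative, not part of the BERT-as-a-Judge API)."""
    reasoning = reasoning.strip()
    suffix = f"Final answer: {final_answer}"
    return f"{reasoning}\n{suffix}" if reasoning else suffix


candidate = format_candidate(
    "The capital of France has been Paris for centuries.", "Paris"
)
print(candidate)
```

The resulting strings can then be passed as the `candidates` argument shown in the usage example above.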


**Note: For optimal evaluation performance, we recommend using `BERTJudge-Free-QCR`, available as `artefactory/BERTJudge`.**


## Citation


If you find this model useful for your research, please consider citing:


```bibtex
@article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
  title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
  author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\'e}line and Colombo, Pierre},
  year={2026},
  eprint={2604.09497},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.09497}
}
```