---
base_model:
- EuroBERT/EuroBERT-210m
datasets:
- hgissbkh/BERTJudge-Dataset
language:
- en
library_name: transformers
pipeline_tag: text-classification
---
# BERTJudge


BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.


## Model Summary
- **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://huggingface.co/papers/2604.09497)
- **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
- **Model Type:** Encoder-based judge (EuroBERT-210m backbone)
- **Language:** English


## Intended Use


BERTJudge models are sequence classifiers that output a sigmoid score between 0 and 1 reflecting the correctness of a candidate answer. For inference, we recommend the [BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge) package.


### Installation


```zsh
git clone https://github.com/artefactory/BERT-as-a-Judge.git
cd BERT-as-a-Judge
pip install -e .
```


### Usage


Example:


```python
from bert_judge.judges import BERTJudge

# 1) Initialize the judge
judge = BERTJudge(
    model_path="artefactory/BERTJudge",
    trust_remote_code=True,
    dtype="bfloat16",
)

# 2) Define one question, one reference, and several candidate answers
question = "What is the capital of France?"
reference = "Paris"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I'm hesitating between Paris and London. I would say Paris.",
    "London.",
    "The capital of France is London.",
    "I'm hesitating between Paris and London. I would say London.",
]

# 3) Predict scores (one score per candidate)
scores = judge.predict(
    questions=[question] * len(candidates),
    references=[reference] * len(candidates),
    candidates=candidates,
    batch_size=1,
)

print(scores)
```
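Because each score comes from a sigmoid, it lies in [0, 1] and can be binarized with a simple cutoff when a hard correct/incorrect verdict is needed. A minimal sketch, assuming illustrative scores and a 0.5 threshold (our choice for the example, not prescribed by the package):

```python
# Hypothetical scores, as judge.predict would return (one per candidate)
scores = [0.98, 0.97, 0.91, 0.03, 0.02, 0.08]

# Binarize: treat a candidate as correct when its score exceeds the cutoff
THRESHOLD = 0.5
verdicts = [score > THRESHOLD for score in scores]

for score, verdict in zip(scores, verdicts):
    print(f"{score:.2f} -> {'correct' if verdict else 'incorrect'}")
```

The threshold can be tuned on a validation set if a different precision/recall trade-off is desired.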


## Naming Convention Breakdown


Models follow a standardized naming structure: `BERTJudge-<Candidate_Format>-<Input_Structure>-<Additional_Info>`.


* **Candidate Format:**
    * `Free`: Trained on unconstrained model generations.
    * `Formatted`: Trained on outputs that adhere to specific structural constraints. For best results under the formatted setup, candidate outputs should end with `"Final answer: <final_answer>"` (see the paper for details).
* **Input Structure:**
    * `QCR`: The input sequence consists of [Question, Candidate, Reference].
    * `CR`: The input sequence consists only of [Candidate, Reference].
* **Additional Info:**
    * `OOD`: A variant used to evaluate out-of-distribution performance, trained with specific generative models withheld.
    * `100k`/`200k`/`500k`: The number of training steps (the default regime is 1 million).
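When using a `Formatted` variant, candidates can be normalized before scoring by appending the expected terminator. A minimal sketch; the `format_candidate` helper is our own illustration, not part of the package:

```python
def format_candidate(reasoning: str, final_answer: str) -> str:
    """Append the terminator expected by the Formatted variants
    (this helper is illustrative, not part of the BERT-as-a-Judge API)."""
    reasoning = reasoning.strip()
    suffix = f"Final answer: {final_answer}"
    return f"{reasoning}\n{suffix}" if reasoning else suffix


candidate = format_candidate(
    "The capital of France has been Paris for centuries.", "Paris"
)
print(candidate)
```

The resulting strings can then be passed as the `candidates` argument shown in the usage example above.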


**Note: For optimal evaluation performance, we recommend using `BERTJudge-Free-QCR`, available as `artefactory/BERTJudge`.**


## Citation


If you find this model useful for your research, please consider citing:


```bibtex
@article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
  title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
  author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\'e}line and Colombo, Pierre},
  year={2026},
  eprint={2604.09497},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.09497}
}
```