---
datasets:
- hgissbkh/BERTJudge-Dataset
language:
- en
base_model:
- EuroBERT/EuroBERT-210m
---

# BERTJudge-Formatted-CR

BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.

## Model Summary

- **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://arxiv.org/abs/2604.09497)
- **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
- **Model Type:** Encoder-based judge (EuroBERT-210m backbone)
- **Language:** English

## Intended Use

BERTJudge models are sequence classifiers that output a sigmoid score reflecting answer correctness. For inference, we suggest using the [BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge) package.

### Installation

```zsh
git clone https://github.com/artefactory/BERT-as-a-Judge.git
cd BERT-as-a-Judge
pip install -e .
```

### Usage Example

```python
from bert_judge.judges import BERTJudge

# 1) Initialize the judge
judge = BERTJudge(
    model_path="artefactory/BERTJudge",
    trust_remote_code=True,
    dtype="bfloat16",
)

# 2) Define one question, one reference, and several candidate answers
question = "What is the capital of France?"
reference = "Paris"
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I'm hesitating between Paris and London. I would say Paris.",
    "London.",
    "The capital of France is London.",
    "I'm hesitating between Paris and London. I would say London.",
]

# 3) Predict scores (one score per candidate)
scores = judge.predict(
    questions=[question] * len(candidates),
    references=[reference] * len(candidates),
    candidates=candidates,
    batch_size=1,
)
print(scores)
```

## Naming Convention Breakdown

Models follow a standardized naming structure: `BERTJudge-<CandidateFormat>-<InputStructure>-<AdditionalInfo>`.

* **Candidate Format:**
  * `Free`: Trained on unconstrained model generations.
  * `Formatted`: Trained on outputs that adhere to specific structural constraints. For optimal evaluation under the formatted setup, candidate outputs should ideally conclude with `"Final answer: "` (see the paper for details).
* **Input Structure:**
  * `QCR`: The input sequence consists of [Question, Candidate, Reference].
  * `CR`: The input sequence consists only of [Candidate, Reference].
* **Additional Info:**
  * `OOD`: Indicates evaluation of out-of-distribution performance (specific generative models were withheld during training).
  * `100k`/`200k`/`500k`: Denotes the total number of training steps (the default regime is 1 million steps).

**Note: For optimal evaluation performance, we recommend using `BERTJudge-Free-QCR`, available as `artefactory/BERTJudge`.**

## Citation

If you find this model useful for your research, please consider citing:

```
@article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
      title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
      author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\'e}line and Colombo, Pierre},
      year={2026},
      eprint={2604.09497},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.09497}
}
```
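
## Appendix: Preparing Candidates for `Formatted` Variants

The naming convention above notes that `Formatted` variants work best when candidate outputs conclude with a `"Final answer: "` line. A minimal sketch of normalizing free-form generations into that shape — the `format_candidate` helper and its suffix-detection logic are illustrative assumptions, not part of the `bert_judge` package:

```python
def format_candidate(generation: str, final_answer: str) -> str:
    """Append a 'Final answer: ...' line unless one is already present.

    Hypothetical helper for illustration only; see the paper for the
    exact formatting constraints used during training.
    """
    if "Final answer:" in generation:
        return generation
    return f"{generation.rstrip()}\nFinal answer: {final_answer}"


# Usage: normalize candidates before scoring with a Formatted judge.
candidates = [
    format_candidate("The capital of France is Paris.", "Paris"),
    format_candidate("Could be London... Final answer: Paris", "Paris"),
]
print(candidates[0])
```

Generations that already contain the suffix are passed through unchanged, so the helper is safe to apply uniformly to a mixed batch.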