Add pipeline tag, library name and improve model card
Hi, I'm Niels from the Hugging Face community science team.
I've opened this PR to improve the model card for **BERTJudge-Formatted-CR**:
- Added the `text-classification` pipeline tag and `transformers` library name to ensure the model is correctly categorized and the "Use in Transformers" button works.
- Added `license: apache-2.0` metadata.
- Improved the Markdown content by linking the [original paper](https://huggingface.co/papers/2604.09497) and the [GitHub repository](https://github.com/artefactory/BERT-as-a-Judge).
- Added a usage section based on the provided GitHub documentation to help researchers get started with the `bert-judge` library.
Please let me know if you'd like me to adjust anything!
@@ -1,17 +1,23 @@
 ---
 datasets:
 - hgissbkh/BERTJudge-Dataset
 language:
 - en
-
-
 ---
 # BERTJudge-Formatted-CR

 BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.

 ## Model Summary
-- **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://
 - **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
 - **Model Type:** Encoder-based Judge (EuroBERT-210m backbone)
 - **Language:** English
@@ -22,7 +28,7 @@ BERTJudge models are designed as sequence classifiers that output a sigmoid scor

 ### Installation

-```
 git clone https://github.com/artefactory/BERT-as-a-Judge.git
 cd BERT-as-a-Judge
 pip install -e .
@@ -30,33 +36,30 @@ pip install -e .

 ### Usage

-Example:

 ```python
 from bert_judge.judges import BERTJudge

 # 1) Initialize the judge
 judge = BERTJudge(
-    model_path="
     trust_remote_code=True,
     dtype="bfloat16",
 )

-# 2) Define
-
 reference = "Paris"
 candidates = [
     "Paris.",
     "The capital of France is Paris.",
-    "I'm hesitating between Paris and London. I would say Paris.",
     "London.",
-    "The capital of France is London.",
-    "I'm hesitating between Paris and London. I would say London.",
 ]

 # 3) Predict scores (one score per candidate)
 scores = judge.predict(
-    questions=[
     references=[reference] * len(candidates),
     candidates=candidates,
     batch_size=1,
@@ -71,24 +74,24 @@ Models follow a standardized naming structure: `BERTJudge-<Candidate_Format>-<In

 * **Candidate Format:**
     * `Free`: Trained on unconstrained model generations.
-    * `Formatted`: Trained on outputs that adhere to specific structural constraints
 * **Input Structure:**
     * `QCR`: The input sequence consists of [Question, Candidate, Reference].
     * `CR`: The input sequence consists only of [Candidate, Reference].
 * **Additional Info:**
-    * `OOD`: Indicates evaluation of Out-of-Distribution performance
-    * `100k/200k/500k`: Denotes the total training steps (default

-**Note: For optimal evaluation performance,

 ## Citation

 If you find this model useful for your research, please consider citing:

-```
 @article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
     title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
-    author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\'e}line and Colombo, Pierre},
     year={2026},
     eprint={2604.09497},
     archivePrefix={arXiv},
@@ -1,17 +1,23 @@
 ---
+base_model:
+- EuroBERT/EuroBERT-210m
 datasets:
 - hgissbkh/BERTJudge-Dataset
 language:
 - en
+library_name: transformers
+pipeline_tag: text-classification
+license: apache-2.0
 ---
+
 # BERTJudge-Formatted-CR

 BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.

+This specific variant, **BERTJudge-Formatted-CR**, is optimized for evaluating candidate answers that adhere to specific structural constraints (formatted) and utilizes the **[Candidate, Reference]** input structure.
+
 ## Model Summary
+- **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://huggingface.co/papers/2604.09497)
 - **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
 - **Model Type:** Encoder-based Judge (EuroBERT-210m backbone)
 - **Language:** English
@@ -22,7 +28,7 @@

 ### Installation

+```bash
 git clone https://github.com/artefactory/BERT-as-a-Judge.git
 cd BERT-as-a-Judge
 pip install -e .
@@ -30,33 +36,30 @@

 ### Usage

+Example using the `bert_judge` library:

 ```python
 from bert_judge.judges import BERTJudge

 # 1) Initialize the judge
 judge = BERTJudge(
+    model_path="hgissbkh/BERTJudge-Formatted-CR",
     trust_remote_code=True,
     dtype="bfloat16",
 )

+# 2) Define a reference and several candidate answers
+# Note: For CR models, the question is not used in the sequence
 reference = "Paris"
 candidates = [
     "Paris.",
     "The capital of France is Paris.",
     "London.",
 ]

 # 3) Predict scores (one score per candidate)
 scores = judge.predict(
+    questions=[""] * len(candidates),
     references=[reference] * len(candidates),
     candidates=candidates,
     batch_size=1,
@@ -71,24 +74,24 @@

 * **Candidate Format:**
     * `Free`: Trained on unconstrained model generations.
+    * `Formatted`: Trained on outputs that adhere to specific structural constraints (ideally concluding with `"Final answer: <final_answer>"`).
 * **Input Structure:**
     * `QCR`: The input sequence consists of [Question, Candidate, Reference].
     * `CR`: The input sequence consists only of [Candidate, Reference].
 * **Additional Info:**
+    * `OOD`: Indicates evaluation of Out-of-Distribution performance.
+    * `100k/200k/500k`: Denotes the total training steps (default is 1 million).

+**Note: For optimal general evaluation performance, the authors recommend using `BERTJudge-Free-QCR`, available as `artefactory/BERTJudge`.**

 ## Citation

 If you find this model useful for your research, please consider citing:

+```bibtex
 @article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
     title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
+    author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\\'e}line and Colombo, Pierre},
     year={2026},
     eprint={2604.09497},
     archivePrefix={arXiv},
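
The naming convention listed in the model card (candidate format, input structure, optional additional info) can be decoded mechanically. The following is a minimal sketch and not part of the `bert_judge` library: `JudgeName` and `parse_judge_name` are hypothetical names introduced here, and the sketch assumes the parts are hyphen-separated in the order the list gives them.

```python
from dataclasses import dataclass
from typing import Optional

# Values taken from the naming list in the model card.
CANDIDATE_FORMATS = ("Free", "Formatted")
INPUT_STRUCTURES = ("QCR", "CR")


@dataclass(frozen=True)
class JudgeName:
    """Hypothetical container for the parts of a BERTJudge model name."""

    candidate_format: str
    input_structure: str
    additional_info: Optional[str] = None


def parse_judge_name(name: str) -> JudgeName:
    """Split e.g. 'artefactory/BERTJudge-Formatted-CR' into its parts.

    Assumption: parts are hyphen-separated, in the order
    BERTJudge, candidate format, input structure, optional extra info.
    """
    # Drop an optional "owner/" prefix from a Hub-style repository id.
    short_name = name.rsplit("/", 1)[-1]
    parts = short_name.split("-")
    if len(parts) < 3 or parts[0] != "BERTJudge":
        raise ValueError(f"not a structured BERTJudge name: {name!r}")

    candidate_format, input_structure = parts[1], parts[2]
    if candidate_format not in CANDIDATE_FORMATS:
        raise ValueError(f"unknown candidate format: {candidate_format!r}")
    if input_structure not in INPUT_STRUCTURES:
        raise ValueError(f"unknown input structure: {input_structure!r}")

    # Anything after the input structure (e.g. "OOD", "100k") is kept verbatim.
    additional_info = "-".join(parts[3:]) or None
    return JudgeName(candidate_format, input_structure, additional_info)


print(parse_judge_name("artefactory/BERTJudge-Formatted-CR"))
```

Names without the structured suffix, such as the `artefactory/BERTJudge` alias mentioned in the note, raise `ValueError` in this sketch, so a caller would need to handle that alias separately.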