Add pipeline tag, library name and improve model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +21 -18
README.md CHANGED
@@ -1,17 +1,23 @@
1
  ---
 
 
2
  datasets:
3
  - hgissbkh/BERTJudge-Dataset
4
  language:
5
  - en
6
- base_model:
7
- - EuroBERT/EuroBERT-210m
 
8
  ---
 
9
  # BERTJudge-Formatted-CR
10
 
11
  BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.
12
 
 
 
13
  ## Model Summary
14
- - **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://arxiv.org/abs/2604.09497)
15
  - **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
16
  - **Model Type:** Encoder-based Judge (EuroBERT-210m backbone)
17
  - **Language:** English
@@ -22,7 +28,7 @@ BERTJudge models are designed as sequence classifiers that output a sigmoid scor
22
 
23
  ### Installation
24
 
25
- ```zsh
26
  git clone https://github.com/artefactory/BERT-as-a-Judge.git
27
  cd BERT-as-a-Judge
28
  pip install -e .
@@ -30,33 +36,30 @@ pip install -e .
30
 
31
  ### Usage
32
 
33
- Example:
34
 
35
  ```python
36
  from bert_judge.judges import BERTJudge
37
 
38
  # 1) Initialize the judge
39
  judge = BERTJudge(
40
- model_path="artefactory/BERTJudge",
41
  trust_remote_code=True,
42
  dtype="bfloat16",
43
  )
44
 
45
- # 2) Define one question, one reference, and several candidate answers
46
- question = "What is the capital of France?"
47
  reference = "Paris"
48
  candidates = [
49
  "Paris.",
50
  "The capital of France is Paris.",
51
- "I'm hesitating between Paris and London. I would say Paris.",
52
  "London.",
53
- "The capital of France is London.",
54
- "I'm hesitating between Paris and London. I would say London.",
55
  ]
56
 
57
  # 3) Predict scores (one score per candidate)
58
  scores = judge.predict(
59
- questions=[question] * len(candidates),
60
  references=[reference] * len(candidates),
61
  candidates=candidates,
62
  batch_size=1,
@@ -71,24 +74,24 @@ Models follow a standardized naming structure: `BERTJudge-<Candidate_Format>-<In
71
 
72
  * **Candidate Format:**
73
  * `Free`: Trained on unconstrained model generations.
74
- * `Formatted`: Trained on outputs that adhere to specific structural constraints. For optimized evaluation under the formatted setup, candidate outputs should ideally conclude with `"Final answer: <final_answer>"` (see the paper for details).
75
  * **Input Structure:**
76
  * `QCR`: The input sequence consists of [Question, Candidate, Reference].
77
  * `CR`: The input sequence consists only of [Candidate, Reference].
78
  * **Additional Info:**
79
- * `OOD`: Indicates evaluation of Out-of-Distribution performance (where specific generative models were withheld during training).
80
- * `100k/200k/500k`: Denotes the total training steps (default regime being 1 million).
81
 
82
- **Note: For optimal evaluation performance, we recommend using `BERTJudge-Free-QCR`, available as `artefactory/BERTJudge`.**
83
 
84
  ## Citation
85
 
86
  If you find this model useful for your research, please consider citing:
87
 
88
- ```
89
  @article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
90
  title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
91
- author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\'e}line and Colombo, Pierre},
92
  year={2026},
93
  eprint={2604.09497},
94
  archivePrefix={arXiv},
 
1
  ---
2
+ base_model:
3
+ - EuroBERT/EuroBERT-210m
4
  datasets:
5
  - hgissbkh/BERTJudge-Dataset
6
  language:
7
  - en
8
+ library_name: transformers
9
+ pipeline_tag: text-classification
10
+ license: apache-2.0
11
  ---
12
+
13
  # BERTJudge-Formatted-CR
14
 
15
  BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. Moving beyond rigid lexical extraction and matching, these models evaluate semantic correctness, accommodating variations in phrasing and formatting while using only a fraction of the computational resources required by LLM-as-a-Judge approaches.
16
 
17
+ This specific variant, **BERTJudge-Formatted-CR**, is optimized for evaluating candidate answers that adhere to specific structural constraints (formatted) and utilizes the **[Candidate, Reference]** input structure.
18
+
19
  ## Model Summary
20
+ - **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](https://huggingface.co/papers/2604.09497)
21
  - **Code:** [https://github.com/artefactory/BERT-as-a-Judge](https://github.com/artefactory/BERT-as-a-Judge)
22
  - **Model Type:** Encoder-based Judge (EuroBERT-210m backbone)
23
  - **Language:** English
 
28
 
29
  ### Installation
30
 
31
+ ```bash
32
  git clone https://github.com/artefactory/BERT-as-a-Judge.git
33
  cd BERT-as-a-Judge
34
  pip install -e .
 
36
 
37
  ### Usage
38
 
39
+ Example using the `bert_judge` library:
40
 
41
  ```python
42
  from bert_judge.judges import BERTJudge
43
 
44
  # 1) Initialize the judge
45
  judge = BERTJudge(
46
+ model_path="hgissbkh/BERTJudge-Formatted-CR",
47
  trust_remote_code=True,
48
  dtype="bfloat16",
49
  )
50
 
51
+ # 2) Define a reference and several candidate answers
52
+ # Note: For CR models, the question is not used in the sequence
53
  reference = "Paris"
54
  candidates = [
55
  "Paris.",
56
  "The capital of France is Paris.",
 
57
  "London.",
 
 
58
  ]
59
 
60
  # 3) Predict scores (one score per candidate)
61
  scores = judge.predict(
62
+ questions=[""] * len(candidates),
63
  references=[reference] * len(candidates),
64
  candidates=candidates,
65
  batch_size=1,
 
74
 
75
  * **Candidate Format:**
76
  * `Free`: Trained on unconstrained model generations.
77
+ * `Formatted`: Trained on outputs that adhere to specific structural constraints (ideally concluding with `"Final answer: <final_answer>"`).
78
  * **Input Structure:**
79
  * `QCR`: The input sequence consists of [Question, Candidate, Reference].
80
  * `CR`: The input sequence consists only of [Candidate, Reference].
81
  * **Additional Info:**
82
+ * `OOD`: Indicates evaluation of Out-of-Distribution performance.
83
+ * `100k/200k/500k`: Denotes the total training steps (default is 1 million).
84
 
85
+ **Note: For optimal general evaluation performance, the authors recommend using `BERTJudge-Free-QCR`, available as `artefactory/BERTJudge`.**
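
As an illustration of the naming structure listed above, the following minimal sketch splits a model name into its candidate format, input structure, and optional additional info. It assumes the segments appear in the order the bullets give them and are separated by hyphens. `JudgeVariant` and `parse_variant` are hypothetical names introduced here for illustration; they are not part of the `bert_judge` library.

```python
from dataclasses import dataclass
from typing import Optional

# Values taken from the bullets in the model card.
CANDIDATE_FORMATS = {"Free", "Formatted"}
INPUT_STRUCTURES = {"QCR", "CR"}


@dataclass(frozen=True)
class JudgeVariant:
    """Hypothetical container for the parts of a BERTJudge model name."""

    candidate_format: str
    input_structure: str
    additional_info: Optional[str] = None

    @property
    def uses_question(self) -> bool:
        # Only QCR variants include the question in the input sequence.
        return self.input_structure == "QCR"


def parse_variant(model_name: str) -> JudgeVariant:
    """Parse a name such as 'BERTJudge-Formatted-CR' (illustrative helper)."""
    # Drop an optional 'owner/' prefix from a Hub repository id.
    short_name = model_name.split("/")[-1]
    parts = short_name.split("-")
    if len(parts) < 3 or parts[0] != "BERTJudge":
        raise ValueError(f"Unrecognized model name: {model_name!r}")

    candidate_format, input_structure = parts[1], parts[2]
    if candidate_format not in CANDIDATE_FORMATS:
        raise ValueError(f"Unknown candidate format: {candidate_format!r}")
    if input_structure not in INPUT_STRUCTURES:
        raise ValueError(f"Unknown input structure: {input_structure!r}")

    # Anything after the input structure (e.g. 'OOD', '100k') is optional.
    additional_info = "-".join(parts[3:]) or None
    return JudgeVariant(candidate_format, input_structure, additional_info)


if __name__ == "__main__":
    variant = parse_variant("hgissbkh/BERTJudge-Formatted-CR")
    print(variant, variant.uses_question)
```

The short alias `artefactory/BERTJudge` does not spell out its variant, so this sketch rejects it with a `ValueError`; a caller would need to map that alias to `BERTJudge-Free-QCR` separately.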
86
 
87
  ## Citation
88
 
89
  If you find this model useful for your research, please consider citing:
90
 
91
+ ```bibtex
92
  @article{gisserotboukhlef2026bertasajudgerobustalternativelexical,
93
  title={BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation},
94
+ author={Gisserot-Boukhlef, Hippolyte and Boizard, Nicolas and Malherbe, Emmanuel and Hudelot, C{\\'e}line and Colombo, Pierre},
95
  year={2026},
96
  eprint={2604.09497},
97
  archivePrefix={arXiv},