# BERT-as-a-Judge: A Robust Alternative for LLM Evaluation

BERT-as-a-Judge is a family of encoder-based models designed for efficient, reference-based evaluation of LLM outputs. By moving beyond rigid lexical matching (such as Exact Match or ROUGE), these models assess **semantic correctness**, tolerating variations in phrasing and formatting at a fraction of the computational cost of LLM-as-a-Judge approaches.
## Model Summary

- **Paper:** [BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation](URL_TO_PAPER)
- **Model Type:** Encoder-based Judge (EuroBERT-210m backbone)
- **Language:** English

---
## Model Variations & Collection Overview

The models are named using the convention: `BERTJudge-<Output_Guidelines>-<Input_Format>-<Additional_Info>`.

### Naming Convention Breakdown

* **Output Guidelines:**
  * `Free`: Trained on unconstrained model outputs.
  * `Formatted`: Trained on outputs constrained by specific instructions (e.g., "Conclude with Answer: [X]").
* **Input Format:**
  * `QCR`: Input contains [Question, Candidate, Reference].
  * `CR`: Input contains only [Candidate, Reference].
* **Additional Info:**
  * `OOD`: Evaluates out-of-distribution performance (certain generative models are excluded from training).
  * `100k`/`200k`/`500k`: Number of training steps (the default is 1 million).
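To make the input-format distinction concrete, here is a minimal sketch of how the two formats could be serialized into a single input string. The `QCR` template mirrors the usage example below; the `CR` template simply drops the question. The helper name and the exact field labels for the `CR` case are illustrative assumptions, not a documented specification.

```python
def build_judge_input(reference, candidate, question=None):
    """Serialize an evaluation example for a BERTJudge model.

    QCR models see [Question, Candidate, Reference]; CR models see only
    [Candidate, Reference]. Field labels here are illustrative.
    """
    if question is not None:
        # QCR format: include the original question
        return f"Question: {question} Reference: {reference} Candidate: {candidate}"
    # CR format: candidate and reference only
    return f"Reference: {reference} Candidate: {candidate}"
```

For a QCR model, pass the question; for a CR model, omit it and the template degrades gracefully.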
### Model Selection Table

| Model Name | Input Format | Guidelines | Training Steps | OOD Tested |
| :--- | :---: | :---: | :---: | :---: |
| **BERTJudge-Free-QCR** | QCR | Free | 1M | No |
| **BERTJudge-Formatted-QCR** | QCR | Formatted | 1M | No |
| **BERTJudge-Free-CR** | CR | Free | 1M | No |
| **BERTJudge-Free-QCR-OOD** | QCR | Free | 1M | **Yes** |
| **BERTJudge-Free-QCR-100k** | QCR | Free | 100k | No |
| **BERTJudge-Free-QCR-200k** | QCR | Free | 200k | No |
| **BERTJudge-Free-QCR-500k** | QCR | Free | 500k | No |
---

## Intended Use

### How to Use

These models are used as sequence classifiers that output a binary verdict: 0 for incorrect, 1 for correct.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "hgissbkh/BERTJudge-Free-QCR"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

question = "What is the capital of France?"
reference = "Paris"
candidate = "The capital city is Paris."

# Construct the input based on the model's format (QCR here)
input_text = f"Question: {question} Reference: {reference} Candidate: {candidate}"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1)

print("Correct" if prediction.item() == 1 else "Incorrect")
```
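For evaluating many candidates at once, the single-example snippet above can be wrapped in a batched helper. This is a sketch rather than part of the released code: the `judge_batch` name is invented here, and it assumes the QCR input template and the class convention (index 1 = correct) shown above.

```python
import torch

def judge_batch(model, tokenizer, examples, batch_size=32):
    """Return a 0/1 verdict for each (question, reference, candidate) triple."""
    verdicts = []
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        texts = [
            f"Question: {q} Reference: {r} Candidate: {c}"
            for q, r, c in chunk
        ]
        # Pad and truncate so variable-length examples can share one batch
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        verdicts.extend(torch.argmax(logits, dim=-1).tolist())
    return verdicts
```

Pass the `model` and `tokenizer` loaded as in the snippet above; batching amortizes the encoder's forward cost, which is where the efficiency advantage over LLM-as-a-Judge comes from.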