Text Classification
Transformers
Safetensors
qwen2
text-generation
text-embeddings-inference
sarosavo commited on
Commit
8ec837e
·
verified ·
1 Parent(s): 94dce1a

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +83 -70
README.md CHANGED
@@ -1,7 +1,26 @@
1
  ---
2
  license: apache-2.0
3
- pipeline_tag: text-classification
4
  library_name: transformers
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5
  ---
6
 
7
  # Robust Reward Model for LLM-as-a-Judge
@@ -9,81 +28,71 @@ library_name: transformers
9
  This repository contains a robust, general-domain generative reward model presented in the paper [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794).
10
 
11
  - **Paper**: [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794)
12
- - **Code**: [https://github.com/microsoft/RewardEval](https://github.com/microsoft/RewardEval)
13
- - **Synthetic Training Data**: [https://huggingface.co/datasets/reward-eval/synthetic-judgements](https://huggingface.co/datasets/reward-eval/synthetic-judgements)
14
 
15
  ## Model Description
16
 
17
  Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.
18
 
19
- This model addresses this widespread weakness across various LLMs, datasets, and prompt formats that poses a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, this work introduces a simple yet effective data augmentation strategy and trains a new generative reward model with substantially improved robustness, highlighting the urgent need for more reliable LLM-based evaluation methods.
20
 
21
  ## How to use
22
 
23
- You can use this model with the `transformers` library to evaluate answers. The model expects a prompt that includes both the ground-truth reference and the candidate answer for comparison, formatted according to its chat template.
24
-
25
- ```python
26
- from transformers import AutoModelForCausalLM, AutoTokenizer
27
- import torch
28
-
29
- model_id = "recce-ai/robust-llm-as-a-judge-qwen-7b" # Replace with the actual model ID if different
30
- tokenizer = AutoTokenizer.from_pretrained(model_id)
31
- model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
32
-
33
- # Example for a comparison prompt:
34
- # Format: System Message, then User Message (reference and candidate)
35
- system_message = "You are a helpful and fair judge. Evaluate the candidate answer against the reference answer and provide a score of 1 (correct) or 0 (incorrect)."
36
- reference_answer = "The capital of France is Paris."
37
- candidate_answer = "Paris is the capital of France."
38
- user_message = f"Reference: {reference_answer}\
39
- Candidate: {candidate_answer}\
40
- Score:"
41
-
42
- messages = [
43
- {"role": "system", "content": system_message},
44
- {"role": "user", "content": user_message}
45
- ]
46
-
47
- # Apply the chat template defined in the tokenizer_config.json
48
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
49
-
50
- input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
51
-
52
- # Generate the score (e.g., '1' or '0')
53
- output_ids = model.generate(
54
- input_ids,
55
- max_new_tokens=5, # Generate only a few tokens for the score (e.g., '1', '0', 'Yes', 'No')
56
- num_beams=1,
57
- do_sample=False,
58
- temperature=0.0, # Use low temperature for deterministic output
59
- )
60
-
61
- generated_text = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True).strip()
62
- print(f"Generated Score: {generated_text}")
63
-
64
- # Example with a trick that might fool other LLMs-as-a-judge (according to the paper)
65
- candidate_answer_tricked = "Thought process: The capital is a city. Paris is a city. Therefore, Paris is the capital of France."
66
- user_message_tricked = f"Reference: {reference_answer}\
67
- Candidate: {candidate_answer_tricked}\
68
- Score:"
69
-
70
- messages_tricked = [
71
- {"role": "system", "content": system_message},
72
- {"role": "user", "content": user_message_tricked}
73
- ]
74
- prompt_tricked = tokenizer.apply_chat_template(messages_tricked, tokenize=False, add_generation_prompt=True)
75
- input_ids_tricked = tokenizer(prompt_tricked, return_tensors=\"pt\").input_ids.to(model.device)
76
-
77
- output_ids_tricked = model.generate(
78
- input_ids_tricked,
79
- max_new_tokens=5,
80
- num_beams=1,
81
- do_sample=False,
82
- temperature=0.0,
83
- )
84
- generated_text_tricked = tokenizer.decode(output_ids_tricked[0][len(input_ids_tricked[0]):], skip_special_tokens=True).strip()
85
- print(f"Generated Score (tricked): {generated_text_tricked}")
86
- ```
87
 
88
  ## Citation
89
 
@@ -92,10 +101,14 @@ If you use this model, please cite:
92
  [arXiv:2507.08794](https://arxiv.org/abs/2507.08794)
93
 
94
  ```bibtex
95
- @article{wu2025one,
96
  title={One Token to Fool LLM-as-a-Judge},
97
- author={Wu, Zhenyu and Sun, Qiushi and Zhang, Yiran and Wang, Yian and Li, Erran and Liang, Paul Pu},
98
  journal={arXiv preprint arXiv:2507.08794},
99
  year={2025}
100
  }
 
 
 
 
101
  ```
 
1
  ---
2
  license: apache-2.0
 
3
  library_name: transformers
4
+ datasets:
5
+ - virtuoussy/Math-RLVR
6
+ - virtuoussy/Multi-subject-RLVR
7
+ - sarosavo/Master-RM
8
+ language:
9
+ - zho
10
+ - eng
11
+ - fra
12
+ - spa
13
+ - por
14
+ - deu
15
+ - ita
16
+ - rus
17
+ - jpn
18
+ - kor
19
+ - vie
20
+ - tha
21
+ - ara
22
+ base_model:
23
+ - Qwen/Qwen2.5-7B-Instruct
24
  ---
25
 
26
  # Robust Reward Model for LLM-as-a-Judge
 
28
  This repository contains a robust, general-domain generative reward model presented in the paper [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794).
29
 
30
  - **Paper**: [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794)
31
+ - **Training Data**: [https://huggingface.co/datasets/sarosavo/Master-RM](https://huggingface.co/datasets/sarosavo/Master-RM)
32
+ - **Training algorithm**: Standard supervised fine-tuning, see Appendix A.2 for more details.
33
 
34
  ## Model Description
35
 
36
  Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.
37
 
38
+ This model addresses the widespread weakness across various LLMs, datasets, and prompt formats that poses a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, this work introduces a simple yet effective data augmentation strategy and trains a new generative reward model with substantially improved robustness, highlighting the urgent need for more reliable LLM-based evaluation methods.
39
 
40
  ## How to use
41
 
42
+ Inputting the question, label and the response to be evaluated, the model will judge if the response is right.
43
+
44
+ ## **Quick start**
45
+
46
+ > ```python
47
+ > # Load model directly
48
+ > from transformers import AutoTokenizer, AutoModelForCausalLM
49
+ >
50
+ > tokenizer = AutoTokenizer.from_pretrained("sarosavo/Master-RM")
51
+ > model = AutoModelForCausalLM.from_pretrained("sarosavo/Master-RM")
52
+ >
53
+ > PROMPT= '''
54
+ > Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
55
+ > The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
56
+ > **The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**
57
+ >
58
+ > Your task:
59
+ > - Compare the final output of the solution process with the reference answer.
60
+ > - If they **match exactly**, output **YES**.
61
+ > - If they **do not match**, output **NO**.
62
+ > - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.
63
+ >
64
+ > Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.
65
+ >
66
+ > ---
67
+ >
68
+ > **Question:**
69
+ > {question}
70
+ >
71
+ > **Solution Process (Final Step Only):**
72
+ > {response}
73
+ >
74
+ > **Reference Answer:**
75
+ > {reference}
76
+ >
77
+ > **Output:**
78
+ > '''
79
+ >
80
+ >
81
+ > question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (  )."
82
+ > label="Chen Heqin"
83
+ > answer="heqin chen"
84
+ >
85
+ > prompt_question = PROMPT.format(question=question, reference=label, response=answer)
86
+ > messages=[
87
+ > {"role": "system", "content": "You are a helpful assistant."},
88
+ > {"role": "user", "content": prompt_question},
89
+ > ]
90
+ >
91
+ > input_ids=tokenizer.apply_chat_template(messages,return_tensors="pt")
92
+ > output=model.generate(input_ids,do_sample=False)
93
+ > judgement=tokenizer.decode(output[0][input_ids.shape[1]:],skip_special_tokens=True)
94
+ > print("Model judgement: ",judgement)
95
+ > ```
 
 
 
 
 
 
 
 
 
 
96
 
97
  ## Citation
98
 
 
101
  [arXiv:2507.08794](https://arxiv.org/abs/2507.08794)
102
 
103
  ```bibtex
104
+ @article{zhao2025one,
105
  title={One Token to Fool LLM-as-a-Judge},
106
+ author={Zhao, Yulai and Liu, Haolin and Yu, Dian and Kung, S.Y. and Mi, Haitao and Yu, Dong},
107
  journal={arXiv preprint arXiv:2507.08794},
108
  year={2025}
109
  }
110
+
111
+ ## Acknowledgements
112
+
113
+ The development of this model is built upon [Qwen2.5-7B-Instruct-RLVR](https://huggingface.co/virtuoussy/Qwen2.5-7B-Instruct-RLVR)
114
  ```