Text Classification
Transformers
Safetensors
qwen2
text-generation
text-embeddings-inference
File size: 5,681 Bytes
0e89b76
e5186bd
 
8ec837e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e5186bd
 
 
0e89b76
 
94dce1a
 
 
 
 
8ec837e
9fd43a7
8ec837e
94dce1a
 
 
 
 
116c985
 
 
94dce1a
 
 
00148e1
8ec837e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0e89b76
1463130
42eeb37
1463130
42eeb37
 
 
 
 
 
 
 
 
 
 
 
1463130
42eeb37
 
44c48a5
42eeb37
 
 
 
c9f8692
42eeb37
 
0e89b76
 
 
 
94dce1a
8ec837e
94dce1a
8ec837e
94dce1a
 
 
02bd081
8ec837e
 
 
02bd081
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
---
base_model:
- Qwen/Qwen2.5-7B-Instruct
datasets:
- virtuoussy/Multi-subject-RLVR
- sarosavo/Master-RM
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
library_name: transformers
license: apache-2.0
pipeline_tag: text-classification
---

# Robust Reward Model for LLM-as-a-Judge

This repository contains a robust, general-domain generative reward model presented in the paper [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794).

- **Paper**: [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794)
- **Training Data**: [https://huggingface.co/datasets/sarosavo/Master-RM](https://huggingface.co/datasets/sarosavo/Master-RM)
<!-- - **Code/GitHub Repository**: [https://github.com/Yulai-Zhao/Robust-Reward-Model](https://github.com/Yulai-Zhao/Robust-Reward-Model) -->
- **Training algorithm**: Standard supervised fine-tuning, see Appendix A.2 for more details.

## Model Description

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.

We find that such weakness is widespread across various LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR. 

To mitigate this issue, we train a robust general-domain generative model by leverating a simple yet effective data augmentation strategy. Our reward model demonstates substantially improved robustness over the most advanced commencial models (e.g., GPT-4o, GPT-o1, Claude-4) and specialized generative verifiers (e.g., Omni-Judge, Generative-Verifier).

## How to use

Inputting the question, its ground-truth reference, and the response to be evaluated, the model will judge its correctness. An example inference script is provided below.

> ```python
> from transformers import AutoTokenizer, AutoModelForCausalLM
> 
> tokenizer = AutoTokenizer.from_pretrained("sarosavo/Master-RM")
> model = AutoModelForCausalLM.from_pretrained("sarosavo/Master-RM")
> 
> PROMPT= '''
> Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.  
> The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.  
> **The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**  
> 
> Your task:  
> - Compare the final output of the solution process with the reference answer.  
> - If they **match exactly**, output **YES**.  
> - If they **do not match**, output **NO**.  
> - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.  
> 
> Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.  
> 
> ---
> 
> **Question:**  
> {question}  
> 
> **Solution Process (Final Step Only):**  
> {response}  
> 
> **Reference Answer:**  
> {reference}  
> 
> **Output:**  
> '''
> 
> 
> question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (  )."
> label="Chen Heqin"
> answer="heqin chen"
> 
> prompt_question = PROMPT.format(question=question, reference=label, response=answer)
> messages=[
>            {"role": "system", "content": "You are a helpful assistant."},
>            {"role": "user", "content": prompt_question},
>          ]
> 
> input_ids=tokenizer.apply_chat_template(messages,return_tensors="pt")
> output=model.generate(input_ids,do_sample=False)
> judgement=tokenizer.decode(output[0][input_ids.shape[1]:],skip_special_tokens=True)
> print("Model judgement: ",judgement)
> ```

## Use this reward model for RLVR training

### 1. Launch a remote reward server with vllm

The script below will launch a reward at http://127.0.0.1:8000/get_reward

```bash
bash reward_server/launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}

# MODEL_PATH: the path of our reward model.
# ANSWER_PATH: the path of the training data.
# METRIC: greedy/prob
# This will launch a reward at http://127.0.0.1:8000/get_reward
```

### 2. Start RLVR training

```bash
bash reward_server/RLVR_train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}

# METHOD:          advantage estimator, e.g., reinforce_baseline, reinforce, rloo
# PRETRAIN_PATH:   path to the pretrained model, e.g., Qwen2.5-7B
# DATA_PATH:       path to the QA data with which we want to perform RL reasoning
# REWARD_API:      remote reward server url, e.g., http://127.0.0.1:8000/get_reward
```

## Citation

If you use this model, please cite:

```bibtex
@article{zhao2025one,
  title={One Token to Fool LLM-as-a-Judge},
  author={Zhao, Yulai and Liu, Haolin and Yu, Dian and Kung, S.Y. and Mi, Haitao and Yu, Dong},
  journal={arXiv preprint arXiv:2507.08794},
  year={2025}
}
```

## Acknowledgements

The development of this model is built upon [Qwen2.5-7B-Instruct-RLVR](https://huggingface.co/virtuoussy/Qwen2.5-7B-Instruct-RLVR).