---
base_model:
- Qwen/Qwen2.5-7B-Instruct
datasets:
- virtuoussy/Multi-subject-RLVR
- sarosavo/Master-RM
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
library_name: transformers
license: apache-2.0
pipeline_tag: text-classification
---
# Robust Reward Model for LLM-as-a-Judge
This repository contains a robust, general-domain generative reward model presented in the paper [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794).
- **Paper**: [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794)
- **Training Data**: [https://huggingface.co/datasets/sarosavo/Master-RM](https://huggingface.co/datasets/sarosavo/Master-RM)
- **Training algorithm**: Standard supervised fine-tuning; see Appendix A.2 of the paper for details.
## Model Description
Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.
We find that such weakness is widespread across various LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR.
To mitigate this issue, we train a robust general-domain generative reward model by leveraging a simple yet effective data augmentation strategy. Our reward model demonstrates substantially improved robustness over the most advanced commercial models (e.g., GPT-4o, GPT-o1, Claude-4) and specialized generative verifiers (e.g., Omni-Judge, Generative-Verifier).
## How to use
Given a question, its ground-truth reference answer, and a response to evaluate, the model judges whether the response is correct. An example inference script is provided below.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("sarosavo/Master-RM")
model = AutoModelForCausalLM.from_pretrained("sarosavo/Master-RM")

PROMPT = '''
Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
**The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**

Your task:
- Compare the final output of the solution process with the reference answer.
- If they **match exactly**, output **YES**.
- If they **do not match**, output **NO**.
- If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.

Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.

---

**Question:**
{question}

**Solution Process (Final Step Only):**
{response}

**Reference Answer:**
{reference}

**Output:**
'''

question = "The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is ( )."
label = "Chen Heqin"
answer = "heqin chen"

prompt_question = PROMPT.format(question=question, reference=label, response=answer)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_question},
]

# Build the chat-formatted input and append the generation prompt so the
# model replies as the assistant.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, do_sample=False)
# Decode only the newly generated tokens, i.e., the judgement.
judgement = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print("Model judgement:", judgement)
```
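In an RLVR loop, the decoded verdict is typically collapsed into a scalar reward. A minimal sketch of that mapping is shown below; the helper name `judgement_to_reward` is illustrative, not part of this repository's code.

```python
def judgement_to_reward(judgement: str) -> float:
    # Map the judge's textual verdict onto a binary reward signal:
    # 1.0 for YES, 0.0 for anything else (including malformed output).
    return 1.0 if judgement.strip().upper().startswith("YES") else 0.0
```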
## Use this reward model for RLVR training
### 1. Launch a remote reward server with vllm
The script below launches a reward server at http://127.0.0.1:8000/get_reward:
```bash
bash reward_server/launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}
# MODEL_PATH: the path of our reward model.
# ANSWER_PATH: the path of the training data.
# METRIC: greedy/prob
# This will launch a reward at http://127.0.0.1:8000/get_reward
```
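A trainer can then query this endpoint over HTTP during rollout scoring. The sketch below shows one way such a client might look; the JSON request/response schema (`query`, `response`, `reward` fields) is an assumption for illustration, not the server's documented API.

```python
import json
import urllib.request

# Endpoint printed by launch_reward.sh.
REWARD_API = "http://127.0.0.1:8000/get_reward"

def build_payload(query: str, response: str) -> bytes:
    # Bundle the prompt and the candidate response for the reward server.
    # The field names here are assumed, not taken from the repo.
    return json.dumps({"query": query, "response": response}).encode("utf-8")

def get_reward(query: str, response: str, url: str = REWARD_API) -> float:
    # POST the pair to the reward server and read back a scalar reward.
    req = urllib.request.Request(
        url,
        data=build_payload(query, response),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return float(json.loads(resp.read())["reward"])
```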
### 2. Start RLVR training
```bash
bash reward_server/RLVR_train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}
# METHOD: advantage estimator, e.g., reinforce_baseline, reinforce, rloo
# PRETRAIN_PATH: path to the pretrained model, e.g., Qwen2.5-7B
# DATA_PATH: path to the QA data with which we want to perform RL reasoning
# REWARD_API: remote reward server url, e.g., http://127.0.0.1:8000/get_reward
```
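For intuition on the `METHOD` choices, the `rloo` advantage estimator uses a leave-one-out baseline within each group of sampled responses. A conceptual sketch (not the training script's actual implementation) is:

```python
def rloo_advantages(rewards):
    # RLOO: each sample's advantage is its reward minus the mean reward
    # of the *other* samples drawn for the same prompt.
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```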
## Citation
If you use this model, please cite:
```bibtex
@article{zhao2025one,
  title={One Token to Fool LLM-as-a-Judge},
  author={Zhao, Yulai and Liu, Haolin and Yu, Dian and Kung, S.Y. and Mi, Haitao and Yu, Dong},
  journal={arXiv preprint arXiv:2507.08794},
  year={2025}
}
```
## Acknowledgements
The development of this model is built upon [Qwen2.5-7B-Instruct-RLVR](https://huggingface.co/virtuoussy/Qwen2.5-7B-Instruct-RLVR).