| | --- |
| | base_model: |
| | - Qwen/Qwen2.5-7B-Instruct |
| | datasets: |
| | - virtuoussy/Multi-subject-RLVR |
| | - sarosavo/Master-RM |
| | language: |
| | - zho |
| | - eng |
| | - fra |
| | - spa |
| | - por |
| | - deu |
| | - ita |
| | - rus |
| | - jpn |
| | - kor |
| | - vie |
| | - tha |
| | - ara |
| | library_name: transformers |
| | license: apache-2.0 |
| | pipeline_tag: text-classification |
| | --- |
| | |
| | # Robust Reward Model for LLM-as-a-Judge |
| |
|
| | This repository contains a robust, general-domain generative reward model presented in the paper [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794). |
| |
|
| | - **Paper**: [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794) |
| | - **Training Data**: [https://huggingface.co/datasets/sarosavo/Master-RM](https://huggingface.co/datasets/sarosavo/Master-RM) |
| | - **Code/GitHub Repository**: [https://github.com/Yulai-Zhao/Robust-Reward-Model](https://github.com/Yulai-Zhao/Robust-Reward-Model) |
| | - **Training algorithm**: Standard supervised fine-tuning, see Appendix A.2 for more details. |
| |
|
| | ## Model Description |
| |
|
| | Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. |
| |
|
| | We find that such weakness is widespread across various LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR. |
| |
|
| | To mitigate this issue, we train a robust general-domain generative model by leverating a simple yet effective data augmentation strategy. Our reward model demonstates substantially improved robustness over the most advanced commencial models (e.g., GPT-4o, GPT-o1, Claude-4) and specialized generative verifiers (e.g., Omni-Judge, Generative-Verifier). |
| |
|
| | ## How to use |
| |
|
| | Inputting the question, its ground-truth reference, and the response to be evaluated, the model will judge its correctness. An example inference script is provided below. |
| |
|
| | > ```python |
| | > from transformers import AutoTokenizer, AutoModelForCausalLM |
| | > |
| | > tokenizer = AutoTokenizer.from_pretrained("sarosavo/Master-RM") |
| | > model = AutoModelForCausalLM.from_pretrained("sarosavo/Master-RM") |
| | > |
| | > PROMPT= ''' |
| | > Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer. |
| | > The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved. |
| | > **The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.** |
| | > |
| | > Your task: |
| | > - Compare the final output of the solution process with the reference answer. |
| | > - If they **match exactly**, output **YES**. |
| | > - If they **do not match**, output **NO**. |
| | > - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**. |
| | > |
| | > Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation. |
| | > |
| | > --- |
| | > |
| | > **Question:** |
| | > {question} |
| | > |
| | > **Solution Process (Final Step Only):** |
| | > {response} |
| | > |
| | > **Reference Answer:** |
| | > {reference} |
| | > |
| | > **Output:** |
| | > ''' |
| | > |
| | > |
| | > question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is ( )." |
| | > label="Chen Heqin" |
| | > answer="heqin chen" |
| | > |
| | > prompt_question = PROMPT.format(question=question, reference=label, response=answer) |
| | > messages=[ |
| | > {"role": "system", "content": "You are a helpful assistant."}, |
| | > {"role": "user", "content": prompt_question}, |
| | > ] |
| | > |
| | > input_ids=tokenizer.apply_chat_template(messages,return_tensors="pt") |
| | > output=model.generate(input_ids,do_sample=False) |
| | > judgement=tokenizer.decode(output[0][input_ids.shape[1]:],skip_special_tokens=True) |
| | > print("Model judgement: ",judgement) |
| | > ``` |
| |
|
| | ## Use this reward model for RLVR training |
| |
|
| | ### 1. Launch a remote reward server with vllm |
| |
|
| | The script below will launch a reward at http://127.0.0.1:8000/get_reward |
| | |
| | ```bash |
| | bash reward_server/launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC} |
| | |
| | # MODEL_PATH: the path of our reward model. |
| | # ANSWER_PATH: the path of the training data. |
| | # METRIC: greedy/prob |
| | # This will launch a reward at http://127.0.0.1:8000/get_reward |
| | ``` |
| | |
| | ### 2. Start RLVR training |
| | |
| | ```bash |
| | bash reward_server/RLVR_train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API} |
| | |
| | # METHOD: advantage estimator, e.g., reinforce_baseline, reinforce, rloo |
| | # PRETRAIN_PATH: path to the pretrained model, e.g., Qwen2.5-7B |
| | # DATA_PATH: path to the QA data with which we want to perform RL reasoning |
| | # REWARD_API: remote reward server url, e.g., http://127.0.0.1:8000/get_reward |
| | ``` |
| | |
| | ## Citation |
| | |
| | If you use this model, please cite: |
| | |
| | ```bibtex |
| | @article{zhao2025one, |
| | title={One Token to Fool LLM-as-a-Judge}, |
| | author={Zhao, Yulai and Liu, Haolin and Yu, Dian and Kung, S.Y. and Mi, Haitao and Yu, Dong}, |
| | journal={arXiv preprint arXiv:2507.08794}, |
| | year={2025} |
| | } |
| | ``` |
| | |
| | ## Acknowledgements |
| | |
| | The development of this model is built upon [Qwen2.5-7B-Instruct-RLVR](https://huggingface.co/virtuoussy/Qwen2.5-7B-Instruct-RLVR). |