Master-RM / README.md

Update README.md

9fd43a7 verified 7 months ago

5.68 kB

	---
	base_model:
	- Qwen/Qwen2.5-7B-Instruct
	datasets:
	- virtuoussy/Multi-subject-RLVR
	- sarosavo/Master-RM
	language:
	- zho
	- eng
	- fra
	- spa
	- por
	- deu
	- ita
	- rus
	- jpn
	- kor
	- vie
	- tha
	- ara
	library_name: transformers
	license: apache-2.0
	pipeline_tag: text-classification
	---

	# Robust Reward Model for LLM-as-a-Judge

	This repository contains a robust, general-domain generative reward model presented in the paper [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794).

	- Paper: [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794)
	- Training Data: [https://huggingface.co/datasets/sarosavo/Master-RM](https://huggingface.co/datasets/sarosavo/Master-RM)
	<!-- - Code/GitHub Repository: [https://github.com/Yulai-Zhao/Robust-Reward-Model](https://github.com/Yulai-Zhao/Robust-Reward-Model) -->
	- Training algorithm: Standard supervised fine-tuning, see Appendix A.2 for more details.

	## Model Description

	Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.

	We find that such weakness is widespread across various LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR.

	To mitigate this issue, we train a robust general-domain generative model by leverating a simple yet effective data augmentation strategy. Our reward model demonstates substantially improved robustness over the most advanced commencial models (e.g., GPT-4o, GPT-o1, Claude-4) and specialized generative verifiers (e.g., Omni-Judge, Generative-Verifier).

	## How to use

	Inputting the question, its ground-truth reference, and the response to be evaluated, the model will judge its correctness. An example inference script is provided below.

	> ```python
	> from transformers import AutoTokenizer, AutoModelForCausalLM
	>
	> tokenizer = AutoTokenizer.from_pretrained("sarosavo/Master-RM")
	> model = AutoModelForCausalLM.from_pretrained("sarosavo/Master-RM")
	>
	> PROMPT= '''
	> Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
	> The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
	> The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.
	>
	> Your task:
	> - Compare the final output of the solution process with the reference answer.
	> - If they match exactly, output YES.
	> - If they do not match, output NO.
	> - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output NO.
	>
	> Your output must be strictly 'YES' or 'NO', with no additional words, punctuation, or explanation.
	>
	> ---
	>
	> Question:
	> {question}
	>
	> Solution Process (Final Step Only):
	> {response}
	>
	> Reference Answer:
	> {reference}
	>
	> Output:
	> '''
	>
	>
	> question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (　　)."
	> label="Chen Heqin"
	> answer="heqin chen"
	>
	> prompt_question = PROMPT.format(question=question, reference=label, response=answer)
	> messages=[
	> {"role": "system", "content": "You are a helpful assistant."},
	> {"role": "user", "content": prompt_question},
	> ]
	>
	> input_ids=tokenizer.apply_chat_template(messages,return_tensors="pt")
	> output=model.generate(input_ids,do_sample=False)
	> judgement=tokenizer.decode(output[0][input_ids.shape[1]:],skip_special_tokens=True)
	> print("Model judgement: ",judgement)
	> ```

	## Use this reward model for RLVR training

	### 1. Launch a remote reward server with vllm

	The script below will launch a reward at http://127.0.0.1:8000/get_reward

	```bash
	bash reward_server/launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}

	# MODEL_PATH: the path of our reward model.
	# ANSWER_PATH: the path of the training data.
	# METRIC: greedy/prob
	# This will launch a reward at http://127.0.0.1:8000/get_reward
	```

	### 2. Start RLVR training

	```bash
	bash reward_server/RLVR_train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}

	# METHOD: advantage estimator, e.g., reinforce_baseline, reinforce, rloo
	# PRETRAIN_PATH: path to the pretrained model, e.g., Qwen2.5-7B
	# DATA_PATH: path to the QA data with which we want to perform RL reasoning
	# REWARD_API: remote reward server url, e.g., http://127.0.0.1:8000/get_reward
	```

	## Citation

	If you use this model, please cite:

	```bibtex
	@article{zhao2025one,
	title={One Token to Fool LLM-as-a-Judge},
	author={Zhao, Yulai and Liu, Haolin and Yu, Dian and Kung, S.Y. and Mi, Haitao and Yu, Dong},
	journal={arXiv preprint arXiv:2507.08794},
	year={2025}
	}
	```

	## Acknowledgements

	The development of this model is built upon [Qwen2.5-7B-Instruct-RLVR](https://huggingface.co/virtuoussy/Qwen2.5-7B-Instruct-RLVR).