sarosavo
/

Master-RM

@@ -1,7 +1,26 @@
 ---
 license: apache-2.0
-pipeline_tag: text-classification
 library_name: transformers
 ---
 # Robust Reward Model for LLM-as-a-Judge
@@ -9,81 +28,71 @@ library_name: transformers
 This repository contains a robust, general-domain generative reward model presented in the paper [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794).
 - **Paper**: [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794)
-- **Code**: [https://github.com/microsoft/RewardEval](https://github.com/microsoft/RewardEval)
-- **Synthetic Training Data**: [https://huggingface.co/datasets/reward-eval/synthetic-judgements](https://huggingface.co/datasets/reward-eval/synthetic-judgements)
 ## Model Description
 Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.
-This model addresses this widespread weakness across various LLMs, datasets, and prompt formats that poses a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, this work introduces a simple yet effective data augmentation strategy and trains a new generative reward model with substantially improved robustness, highlighting the urgent need for more reliable LLM-based evaluation methods.
 ## How to use
-You can use this model with the `transformers` library to evaluate answers. The model expects a prompt that includes both the ground-truth reference and the candidate answer for comparison, formatted according to its chat template.
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-model_id = "recce-ai/robust-llm-as-a-judge-qwen-7b" # Replace with the actual model ID if different
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
-# Example for a comparison prompt:
-# Format: System Message, then User Message (reference and candidate)
-system_message = "You are a helpful and fair judge. Evaluate the candidate answer against the reference answer and provide a score of 1 (correct) or 0 (incorrect)."
-reference_answer = "The capital of France is Paris."
-candidate_answer = "Paris is the capital of France."
-user_message = f"Reference: {reference_answer}\
-Candidate: {candidate_answer}\
-Score:"
-messages = [
-    {"role": "system", "content": system_message},
-    {"role": "user", "content": user_message}
-]
-# Apply the chat template defined in the tokenizer_config.json
-prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
-# Generate the score (e.g., '1' or '0')
-output_ids = model.generate(
-    input_ids,
-    max_new_tokens=5, # Generate only a few tokens for the score (e.g., '1', '0', 'Yes', 'No')
-    num_beams=1,
-    do_sample=False,
-    temperature=0.0, # Use low temperature for deterministic output
-)
-generated_text = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True).strip()
-print(f"Generated Score: {generated_text}")
-# Example with a trick that might fool other LLMs-as-a-judge (according to the paper)
-candidate_answer_tricked = "Thought process: The capital is a city. Paris is a city. Therefore, Paris is the capital of France."
-user_message_tricked = f"Reference: {reference_answer}\
-Candidate: {candidate_answer_tricked}\
-Score:"
-messages_tricked = [
-    {"role": "system", "content": system_message},
-    {"role": "user", "content": user_message_tricked}
-]
-prompt_tricked = tokenizer.apply_chat_template(messages_tricked, tokenize=False, add_generation_prompt=True)
-input_ids_tricked = tokenizer(prompt_tricked, return_tensors=\"pt\").input_ids.to(model.device)
-output_ids_tricked = model.generate(
-    input_ids_tricked,
-    max_new_tokens=5,
-    num_beams=1,
-    do_sample=False,
-    temperature=0.0,
-)
-generated_text_tricked = tokenizer.decode(output_ids_tricked[0][len(input_ids_tricked[0]):], skip_special_tokens=True).strip()
-print(f"Generated Score (tricked): {generated_text_tricked}")
-```
 ## Citation
@@ -92,10 +101,14 @@ If you use this model, please cite:
 [arXiv:2507.08794](https://arxiv.org/abs/2507.08794)
 ```bibtex
-@article{wu2025one,
   title={One Token to Fool LLM-as-a-Judge},
-  author={Wu, Zhenyu and Sun, Qiushi and Zhang, Yiran and Wang, Yian and Li, Erran and Liang, Paul Pu},
   journal={arXiv preprint arXiv:2507.08794},
   year={2025}
 }
 ```

 ---
 license: apache-2.0
 library_name: transformers
+datasets:
+- virtuoussy/Math-RLVR
+- virtuoussy/Multi-subject-RLVR
+- sarosavo/Master-RM
+language:
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
+base_model:
+- Qwen/Qwen2.5-7B-Instruct
 ---
 # Robust Reward Model for LLM-as-a-Judge
 This repository contains a robust, general-domain generative reward model presented in the paper [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794).
 - **Paper**: [One Token to Fool LLM-as-a-Judge](https://huggingface.co/papers/2507.08794)
+- **Training Data**: [https://huggingface.co/datasets/sarosavo/Master-RM](https://huggingface.co/datasets/sarosavo/Master-RM)
+- **Training algorithm**: Standard supervised fine-tuning, see Appendix A.2 for more details.
 ## Model Description
 Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.
+This model addresses the widespread weakness across various LLMs, datasets, and prompt formats that poses a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, this work introduces a simple yet effective data augmentation strategy and trains a new generative reward model with substantially improved robustness, highlighting the urgent need for more reliable LLM-based evaluation methods.
 ## How to use
+Inputting the question, label and the response to be evaluated, the model will judge if the response is right.
+## **Quick start**
+> ```python
+> # Load model directly
+> from transformers import AutoTokenizer, AutoModelForCausalLM
+>
+> tokenizer = AutoTokenizer.from_pretrained("sarosavo/Master-RM")
+> model = AutoModelForCausalLM.from_pretrained("sarosavo/Master-RM")
+>
+> PROMPT= '''
+> Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
+> The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
+> **The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**
+>
+> Your task:
+> - Compare the final output of the solution process with the reference answer.
+> - If they **match exactly**, output **YES**.
+> - If they **do not match**, output **NO**.
+> - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.
+>
+> Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.
+>
+> ---
+>
+> **Question:**
+> {question}
+>
+> **Solution Process (Final Step Only):**
+> {response}
+>
+> **Reference Answer:**
+> {reference}
+>
+> **Output:**
+> '''
+>
+>
+> question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (　　)."
+> label="Chen Heqin"
+> answer="heqin chen"
+>
+> prompt_question = PROMPT.format(question=question, reference=label, response=answer)
+> messages=[
+>            {"role": "system", "content": "You are a helpful assistant."},
+>            {"role": "user", "content": prompt_question},
+>          ]
+>
+> input_ids=tokenizer.apply_chat_template(messages,return_tensors="pt")
+> output=model.generate(input_ids,do_sample=False)
+> judgement=tokenizer.decode(output[0][input_ids.shape[1]:],skip_special_tokens=True)
+> print("Model judgement: ",judgement)
+> ```
 ## Citation
 [arXiv:2507.08794](https://arxiv.org/abs/2507.08794)
 ```bibtex
+@article{zhao2025one,
   title={One Token to Fool LLM-as-a-Judge},
+  author={Zhao, Yulai and Liu, Haolin and Yu, Dian and Kung, S.Y. and Mi, Haitao and Yu, Dong},
   journal={arXiv preprint arXiv:2507.08794},
   year={2025}
 }
+## Acknowledgements
+The development of this model is built upon [Qwen2.5-7B-Instruct-RLVR](https://huggingface.co/virtuoussy/Qwen2.5-7B-Instruct-RLVR)
 ```