Text Classification
Transformers
Safetensors
qwen2
text-generation
text-embeddings-inference
sarosavo committed on
Commit 116c985 · verified · 1 Parent(s): c5b438c

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -35,11 +35,13 @@ This repository contains a robust, general-domain generative reward model presen
 
 Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.
 
-This model addresses the widespread weakness across various LLMs, datasets, and prompt formats that poses a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, this work introduces a simple yet effective data augmentation strategy and trains a new generative reward model with substantially improved robustness, highlighting the urgent need for more reliable LLM-based evaluation methods.
+We find that such weakness is widespread across various LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR.
+
+To mitigate this issue, we train a robust general-domain generative reward model by leveraging a simple yet effective data augmentation strategy. Our reward model demonstrates substantially improved robustness over the most advanced commercial models (e.g., GPT-4o, GPT-o1, Claude-4) and specialized generative verifiers (e.g., Omni-Judge, Generative-Verifier).
 
 ## How to use
 
-Inputting the question, label and the response to be evaluated, the model will judge if the response is right.
+Given the question, its ground-truth reference, and the response to be evaluated, the model judges whether the response is correct.
 
 ## **Quick start**
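The "How to use" line in the updated README describes a question / reference / response judging call. A minimal sketch of what that might look like with Hugging Face Transformers is below; the repository id (`MODEL_ID`) and the exact prompt template are assumptions for illustration only, not the official Quick start format from the model card.

```python
# Hypothetical judge-call sketch. MODEL_ID and the prompt wording are
# placeholders -- consult the model card's Quick start for the real ones.

MODEL_ID = "path/to/this-reward-model"  # placeholder repo id, not the real one


def build_judge_prompt(question: str, reference: str, response: str) -> str:
    """Assemble the judge input: the question, its ground-truth
    reference, and the candidate response to be evaluated."""
    return (
        "You are a strict grader. Judge whether the response to the "
        "question is correct given the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Response: {response}\n"
        "Verdict (Yes/No):"
    )


def judge(question: str, reference: str, response: str) -> str:
    """Run the (assumed) qwen2-based judge with greedy decoding and
    return its short verdict string."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(
        build_judge_prompt(question, reference, response),
        return_tensors="pt",
    )
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    # Decode only the newly generated tokens (the verdict), not the prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()


if __name__ == "__main__":
    print(judge("What is 2 + 3?", "5", "The answer is 5."))
```

The model weights are only loaded inside `judge()`, so the prompt-building helper can be reused (e.g., for batch evaluation) without pulling in the model.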