Update README.md
Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.
We find that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR.
To mitigate this issue, we train a robust general-domain generative reward model by leveraging a simple yet effective data augmentation strategy. Our reward model demonstrates substantially improved robustness over the most advanced commercial models (e.g., GPT-4o, GPT-o1, Claude-4) and specialized generative verifiers (e.g., Omni-Judge, Generative-Verifier).
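The augmentation strategy itself is described in the paper rather than in this README. Purely as an illustration, the sketch below assumes one plausible form of the idea: pairing the superficial manipulations listed above with negative labels. The names `SUPERFICIAL_RESPONSES` and `augment` are ours, and the released training recipe may differ.

```python
# Hypothetical sketch (not the confirmed recipe): turn the superficial
# patterns that trigger false positive rewards into adversarial
# negative training examples for the reward model.
SUPERFICIAL_RESPONSES = [
    ":",                                       # non-word symbol
    ".",                                       # non-word symbol
    "Thought process:",                        # reasoning opener
    "Let's solve this problem step by step.",  # reasoning opener
]

def augment(question: str, reference: str) -> list[dict]:
    """Emit judge-training examples whose responses contain only
    superficial text and must therefore be labeled incorrect."""
    return [
        {
            "question": question,
            "reference": reference,
            "response": r,
            "label": "incorrect",
        }
        for r in SUPERFICIAL_RESPONSES
    ]

# Example: four adversarial negatives for one (question, reference) pair.
negatives = augment("What is 2 + 3?", "5")
```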
## How to use
Given the question, its ground-truth reference answer, and the response to be evaluated, the model judges whether the response is correct.
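A minimal sketch of this flow, assuming the model is released as a Hugging Face causal LM; the checkpoint ID and the judging prompt template below are placeholders, not the official ones:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint ID; substitute the released model's name.
MODEL_NAME = "your-org/robust-generative-reward-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

# Hypothetical judging prompt; the exact template the model expects
# may differ, so check the repository's official format.
prompt = (
    "Question: What is 2 + 3?\n"
    "Reference Answer: 5\n"
    "Response: 2 + 3 = 5\n"
    "Is the response correct? Answer Yes or No."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
judgment = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(judgment)  # expected to contain "Yes" for a correct response
```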
## **Quick start**