Update README.md
README.md
CHANGED
|
@@ -179,8 +179,23 @@ We use Best-of-64 as our evaluation metric. The weighting methods are different
|
|
| 179 |
| | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | 45.2 | 48.0 |
|
| 180 |
| | EurusPRM-Stage 2 | 84.8 | 53.0 | 16.7 | 43.2 | **45.6** | 48.7 |
|
| 181 |
|
| 182 |
|
| 183 |
|
| 184 |
## Citation
|
| 185 |
|
| 186 |
```latex
|
|
|
|
| 179 |
| | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | 45.2 | 48.0 |
|
| 180 |
| | EurusPRM-Stage 2 | 84.8 | 53.0 | 16.7 | 43.2 | **45.6** | 48.7 |
|
| 181 |
|
| 182 |
+
### ProcessBench
|
| 183 |
|
| 184 |
+
We evaluate **EurusPRM-Stage 1** and **EurusPRM-Stage 2** on **ProcessBench**.
|
| 185 |
+
We convert the raw score of each step to a probability with the sigmoid function, then sweep the decision threshold to find the value that yields the highest F1 on the GSM8k sub-benchmark. The resulting thresholds are 0.5015 for **EurusPRM-Stage 1** and 0.5005 for **EurusPRM-Stage 2**.
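
The threshold search can be sketched roughly as follows. This is a minimal illustration, not the evaluation code used for the numbers in this README: the function names, the per-step boolean labels (`True` meaning the step is erroneous), the candidate grid, and the use of a plain binary F1 are all assumptions made for the example. ProcessBench's own F1 may be defined differently.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw reward score to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def f1_score(preds, labels) -> float:
    """Plain binary F1 over step-level predictions (an assumed stand-in metric)."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(raw_scores, labels, candidates):
    """Return (threshold, f1) for the candidate with the highest F1.

    A step is flagged as erroneous when its sigmoid score falls below
    the threshold (an assumption about the decision rule).
    """
    probs = [sigmoid(s) for s in raw_scores]
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [p < t for p in probs]
        f1 = f1_score(preds, labels)
        if f1 > best_f1:  # strict comparison keeps the first best candidate
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical toy data: two erroneous steps with low scores, two correct ones.
toy_scores = [-2.0, -1.0, 1.0, 2.0]
toy_labels = [True, True, False, False]
toy_grid = [i / 10 for i in range(1, 10)]
threshold, f1 = best_threshold(toy_scores, toy_labels, toy_grid)
```

In practice the candidate grid would need to be much finer than in this toy example, since the reported thresholds (0.5015 and 0.5005) differ only in the fourth decimal place.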
|
| 186 |
+
To better leverage the capability of **EurusPRM**, we prepend ``Step K`` (where K is the actual index of the step) to each step in **ProcessBench**.
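
As a rough illustration of this preprocessing, the helper below adds the prefix to a list of solution steps. The function name `add_step_prefix`, the 1-based numbering, and the `": "` separator after the index are assumptions for the sketch; the README does not specify the exact formatting.

```python
def add_step_prefix(steps):
    """Prefix each step with 'Step K', where K is its index.

    Numbering from 1 and the ': ' separator are assumptions,
    not confirmed details of the original evaluation setup.
    """
    return [f"Step {k}: {step}" for k, step in enumerate(steps, start=1)]


# Hypothetical two-step solution.
example_steps = ["Let x be the number of apples.", "Then 2x = 10, so x = 5."]
prefixed = add_step_prefix(example_steps)
```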
|
| 187 |
|
| 188 |
+
| Reward Model | GSM8k | MATH | OlympiadBench | Omni-Math | Avg |
|
| 189 |
+
| --- | --- | --- | --- | --- | --- |
|
| 190 |
+
| Math-Shepherd-PRM-7B | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
|
| 191 |
+
| RLHFlow-PRM-Mistral-8B | 50.4 | 33.4 | 13.8 | 15.8 | 28.4 |
|
| 192 |
+
| RLHFlow-PRM-Deepseek-8B | 38.8 | 33.8 | 16.9 | 16.9 | 26.6 |
|
| 193 |
+
| Skywork-PRM-7B | **70.8** | **53.6** | 22.9 | 21.0 | 42.1 |
|
| 194 |
+
| EurusPRM-Stage 1 | 54.7 | 41.2 | 24.7 | 17.5 | 30.6 |
|
| 195 |
+
| EurusPRM-Stage 1-no-step | 42.1 | 33.1 | 13.2 | 15.4 | 23.1 |
|
| 196 |
+
| EurusPRM-Stage 2 | 67.0 | 53.2 | **35.4** | **30.7** | **42.8** |
|
| 197 |
+
| EurusPRM-Stage 2-no-step | 56.6 | 43.0 | 27.3 | 26.8 | 35.1 |
|
| 198 |
+
|
| 199 |
## Citation
|
| 200 |
|
| 201 |
```latex
|