- Best of N: By combining response sampling with a Best-of-N strategy, we select the response with the highest score as judged by the reward model, yielding better results at the cost of additional inference time. For example, Qwen2-Math-1.5B-Instruct obtains 79.9 on MATH in the RM@8 setting, surpassing even the 75.1 that Qwen2-Math-7B-Instruct achieves with greedy decoding.
- Comparison with majority voting (Maj@N): RM@N scores are substantially better than Maj@N scores across almost all benchmarks and models.
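The two selection strategies above can be sketched as follows. This is a minimal illustration, not the repository's evaluation code: `reward_model` is a stand-in for a real reward-model scorer, and the sampled answers are toy data.

```python
from collections import Counter

def best_of_n(candidates, reward_model):
    """RM@N: return the candidate with the highest reward-model score."""
    return max(candidates, key=reward_model)

def majority_vote(candidates):
    """Maj@N: return the most frequent final answer among the candidates."""
    return Counter(candidates).most_common(1)[0][0]

# Toy example: four sampled answers and stand-in reward scores.
sampled_answers = ["42", "41", "42", "43"]
scores = {"42": 0.2, "41": 0.9, "43": 0.1}  # hypothetical scores, not real model output

print(best_of_n(sampled_answers, lambda a: scores[a]))  # -> "41"
print(majority_vote(sampled_answers))                   # -> "42"
```

Note that the two strategies can disagree, as in this toy case: the reward model can promote a minority answer that the vote would discard, which is why RM@N and Maj@N are reported separately.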
## Model Details