Update README.md
README.md CHANGED

```diff
@@ -35,8 +35,8 @@ The table below compares mainstream models on the AIME24 and AIME25 benchmarks.
 </tr>
 <tr>
 <td>DeepSeek-R1-0528</td>
-<td><span style="color:
-<td><span style="color:
+<td><span style="color:grey">91.4</span></td>
+<td><span style="color:grey">87.5</span></td>
 </tr>
 <tr>
 <td>Qwen3-235B-A22B</td>
@@ -50,7 +50,7 @@ The table below compares mainstream models on the AIME24 and AIME25 benchmarks.
 </tr>
 <tr>
 <td>Gemini-2.5-Pro-0506</td>
-<td><span style="color:
+<td><span style="color:grey">90.8</span></td>
 <td><span style="color:grey">83</span></td>
 </tr>
 <!-- 合并行表头 32B -->
@@ -89,32 +89,3 @@ The table below compares mainstream models on the AIME24 and AIME25 benchmarks.
 <td><b>84.2</b></td>
 </tr>
 </table>
-
-> *Note: Generated results for AIME24/25 are available in the [`pcl_reasoner_v1/eval/eval_res`](https://openi.pcl.ac.cn/PCL-Reasoner/V1) directory for developer verification and comparison.*
-
-#### Impact of Answer Length on Accuracy
-We analyzed the relationship between maximum answer length (`max_tokens`) and model accuracy. Due to results listed below, we find that on AIME24 which is relatively simpler, decode length of 64K are sufficient to achieve peak accuracy of 85.7%. In contrast, AIME25 which is relatively harder requires 128K tokens to reach optimal performance (84.2%):
-
-<table>
-<tr>
-<th>max tokens</th>
-<th>16K</th>
-<th>32K</th>
-<th>64K</th>
-<th>128K</th>
-</tr>
-<tr>
-<td>AIME24</td>
-<td>42.0</td>
-<td>77.9</td>
-<td>85.7</td>
-<td>85.7</td>
-</tr>
-<tr>
-<td>AIME25</td>
-<td>33.4</td>
-<td>75.6</td>
-<td>83.9</td>
-<td>84.2</td>
-</tr>
-</table>
```
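The last hunk removes the `max_tokens`-vs-accuracy table. Its claim — that a 64K decode budget already reaches peak accuracy on AIME24, while AIME25 needs 128K — can be sanity-checked from the removed values themselves. A minimal sketch (the `ACCURACY` dict and helper name are hypothetical; the numbers are copied from the removed table, with 16K/32K/64K/128K taken as token counts):

```python
# Accuracy (%) by maximum decode length, copied from the table removed in this diff.
# Keys are max_tokens budgets: 16K, 32K, 64K, 128K.
ACCURACY = {
    "AIME24": {16_384: 42.0, 32_768: 77.9, 65_536: 85.7, 131_072: 85.7},
    "AIME25": {16_384: 33.4, 32_768: 75.6, 65_536: 83.9, 131_072: 84.2},
}

def smallest_peak_budget(benchmark: str) -> int:
    """Return the smallest max_tokens budget that already reaches peak accuracy."""
    scores = ACCURACY[benchmark]
    peak = max(scores.values())
    return min(tokens for tokens, acc in scores.items() if acc == peak)

print(smallest_peak_budget("AIME24"))  # 65536  -> 64K suffices (85.7%)
print(smallest_peak_budget("AIME25"))  # 131072 -> 128K needed (84.2%)
```

This reproduces the removed prose: AIME24 saturates at 64K, while AIME25 only reaches its best score at the 128K budget.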