Update README.md
README.md
CHANGED
@@ -47,11 +47,97 @@ The Archer series focuses on research into RL algorithms and training for medium
## Evaluation
We conduct evaluation on both mathematical and coding benchmarks. Due to the high variance of the outputs from reasoning models, we report avg@K (pass@1 performance averaged over K outputs) and pass@K for each benchmark. The detailed results are shown in the table below.

-<
-
-<

-</div>

<table>
<thead>
## Evaluation
We conduct evaluation on both mathematical and coding benchmarks. Due to the high variance of the outputs from reasoning models, we report avg@K (pass@1 performance averaged over K outputs) and pass@K for each benchmark. The detailed results are shown in the table below.
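As a rough illustration of the two metrics named above, the sketch below computes avg@K as the per-sample accuracy averaged over the K outputs, and pass@K as the share of problems solved by at least one of the K outputs. The pass@K definition is an assumption here, since the README does not spell it out. The function names and toy data are hypothetical and are not taken from the Archer evaluation code.

```python
from typing import Sequence


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Per-sample accuracy (pass@1) averaged over the K outputs, in percent."""
    per_problem = [sum(samples) / len(samples) for samples in results]
    return 100.0 * sum(per_problem) / len(per_problem)


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Share of problems with at least one correct output among the K, in percent.

    Assumption: exactly K outputs are sampled per problem, so no
    combinatorial estimator is needed.
    """
    solved = [any(samples) for samples in results]
    return 100.0 * sum(solved) / len(solved)


# Hypothetical toy data: 3 problems, K = 4 sampled outputs each.
toy = [
    [True, True, False, True],     # 3 of 4 outputs correct
    [False, False, False, False],  # never solved
    [False, True, False, False],   # 1 of 4 outputs correct
]

print(f"avg@4  = {avg_at_k(toy):.1f}")
print(f"pass@4 = {pass_at_k(toy):.1f}")
```

On this toy input avg@4 is well below pass@4, which is why reporting both is informative for high-variance reasoning models.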

+<table>
+<thead>
+<tr>
+<th rowspan="2">Method</th>
+<th colspan="2">AIME24</th>
+<th colspan="2">AIME25</th>
+<th colspan="2">AMC23</th>
+<th colspan="2">MATH-500</th>
+<th colspan="2">Minerva</th>
+<th colspan="2">Olympiad</th>
+<th rowspan="2">Avg.</th>
+</tr>
+<tr>
+<th>avg@64</th>
+<th>pass@64</th>
+<th>avg@64</th>
+<th>pass@64</th>
+<th>avg@64</th>
+<th>pass@64</th>
+<th>avg@4</th>
+<th>pass@4</th>
+<th>avg@8</th>
+<th>pass@8</th>
+<th>avg@4</th>
+<th>pass@4</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>DeepSeek-R1-1.5B</td>
+<td>30.6</td><td>80.0</td>
+<td>23.5</td><td>63.3</td>
+<td>70.7</td><td>100.0</td>
+<td>83.6</td><td>92.4</td>
+<td>27.6</td><td>48.2</td>
+<td>44.6</td><td>59.4</td>
+<td>46.8</td>
+</tr>
+<tr>
+<td>DAPO</td>
+<td>42.1</td><td>80.0</td>
+<td>28.6</td><td>56.7</td>
+<td>80.3</td><td>97.5</td>
+<td>87.6</td><td>94.6</td>
+<td>29.2</td><td>46.3</td>
+<td>53.2</td><td>65.8</td>
+<td>53.5</td>
+</tr>
+<tr>
+<td>DeepScaleR-1.5B</td>
+<td>42.0</td><td><strong>83.3</strong></td>
+<td>29.0</td><td>63.3</td>
+<td>81.3</td><td>100.0</td>
+<td>87.7</td><td>93.6</td>
+<td>30.3</td><td>51.1</td>
+<td>50.7</td><td>61.0</td>
+<td>53.5</td>
+</tr>
+<tr>
+<td>FastCuRL-1.5B-V3</td>
+<td>48.1</td><td>80.0</td>
+<td>32.7</td><td>60.0</td>
+<td><strong>86.4</strong></td><td>95.0</td>
+<td>89.8</td><td>94.0</td>
+<td>33.6</td><td>50.0</td>
+<td>55.3</td><td>64.3</td>
+<td>57.7</td>
+</tr>
+<tr>
+<td>Nemotron-1.5B</td>
+<td>48.0</td><td>76.7</td>
+<td>33.1</td><td>60.0</td>
+<td>86.1</td><td>97.5</td>
+<td>90.6</td><td>93.6</td>
+<td>35.3</td><td>47.8</td>
+<td>59.2</td><td>66.8</td>
+<td>58.7</td>
+</tr>
+<tr>
+<td><strong>Archer-Math-1.5B</strong></td>
+<td><strong>48.7</strong></td><td><strong>83.3</strong></td>
+<td><strong>33.8</strong></td><td><strong>70.0</strong></td>
+<td>86.0</td><td><strong>97.5</strong></td>
+<td><strong>90.8</strong></td><td><strong>94.4</strong></td>
+<td><strong>35.7</strong></td><td><strong>51.1</strong></td>
+<td><strong>59.3</strong></td><td><strong>67.1</strong></td>
+<td><strong>59.1</strong></td>
+</tr>
+</tbody>
+</table>
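The README does not say how the Avg. column is computed. The reported values appear consistent with an unweighted mean of the six avg@K columns, with the pass@K columns excluded. The sketch below checks that assumption against the numbers copied from the table; the variable names are hypothetical.

```python
# avg@K scores per method, copied from the table, in column order:
# AIME24, AIME25, AMC23, MATH-500, Minerva, Olympiad.
avg_at_k_scores = {
    "DeepSeek-R1-1.5B": [30.6, 23.5, 70.7, 83.6, 27.6, 44.6],
    "DAPO": [42.1, 28.6, 80.3, 87.6, 29.2, 53.2],
    "DeepScaleR-1.5B": [42.0, 29.0, 81.3, 87.7, 30.3, 50.7],
    "FastCuRL-1.5B-V3": [48.1, 32.7, 86.4, 89.8, 33.6, 55.3],
    "Nemotron-1.5B": [48.0, 33.1, 86.1, 90.6, 35.3, 59.2],
    "Archer-Math-1.5B": [48.7, 33.8, 86.0, 90.8, 35.7, 59.3],
}

# Values of the Avg. column as reported in the table.
reported_avg = {
    "DeepSeek-R1-1.5B": 46.8,
    "DAPO": 53.5,
    "DeepScaleR-1.5B": 53.5,
    "FastCuRL-1.5B-V3": 57.7,
    "Nemotron-1.5B": 58.7,
    "Archer-Math-1.5B": 59.1,
}

# Assumption being tested: Avg. is the plain mean of the six avg@K scores.
gaps = {}
for method, scores in avg_at_k_scores.items():
    mean_score = sum(scores) / len(scores)
    gaps[method] = abs(mean_score - reported_avg[method])
    print(f"{method:18s} mean={mean_score:.2f} reported={reported_avg[method]:.1f}")

# Largest deviation between the recomputed mean and the reported value.
max_gap = max(gaps.values())
print(f"largest gap: {max_gap:.3f}")
```

Every recomputed mean lands within rounding distance of the reported figure, which supports the reading but does not confirm that this is how the authors derived the column.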


<table>
<thead>