wizardII committed on
Commit 87d49c0 · verified · 1 Parent(s): f4abf8a

Update README.md

Files changed (1)
  1. README.md +90 -4
README.md CHANGED
@@ -47,11 +47,97 @@ The Archer series focuses on research into RL algorithms and training for medium
  ## Evaluation
  We conduct evaluation on both mathematical and coding benchmarks. Due to the high variance of the outputs from reasoning models, we report avg@K (pass@1 performance averaged over K outputs) and pass@K for each benchmark. The detailed results are shown in the table below.
 
- <div align="center">
-
- <img src="assets/math_benchmark_table.png" width="100%"/>
+ <table>
+ <thead>
+ <tr>
+ <th rowspan="2">Method</th>
+ <th colspan="2">AIME24</th>
+ <th colspan="2">AIME25</th>
+ <th colspan="2">AMC23</th>
+ <th colspan="2">MATH-500</th>
+ <th colspan="2">Minerva</th>
+ <th colspan="2">Olympiad</th>
+ <th rowspan="2">Avg.</th>
+ </tr>
+ <tr>
+ <th>avg@64</th>
+ <th>pass@64</th>
+ <th>avg@64</th>
+ <th>pass@64</th>
+ <th>avg@64</th>
+ <th>pass@64</th>
+ <th>avg@4</th>
+ <th>pass@4</th>
+ <th>avg@8</th>
+ <th>pass@8</th>
+ <th>avg@4</th>
+ <th>pass@4</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>DeepSeek-R1-1.5B</td>
+ <td>30.6</td><td>80.0</td>
+ <td>23.5</td><td>63.3</td>
+ <td>70.7</td><td>100.0</td>
+ <td>83.6</td><td>92.4</td>
+ <td>27.6</td><td>48.2</td>
+ <td>44.6</td><td>59.4</td>
+ <td>46.8</td>
+ </tr>
+ <tr>
+ <td>DAPO</td>
+ <td>42.1</td><td>80.0</td>
+ <td>28.6</td><td>56.7</td>
+ <td>80.3</td><td>97.5</td>
+ <td>87.6</td><td>94.6</td>
+ <td>29.2</td><td>46.3</td>
+ <td>53.2</td><td>65.8</td>
+ <td>53.5</td>
+ </tr>
+ <tr>
+ <td>DeepScaleR-1.5B</td>
+ <td>42.0</td><td><strong>83.3</strong></td>
+ <td>29.0</td><td>63.3</td>
+ <td>81.3</td><td>100.0</td>
+ <td>87.7</td><td>93.6</td>
+ <td>30.3</td><td>51.1</td>
+ <td>50.7</td><td>61.0</td>
+ <td>53.5</td>
+ </tr>
+ <tr>
+ <td>FastCuRL-1.5B-V3</td>
+ <td>48.1</td><td>80.0</td>
+ <td>32.7</td><td>60.0</td>
+ <td><strong>86.4</strong></td><td>95.0</td>
+ <td>89.8</td><td>94.0</td>
+ <td>33.6</td><td>50.0</td>
+ <td>55.3</td><td>64.3</td>
+ <td>57.7</td>
+ </tr>
+ <tr>
+ <td>Nemotron-1.5B</td>
+ <td>48.0</td><td>76.7</td>
+ <td>33.1</td><td>60.0</td>
+ <td>86.1</td><td>97.5</td>
+ <td>90.6</td><td>93.6</td>
+ <td>35.3</td><td>47.8</td>
+ <td>59.2</td><td>66.8</td>
+ <td>58.7</td>
+ </tr>
+ <tr>
+ <td><strong>Archer-Math-1.5B</strong></td>
+ <td><strong>48.7</strong></td><td><strong>83.3</strong></td>
+ <td><strong>33.8</strong></td><td><strong>70.0</strong></td>
+ <td>86.0</td><td><strong>97.5</strong></td>
+ <td><strong>90.8</strong></td><td><strong>94.4</strong></td>
+ <td><strong>35.7</strong></td><td><strong>51.1</strong></td>
+ <td><strong>59.3</strong></td><td><strong>67.1</strong></td>
+ <td><strong>59.1</strong></td>
+ </tr>
+ </tbody>
+ </table>
 
- </div>
 
  <table>
  <thead>
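
As a reading aid for the avg@K and pass@K columns in the table added by this commit, the two metrics can be sketched from per-sample correctness flags. This is a minimal illustration, not the authors' evaluation code. It assumes pass@K means that at least one of the K sampled outputs is correct, which the README does not spell out. The function names `avg_at_k` and `pass_at_k` and the toy data are hypothetical.

```python
from typing import Sequence


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """avg@K in percent: per-problem accuracy over K samples, averaged over problems."""
    per_problem = [sum(samples) / len(samples) for samples in results]
    return 100.0 * sum(per_problem) / len(per_problem)


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """pass@K in percent, ASSUMING a problem counts as solved if any of its K samples is correct."""
    solved = [any(samples) for samples in results]
    return 100.0 * sum(solved) / len(solved)


# Toy data (hypothetical): 3 problems, K = 4 sampled outputs each.
results = [
    [True, False, True, True],     # 3 of 4 samples correct
    [False, False, False, False],  # never solved
    [False, True, False, False],   # 1 of 4 samples correct
]

print(f"avg@4  = {avg_at_k(results):.1f}")   # avg@4  = 33.3
print(f"pass@4 = {pass_at_k(results):.1f}")  # pass@4 = 66.7

# The Avg. column appears to be the mean of the six avg@K columns,
# checked here against the DeepSeek-R1-1.5B row of the table.
deepseek_avg_cols = [30.6, 23.5, 70.7, 83.6, 27.6, 44.6]
deepseek_mean = round(sum(deepseek_avg_cols) / len(deepseek_avg_cols), 1)
print(deepseek_mean)  # 46.8
```

Under these assumptions avg@K rewards consistency across samples, while pass@K only asks whether a correct answer appears at all. This is why the pass@K columns sit well above the avg@K columns for every method in the table.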