puneeshkhanna committed · verified
Commit c2b78c8 · 1 parent: 0839f79
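
The updated evaluation setting (chat template applied, few-shot examples sent as multi-turn dialogue) corresponds to lm-evaluation-harness's `--apply_chat_template` and `--fewshot_as_multiturn` flags. A minimal sketch of such an invocation — the model path, task list, and batch size below are illustrative placeholders, not taken from the commit:

```shell
# Sketch: evaluate with the chat template applied and few-shot examples
# presented as a multi-turn conversation rather than a single prompt.
# <model-id> is a placeholder; --fewshot_as_multiturn requires
# --apply_chat_template to be set as well.
lm_eval --model hf \
    --model_args pretrained=<model-id> \
    --tasks mmlu,gsm8k \
    --num_fewshot 5 \
    --batch_size 8 \
    --apply_chat_template \
    --fewshot_as_multiturn
```

Keeping the batch size fixed across models, as the README notes, avoids batch-size-dependent score drift between the compared checkpoints.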

Update eval results with fewshot_as_multiturn

Files changed (1): README.md (+20 −20)
README.md CHANGED
@@ -93,7 +93,7 @@ print(response)
 ## Benchmarks
 We report in the following table our internal pipeline benchmarks.
 - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
-- We report **raw scores** obtained by applying chat template **without fewshot_as_multiturn** (unlike Llama3.1).
+- We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
 - We use same batch-size across all models.
 
 <table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
@@ -117,15 +117,15 @@ We report in the following table our internal pipeline benchmarks.
 <tr>
 <td rowspan="3">General</td>
 <td>MMLU (5-shot)</td>
-<td>55.9</td>
-<td><b>72.4</b></td>
-<td>68</td>
+<td>68.2</td>
+<td><b>73.5</b></td>
+<td>70.5</td>
 </tr>
 <tr>
 <td>MMLU-PRO (5-shot)</td>
-<td>21.8</td>
-<td>35.8</td>
-<td><b>40.7</b></td>
+<td>36.4</td>
+<td><b>43.1</b></td>
+<td>40.7</td>
 </tr>
 <tr>
 <td>IFEval</td>
@@ -136,28 +136,28 @@ We report in the following table our internal pipeline benchmarks.
 <tr>
 <td rowspan="3">Math</td>
 <td>GSM8K (5-shot)</td>
-<td>78.1</td>
-<td>77.5</td>
-<td><b>79.1</b></td>
+<td><b>82.6</b></td>
+<td>72.0</td>
+<td>81.4</td>
 </tr>
 <tr>
 <td>GSM8K (8-shot, COT)</td>
-<td>79.8</td>
-<td>72.7</td>
-<td><b>80.9</b></td>
+<td><b>85.4</b></td>
+<td>76.6</td>
+<td>79.7</td>
 </tr>
 <tr>
 <td>MATH Lvl-5 (4-shot)</td>
-<td>10.4</td>
-<td>26</td>
+<td>15.4</td>
+<td>-</td>
 <td><b>29.4</b></td>
 </tr>
 <tr>
 <td rowspan="5">Reasoning</td>
 <td>Arc Challenge (25-shot)</td>
-<td>46.6</td>
-<td>55.7</td>
-<td><b>65.9</b></td>
+<td>58.6</td>
+<td>57.8</td>
+<td><b>62.6</b></td>
 </tr>
 <tr>
 <td>GPQA (0-shot)</td>
@@ -179,8 +179,8 @@ We report in the following table our internal pipeline benchmarks.
 </tr>
 <tr>
 <td>BBH (3-shot)</td>
-<td>43.7</td>
-<td><b>53.9</b></td>
+<td>48.6</td>
+<td><b>54.2</b></td>
 <td>52.4</td>
 </tr>
 <tr>