andresnowak committed
Commit 010f7d6 · verified · 1 Parent(s): 2425482

Update README.md

Files changed (1)
  1. README.md +53 -0
README.md CHANGED
@@ -97,6 +97,19 @@ Answer:

  And the testing was done on ``` [Letter]. [Text answer]```

+ | Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+ |------------------|----------------|-------------------------------|
+ | ARC Challenge | 63.90% | 62.41% |
+ | ARC Easy | 81.64% | 77.87% |
+ | GPQA | 31.92% | 30.58% |
+ | Math QA | 31.84% | 31.11% |
+ | MCQA Evals | 42.60% | 38.44% |
+ | MMLU | 50.94% | 50.94% |
+ | MMLU Pro | 15.19% | 13.79% |
+ | MuSR | 53.04% | 51.19% |
+ | NLP4Education | 44.49% | 41.71% |
+ | **Overall** | **46.17%** | **44.23%** |
+

  ### Second evaluation: (type 0)
  ```
 
@@ -117,6 +130,19 @@ Answer:
  And the testing was done on ``` [Letter]. [Text answer]```


+ | Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+ |------------------|----------------|-------------------------------|
+ | ARC Challenge | 67.17% | 64.51% |
+ | ARC Easy | 83.71% | 79.57% |
+ | GPQA | 28.35% | 28.79% |
+ | Math QA | 36.38% | 34.66% |
+ | MCQA Evals | 45.06% | 38.31% |
+ | MMLU | 50.68% | 50.68% |
+ | MMLU Pro | 16.22% | 14.31% |
+ | MuSR | 53.04% | 51.19% |
+ | NLP4Education | 48.71% | 44.18% |
+ | **Overall** | **47.70%** | **45.13%** |
+

  ### Third evaluation: (type 2)
  ```
 
@@ -140,6 +166,19 @@ Your Response:
  And the testing was done on ``` [Letter]. [Text answer]```


+ | Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+ |------------------|----------------|-------------------------------|
+ | ARC Challenge | 49.97% | 46.02% |
+ | ARC Easy | 63.34% | 55.84% |
+ | GPQA | 17.41% | 20.09% |
+ | Math QA | 29.90% | 29.50% |
+ | MCQA Evals | 33.64% | 32.47% |
+ | MMLU | 50.94% | 50.94% |
+ | MMLU Pro | 14.09% | 11.21% |
+ | MuSR | 53.04% | 51.19% |
+ | NLP4Education | 38.47% | 37.06% |
+ | **Overall** | **38.98%** | **37.15%** |
+

  ### First evaluation: (type 0)
  ```
 
@@ -160,6 +199,20 @@ Answer:
  And the testing was done on ``` [Letter]```


+ | Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+ |------------------|----------------|-------------------------------|
+ | ARC Challenge | 68.46% | 68.46% |
+ | ARC Easy | 84.11% | 84.11% |
+ | GPQA | 37.95% | 37.95% |
+ | Math QA | 39.31% | 39.31% |
+ | MCQA Evals | 45.06% | 45.06% |
+ | MMLU | 50.75% | 50.75% |
+ | MMLU Pro | 19.25% | 19.25% |
+ | MuSR | 51.72% | 51.72% |
+ | NLP4Education | 49.80% | 49.80% |
+ | **Overall** | **49.60%** | **49.60%** |
+
+

  ### Framework versions

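
The **Overall** row in each added table matches the unweighted mean of the nine benchmark rows, rounded to two decimals. This is inferred from the reported numbers and is not stated in the README, so the sketch below is only a sanity check. The `macro_average` helper is a hypothetical name, and the values are copied from the Accuracy column of the table for the `[Letter]` answer format.

```python
# Accuracy (Acc) values copied from the table for the "[Letter]" answer format.
acc = {
    "ARC Challenge": 68.46,
    "ARC Easy": 84.11,
    "GPQA": 37.95,
    "Math QA": 39.31,
    "MCQA Evals": 45.06,
    "MMLU": 50.75,
    "MMLU Pro": 19.25,
    "MuSR": 51.72,
    "NLP4Education": 49.80,
}


def macro_average(scores):
    """Unweighted mean over benchmarks, rounded to two decimals.

    Assumption: every benchmark counts equally, regardless of how many
    questions it contains.
    """
    return round(sum(scores.values()) / len(scores), 2)


overall = macro_average(acc)
print(f"Overall: {overall:.2f}%")
```

Because the mean is unweighted, a small benchmark such as GPQA moves the Overall score as much as a large one such as MMLU. A sample-weighted average would give a different figure.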