Update README.md
README.md
CHANGED
@@ -97,6 +97,19 @@ Answer:
 
 And the testing was done on ``` [Letter]. [Text answer]```
 
+| Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+|------------------|----------------|-------------------------------|
+| ARC Challenge | 63.90% | 62.41% |
+| ARC Easy | 81.64% | 77.87% |
+| GPQA | 31.92% | 30.58% |
+| Math QA | 31.84% | 31.11% |
+| MCQA Evals | 42.60% | 38.44% |
+| MMLU | 50.94% | 50.94% |
+| MMLU Pro | 15.19% | 13.79% |
+| MuSR | 53.04% | 51.19% |
+| NLP4Education | 44.49% | 41.71% |
+| **Overall** | **46.17%** | **44.23%** |
+
 
 ### Second evaluation: (type 0)
 ```
@@ -117,6 +130,19 @@ Answer:
 And the testing was done on ``` [Letter]. [Text answer]```
 
 
+| Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+|------------------|----------------|-------------------------------|
+| ARC Challenge | 67.17% | 64.51% |
+| ARC Easy | 83.71% | 79.57% |
+| GPQA | 28.35% | 28.79% |
+| Math QA | 36.38% | 34.66% |
+| MCQA Evals | 45.06% | 38.31% |
+| MMLU | 50.68% | 50.68% |
+| MMLU Pro | 16.22% | 14.31% |
+| MuSR | 53.04% | 51.19% |
+| NLP4Education | 48.71% | 44.18% |
+| **Overall** | **47.70%** | **45.13%** |
+
 
 ### Third evaluation: (type 2)
 ```
@@ -140,6 +166,19 @@ Your Response:
 And the testing was done on ``` [Letter]. [Text answer]```
 
 
+| Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+|------------------|----------------|-------------------------------|
+| ARC Challenge | 49.97% | 46.02% |
+| ARC Easy | 63.34% | 55.84% |
+| GPQA | 17.41% | 20.09% |
+| Math QA | 29.90% | 29.50% |
+| MCQA Evals | 33.64% | 32.47% |
+| MMLU | 50.94% | 50.94% |
+| MMLU Pro | 14.09% | 11.21% |
+| MuSR | 53.04% | 51.19% |
+| NLP4Education | 38.47% | 37.06% |
+| **Overall** | **38.98%** | **37.15%** |
+
 
 ### First evaluation: (type 0)
 ```
@@ -160,6 +199,20 @@ Answer:
 And the testing was done on ``` [Letter]```
 
 
+| Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+|------------------|----------------|-------------------------------|
+| ARC Challenge | 68.46% | 68.46% |
+| ARC Easy | 84.11% | 84.11% |
+| GPQA | 37.95% | 37.95% |
+| Math QA | 39.31% | 39.31% |
+| MCQA Evals | 45.06% | 45.06% |
+| MMLU | 50.75% | 50.75% |
+| MMLU Pro | 19.25% | 19.25% |
+| MuSR | 51.72% | 51.72% |
+| NLP4Education | 49.80% | 49.80% |
+| **Overall** | **49.60%** | **49.60%** |
+
+
 
 ### Framework versions
 
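
The **Overall** row in each added table appears to be an unweighted mean over the nine benchmarks, although the README does not state how it was computed. The sketch below checks that reading against the first table (the `[Letter]. [Text answer]` run with 46.17% / 44.23% overall). The dictionary layout and the `macro_average` helper are illustrative names, not part of the README.

```python
# Scores copied from the first added table, as (Acc, Acc Norm) in percent.
first_table = {
    "ARC Challenge": (63.90, 62.41),
    "ARC Easy": (81.64, 77.87),
    "GPQA": (31.92, 30.58),
    "Math QA": (31.84, 31.11),
    "MCQA Evals": (42.60, 38.44),
    "MMLU": (50.94, 50.94),
    "MMLU Pro": (15.19, 13.79),
    "MuSR": (53.04, 51.19),
    "NLP4Education": (44.49, 41.71),
}


def macro_average(scores):
    """Return the unweighted mean of each column, rounded to two decimals.

    Assumption: every benchmark counts equally, regardless of its size.
    """
    n = len(scores)
    acc = sum(a for a, _ in scores.values()) / n
    acc_norm = sum(b for _, b in scores.values()) / n
    return round(acc, 2), round(acc_norm, 2)


overall = macro_average(first_table)
print(overall)  # compare with the reported Overall row: 46.17% / 44.23%
```

An unweighted mean gives a small benchmark such as GPQA the same influence as a large one such as MMLU. A per-question (micro) average would need the question counts, which the README does not list.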