Added Average score for text benchmark
#4
by davasam - opened

README.md CHANGED
@@ -62,7 +62,18 @@ Please see our [blog post](https://huggingface.co/blog/ServiceNow-AI/apriel-1p6-
 <th>Claude 4.5 Sonnet (thinking)</th>
 <th>o3-mini (high)</th>
 </tr>
-
+<tr>
+<td></td>
+<td>Average Score**</td>
+<td>53.22</td>
+<td>46.56</td>
+<td>52.56</td>
+<td>51.92</td>
+<td>50.71</td>
+<td>62.58</td>
+<td>60.37</td>
+<td>48.85</td>
+</tr>
 <!-- Function Calling -->
 <tr>
 <td rowspan="5" class="category">Function Calling</td>
@@ -199,7 +210,7 @@ Please see our [blog post](https://huggingface.co/blog/ServiceNow-AI/apriel-1p6-
 <td>LCB</td>
 <td>81</td>
 <td>73</td>
-<td>
+<td>88</td>
 <td>77</td>
 <td>70</td>
 <td>84</td>
@@ -210,7 +221,7 @@ Please see our [blog post](https://huggingface.co/blog/ServiceNow-AI/apriel-1p6-
 <td>SciCode</td>
 <td>37</td>
 <td>35</td>
-<td>
+<td>39</td>
 <td>40</td>
 <td>41</td>
 <td>39</td>
@@ -244,7 +255,7 @@ Please see our [blog post](https://huggingface.co/blog/ServiceNow-AI/apriel-1p6-
 </tr>
 <tr>
 <td>Work-Arena L1</td>
-<td>
+<td>50.2</td>
 <td>51.5</td>
 <td>50.9</td>
 <td>63.9</td>
@@ -304,7 +315,7 @@ Please see our [blog post](https://huggingface.co/blog/ServiceNow-AI/apriel-1p6-
 <td>MMLU Pro</td>
 <td>79</td>
 <td>77</td>
-<td>
+<td>81</td>
 <td>85</td>
 <td>83</td>
 <td>84</td>
@@ -362,13 +373,17 @@ Please see our [blog post](https://huggingface.co/blog/ServiceNow-AI/apriel-1p6-
 <td>62</td>
 <td>68</td>
 <td>66</td>
-<td>
+<td>30***</td>
 </tr>
 </table>
 
 
 
-\*
+\* This score is with [DCA](https://arxiv.org/pdf/2402.17463) enabled. Without this, the model scores 36.
+
+\** The average score is calculated using all benchmarks except BFCL v3 Only and DeepResearchBench, since some models do not have scores for these two benchmarks.
+
+\*** AA LCR score for o3-mini-high is a projected score based on its AA Index score.
 
 ---
 
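The averaging rule stated in the PR's footnote (mean over all benchmarks except BFCL v3 Only and DeepResearchBench, which not every model reports) can be sketched as follows. The benchmark names and score values below are hypothetical placeholders for illustration, not the model's real per-benchmark results from the README.

```python
# Sketch of the footnote's averaging rule: average over all benchmarks
# EXCEPT "BFCL v3 Only" and "DeepResearchBench".
# NOTE: these scores are made-up placeholders, not real README values.
scores = {
    "LCB": 81,
    "SciCode": 37,
    "MMLU Pro": 79,
    "BFCL v3 Only": 64,        # excluded: not reported for every model
    "DeepResearchBench": 12,   # excluded: not reported for every model
}

EXCLUDED = {"BFCL v3 Only", "DeepResearchBench"}

included = [v for k, v in scores.items() if k not in EXCLUDED]
average = round(sum(included) / len(included), 2)
print(average)  # (81 + 37 + 79) / 3 = 65.67
```

Excluding the two sparsely reported benchmarks keeps the averages comparable across models, since a missing score would otherwise either shrink a model's denominator or force imputing a value.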