The evaluation results of the five series of fused models are shown below, demonstrating that our FuseChat-3.0 models achieved varying degrees of improvement across the different target models. With Llama-3.1-8B-Instruct as the target model, our fusion model **FuseChat-Llama-3.1-8B-Instruct achieved an average performance improvement of 6.8 points across 14 benchmarks. Notably, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively**. Additionally, FuseChat-Llama-3.1-8B-Instruct outperformed AllenAI's recently released Llama-3.1-Tulu-3-8B model on all benchmarks except GSM8K and GPQA-Diamond. All these results demonstrate the effectiveness of FuseChat-3.0.

### FuseChat-Llama-3.1-8B-Instruct Performance

<table class="js-sort-table table hidden">
<tr>
  <td class="js-sort-string"><strong>Benchmarks</strong></td>
  <td class="js-sort-string"><strong>Llama-3.1-8B-Instruct</strong></td>
  <td class="js-sort-string"><strong>Llama-3.1-Tulu-3-8B</strong></td>
  <td class="js-sort-string"><strong>FuseChat-Llama-3.1-8B-SFT</strong></td>
  <td class="js-sort-string"><strong>FuseChat-Llama-3.1-8B-Instruct</strong></td>
</tr>
<tr>
  <td style="white-space: nowrap;">AlpacaEval-2 (LC %)</td>
  <td>28.3</td>
  <td>33.4</td>
  <td>41.3</td>
  <td><strong>65.4</strong></td>
</tr>
<tr>
  <td>Arena-Hard (WR %)</td>
  <td>28.1</td>
  <td>45.6</td>
  <td>38.7</td>
  <td><strong>58.2</strong></td>
</tr>
<tr>
  <td>MT-Bench</td>
  <td>8.38</td>
  <td>8.34</td>
  <td>8.54</td>
  <td><strong>9.00</strong></td>
</tr>
<tr>
  <td>AlignBench v1.1</td>
  <td>4.61</td>
  <td>6.20</td>
  <td>6.25</td>
  <td><strong>6.69</strong></td>
</tr>
<tr>
  <td>GSM8K</td>
  <td>85.9</td>
  <td><strong>88.6</strong></td>
  <td>87.0</td>
  <td>88.0</td>
</tr>
<tr>
  <td>MATH</td>
  <td>50.7</td>
  <td>47.5</td>
  <td>54.7</td>
  <td><strong>55.2</strong></td>
</tr>
<tr>
  <td>AMC 23</td>
  <td>25.0</td>
  <td>25.0</td>
  <td>30.0</td>
  <td><strong>37.5</strong></td>
</tr>
<tr>
  <td>LiveBench 0831</td>
  <td>27.6</td>
  <td>30.1</td>
  <td>30.2</td>
  <td><strong>32.0</strong></td>
</tr>
<tr>
  <td>MMLU-Pro</td>
  <td><strong>50.0</strong></td>
  <td>42.9</td>
  <td>47.8</td>
  <td>49.2</td>
</tr>
<tr>
  <td>MMLU-redux</td>
  <td>67.2</td>
  <td>66.3</td>
  <td>68.4</td>
  <td><strong>69.2</strong></td>
</tr>
<tr>
  <td>GPQA-Diamond</td>
  <td>33.8</td>
  <td>35.9</td>
  <td><strong>37.9</strong></td>
  <td>34.9</td>
</tr>
<tr>
  <td>HumanEval</td>
  <td>69.5</td>
  <td>66.5</td>
  <td>69.5</td>
  <td><strong>71.3</strong></td>
</tr>
<tr>
  <td>MBPP</td>
  <td><strong>75.4</strong></td>
  <td>56.3</td>
  <td>71.4</td>
  <td>72.0</td>
</tr>
<tr>
  <td>LiveCodeBench<br>2408-2411</td>
  <td>12.3</td>
  <td>10.6</td>
  <td>12.6</td>
  <td><strong>13.1</strong></td>
</tr>
<tr>
  <td>Average</td>
  <td>40.5</td>
  <td>40.2</td>
  <td>43.2</td>
  <td><strong>47.3</strong></td>
</tr>
</table>
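
The Average row treats all 14 scores uniformly, mixing the 10-point MT-Bench and AlignBench scales with the percentage-based metrics. As a quick sanity check, the headline numbers can be recomputed from the table; the sketch below is a minimal, hypothetical Python snippet (the dictionaries are transcribed from the Llama-3.1-8B-Instruct and FuseChat-Llama-3.1-8B-Instruct columns above, with "LiveCodeBench" denoting the 2408-2411 window), reproducing the 6.8-point average gain and the 37.1/30.1-point instruction-following gains:

```python
# Per-benchmark scores transcribed from the table above
# (baseline target model vs. the fused model).
llama_31_8b_instruct = {
    "AlpacaEval-2": 28.3, "Arena-Hard": 28.1, "MT-Bench": 8.38,
    "AlignBench v1.1": 4.61, "GSM8K": 85.9, "MATH": 50.7,
    "AMC 23": 25.0, "LiveBench 0831": 27.6, "MMLU-Pro": 50.0,
    "MMLU-redux": 67.2, "GPQA-Diamond": 33.8, "HumanEval": 69.5,
    "MBPP": 75.4, "LiveCodeBench": 12.3,
}
fusechat_llama_31_8b_instruct = {
    "AlpacaEval-2": 65.4, "Arena-Hard": 58.2, "MT-Bench": 9.00,
    "AlignBench v1.1": 6.69, "GSM8K": 88.0, "MATH": 55.2,
    "AMC 23": 37.5, "LiveBench 0831": 32.0, "MMLU-Pro": 49.2,
    "MMLU-redux": 69.2, "GPQA-Diamond": 34.9, "HumanEval": 71.3,
    "MBPP": 72.0, "LiveCodeBench": 13.1,
}

def avg(scores: dict) -> float:
    """Unweighted mean over all 14 benchmarks, as in the Average row."""
    return sum(scores.values()) / len(scores)

base_avg = avg(llama_31_8b_instruct)
fused_avg = avg(fusechat_llama_31_8b_instruct)
print(f"Average: {base_avg:.1f} -> {fused_avg:.1f} (+{fused_avg - base_avg:.1f})")
# Average: 40.5 -> 47.3 (+6.8)

for bench in ("AlpacaEval-2", "Arena-Hard"):
    delta = fusechat_llama_31_8b_instruct[bench] - llama_31_8b_instruct[bench]
    print(f"{bench}: +{delta:.1f}")
# AlpacaEval-2: +37.1
# Arena-Hard: +30.1
```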