puneeshkhanna committed · verified
Commit c2b78c8 · 1 parent: 0839f79
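
The updated evaluation setting (chat template applied, few-shot examples sent as multi-turn dialogue) corresponds to lm-evaluation-harness's `--apply_chat_template` and `--fewshot_as_multiturn` flags. A minimal sketch of such an invocation — the model path, task list, and batch size below are illustrative placeholders, not taken from the commit:

```shell
# Sketch: evaluate with the chat template applied and few-shot examples
# presented as a multi-turn conversation rather than a single prompt.
# <model-id> is a placeholder; --fewshot_as_multiturn requires
# --apply_chat_template to be set as well.
lm_eval --model hf \
    --model_args pretrained=<model-id> \
    --tasks mmlu,gsm8k \
    --num_fewshot 5 \
    --batch_size 8 \
    --apply_chat_template \
    --fewshot_as_multiturn
```

Keeping the batch size fixed across models, as the README notes, avoids batch-size-dependent score drift between the compared checkpoints.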

Update eval results with fewshot_as_multiturn

Files changed (1): README.md (+20 −20)
README.md CHANGED
@@ -93,7 +93,7 @@ print(response)
 ## Benchmarks
 We report in the following table our internal pipeline benchmarks.
 - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
-- We report **raw scores** obtained by applying chat template **without fewshot_as_multiturn** (unlike Llama3.1).
+- We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
 - We use same batch-size across all models.
 
 <table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
@@ -117,15 +117,15 @@ We report in the following table our internal pipeline benchmarks.
 <tr>
 <td rowspan="3">General</td>
 <td>MMLU (5-shot)</td>
-<td>55.9</td>
-<td><b>72.4</b></td>
-<td>68</td>
+<td>68.2</td>
+<td><b>73.5</b></td>
+<td>70.5</td>
 </tr>
 <tr>
 <td>MMLU-PRO (5-shot)</td>
-<td>21.8</td>
-<td>35.8</td>
-<td><b>40.7</b></td>
+<td>36.4</td>
+<td><b>43.1</b></td>
+<td>40.7</td>
 </tr>
 <tr>
 <td>IFEval</td>
@@ -136,28 +136,28 @@ We report in the following table our internal pipeline benchmarks.
 <tr>
 <td rowspan="3">Math</td>
 <td>GSM8K (5-shot)</td>
-<td>78.1</td>
-<td>77.5</td>
-<td><b>79.1</b></td>
+<td><b>82.6</b></td>
+<td>72.0</td>
+<td>81.4</td>
 </tr>
 <tr>
 <td>GSM8K (8-shot, COT)</td>
-<td>79.8</td>
-<td>72.7</td>
-<td><b>80.9</b></td>
+<td><b>85.4</b></td>
+<td>76.6</td>
+<td>79.7</td>
 </tr>
 <tr>
 <td>MATH Lvl-5 (4-shot)</td>
-<td>10.4</td>
-<td>26</td>
+<td>15.4</td>
+<td>-</td>
 <td><b>29.4</b></td>
 </tr>
 <tr>
 <td rowspan="5">Reasoning</td>
 <td>Arc Challenge (25-shot)</td>
-<td>46.6</td>
-<td>55.7</td>
-<td><b>65.9</b></td>
+<td>58.6</td>
+<td>57.8</td>
+<td><b>62.6</b></td>
 </tr>
 <tr>
 <td>GPQA (0-shot)</td>
@@ -179,8 +179,8 @@ We report in the following table our internal pipeline benchmarks.
 </tr>
 <tr>
 <td>BBH (3-shot)</td>
-<td>43.7</td>
-<td><b>53.9</b></td>
+<td>48.6</td>
+<td><b>54.2</b></td>
 <td>52.4</td>
 </tr>
 <tr>