Update eval results with multi turn flag
README.md
CHANGED
@@ -90,7 +90,7 @@ print(response)
 ## Benchmarks
 We report in the following table our internal pipeline benchmarks.
 - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
-- We report **raw scores** obtained by applying chat template
+- We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
 - We use same batch-size across all models.
 
 <table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
@@ -116,17 +116,17 @@ We report in the following table our internal pipeline benchmarks.
 <tr>
 <td rowspan="3">General</td>
 <td>MMLU (5-shot)</td>
-<td>
-<td>
-<td
-<td>
+<td>61.2</td>
+<td><b>65.4</b></td>
+<td>57.3</td>
+<td>56.9</td>
 </tr>
 <tr>
 <td>MMLU-PRO (5-shot)</td>
-<td>
-<td>
-<td>
-<td
+<td>27.7</td>
+<td><b>32.6</b></td>
+<td>26.0</td>
+<td>29.7</td>
 </tr>
 <tr>
 <td>IFEval</td>
@@ -138,21 +138,21 @@ We report in the following table our internal pipeline benchmarks.
 <tr>
 <td rowspan="3">Math</td>
 <td>GSM8K (5-shot)</td>
-<td>
-<td>
-<td>
-<td
+<td><b>76.8</b></td>
+<td>56.7</td>
+<td>29.8</td>
+<td>74.8</td>
 </tr>
 <tr>
 <td>GSM8K (8-shot, COT)</td>
-<td><b>
-<td>
-<td>
-<td>
+<td><b>78.8</b></td>
+<td>60.8</td>
+<td>35.0</td>
+<td>78.0</td>
 </tr>
 <tr>
 <td>MATH Lvl-5 (4-shot)</td>
-<td>
+<td>14.6</td>
 <td>0.0</td>
 <td>0.0</td>
 <td><b>19.9</b></td>
@@ -160,10 +160,10 @@ We report in the following table our internal pipeline benchmarks.
 <tr>
 <td rowspan="5">Reasoning</td>
 <td>Arc Challenge (25-shot)</td>
-<td>
-<td>
-<td>
-<td
+<td>50.9</td>
+<td>55.0</td>
+<td><b>56.2</b></td>
+<td>55.5</td>
 </tr>
 <tr>
 <td>GPQA (0-shot)</td>
@@ -181,16 +181,16 @@ We report in the following table our internal pipeline benchmarks.
 </tr>
 <tr>
 <td>MUSR (0-shot)</td>
-<td>
+<td>35.0</td>
 <td><b>40.2</b></td>
-<td>38.
+<td>38.7</td>
 <td>39.0</td>
 </tr>
 <tr>
 <td>BBH (3-shot)</td>
-<td>
-<td>44.
-<td>
+<td>41.8</td>
+<td>44.5</td>
+<td>39.5</td>
 <td><b>45.4</b></td>
 </tr>
 <tr>