Update README.md
README.md (CHANGED)
```diff
@@ -16,7 +16,11 @@ tags:
 
 
 # TL;DR
-
+The Falcon3 family of Open Foundation Models is a set of pretrained and instruct LLMs ranging from 1B to 10B parameters.
+
+It achieves state-of-the-art results on reasoning, language understanding, instruction following, code, and mathematics tasks.
+
+It supports a context length of up to 32K.
 
 This repository contains Falcon3-7B-Instruct, the best instruct LLM under 8B parameters at the time of release.
 
```
```diff
@@ -32,6 +36,10 @@ This repository contains Falcon3-7B-Instruct, the best instruct LLM under 8B
 
 <br>
 
+## Model Architecture
+Falcon3 uses grouped-query attention (GQA) for faster inference and a wider head dimension of 256.
+A high RoPE base value is used to support long-context understanding.
+
 # Usage
 
 Find below an example of how to use the model in `transformers` (make sure to have the latest version of `transformers`, or one built from source):
```
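The architecture note added above is easy to check against the released checkpoint. Below is a minimal sketch that reads those claims (GQA, head dimension 256, high RoPE base, 32K context) out of the model config; it assumes Falcon3 ships a Llama-style config in `transformers`, so treat the exact field names as assumptions rather than guarantees.

```python
# Sketch: inspect the architecture claims from the model config.
# Assumes Falcon3 exposes Llama-style config fields in `transformers`.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/Falcon3-7B-Instruct")

print(config.num_attention_heads)                        # query heads
print(config.num_key_value_heads)                        # fewer KV heads than query heads => GQA
print(config.hidden_size // config.num_attention_heads)  # head dimension (256 per the card)
print(config.rope_theta)                                 # high RoPE base for long-context support
print(config.max_position_embeddings)                    # should reflect the 32K context window
```

Fewer KV heads than query heads shrinks the KV cache, which is where the GQA inference speedup comes from.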
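The usage example the next hunk header anchors on (`print(response)`) sits between these two hunks and is not shown in the diff. For orientation, here is a minimal sketch of standard `transformers` chat usage for an instruct model; the prompt and generation settings are illustrative, not the card's exact code.

```python
# Sketch of standard chat-style inference; not the card's exact example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/Falcon3-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain grouped-query attention in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```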
```diff
@@ -80,96 +88,7 @@ print(response)
 
 </details>
 
-
-# Training Details
-Based on `tiiuae/Falcon3-7B-Base`, the post-training stage comprised supervised finetuning followed by human preference alignment (DPO).
-
-## Supervised finetuning
-### Training Data
-1.2 million diverse, high-quality samples from Tulu-3, Open-Hermes, Numina, and Apigen.
-
-| Data type                           | Ratio |
-|-------------------------------------|-------|
-| Conversations                       | 32%   |
-| STEM                                | 32%   |
-| Code                                | 12%   |
-| Safety                              | 9.1%  |
-| Multilingual                        | 8.3%  |
-| Function call                       | 3.3%  |
-| NLP (summarization, generation, QA) | 3.2%  |
-
-#### Training Hyperparameters
-
-| Group         | Hyperparameter | Value        |
-|---------------|----------------|--------------|
-| AdamW         | β1             | 0.9          |
-| AdamW         | β2             | 0.999        |
-| AdamW         | weight decay   | 0.01         |
-| Learning rate | type           | linear decay |
-| Learning rate | init lr        | 5e-6         |
-| Learning rate | final lr       | 0            |
-| Learning rate | warmup ratio   | 0.03         |
-| Batch size    |                | 64           |
-| Epochs        |                | 2            |
-
-## Human preference alignment - DPO
-
-### Training Data
-TODO
-
-#### Training Hyperparameters
-TODO
-
-
-# Evaluation
+# Benchmarks
 We report our internal pipeline benchmarks in the following table:
 
 
```
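The removed SFT hyperparameter table maps directly onto `transformers` `TrainingArguments`. The sketch below shows that mapping; the output directory and the split of the effective batch size of 64 across devices and gradient accumulation are assumptions, not the team's actual training script.

```python
# Sketch: the removed SFT hyperparameters expressed as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="falcon3-7b-sft",     # assumed; not specified in the card
    learning_rate=5e-6,              # init lr, decayed linearly to 0
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    adam_beta1=0.9,
    adam_beta2=0.999,
    weight_decay=0.01,
    per_device_train_batch_size=8,   # assumed split: 8 x 8 accumulation = 64 effective
    gradient_accumulation_steps=8,
    num_train_epochs=2,
)
```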
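For the DPO stage, the removed text named only the method; its data and hyperparameters were still placeholders. The sketch below shows the generic shape of such a run with `trl`'s `DPOTrainer`; the dataset, `beta`, and argument names (which shift between `trl` releases) are all assumptions rather than the card's setup.

```python
# Sketch: a generic DPO run with trl; none of this reflects the team's setup.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("tiiuae/Falcon3-7B-Base")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon3-7B-Base")

# Any preference dataset with "prompt"/"chosen"/"rejected" columns works here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="falcon3-7b-dpo", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl releases
)
trainer.train()
```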