Update README.md
README.md (CHANGED)
```diff
@@ -16,7 +16,11 @@ tags:
 
 
 # TL;DR
-
+The Falcon3 family of Open Foundation Models is a set of pretrained and instruct LLMs ranging from 1B to 10B parameters.
+
+It achieves state-of-the-art results on reasoning, language understanding, instruction following, code, and mathematics tasks.
+
+It supports a context length of up to 32K.
 
 This repository contains Falcon3-7B-Instruct, the best instruct LLM under 8B parameters at the time of release.
 
```
```diff
@@ -32,6 +36,10 @@ This repository contains Falcon3-7B-Instruct, the best instruct LLM under 8B
 
 <br>
 
+## Model Architecture
+Falcon3 uses grouped-query attention (GQA) for faster inference and a wider head dimension of 256.
+A high RoPE base value is used to support long-context understanding.
+
 # Usage
 
 Find below an example of how to use the model in `transformers` (make sure to have the latest version of `transformers`, or one built from source):
```
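The architecture note added above is easy to check against the released checkpoint. Below is a minimal sketch that reads those claims (GQA, head dimension 256, high RoPE base, 32K context) out of the model config; it assumes Falcon3 ships a Llama-style config in `transformers`, so treat the exact field names as assumptions rather than guarantees.

```python
# Sketch: inspect the architecture claims from the model config.
# Assumes Falcon3 exposes Llama-style config fields in `transformers`.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/Falcon3-7B-Instruct")

print(config.num_attention_heads)                        # query heads
print(config.num_key_value_heads)                        # fewer KV heads than query heads => GQA
print(config.hidden_size // config.num_attention_heads)  # head dimension (256 per the card)
print(config.rope_theta)                                 # high RoPE base for long-context support
print(config.max_position_embeddings)                    # should reflect the 32K context window
```

Fewer KV heads than query heads shrinks the KV cache, which is where the GQA inference speedup comes from.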
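The usage example the next hunk header anchors on (`print(response)`) sits between these two hunks and is not shown in the diff. For orientation, here is a minimal sketch of standard `transformers` chat usage for an instruct model; the prompt and generation settings are illustrative, not the card's exact code.

```python
# Sketch of standard chat-style inference; not the card's exact example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/Falcon3-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain grouped-query attention in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```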
```diff
@@ -80,96 +88,7 @@ print(response)
 
 </details>
 
-
-# Training Details
-Based on `tiiuae/Falcon3-7B-Base`, the post-training stage comprised supervised finetuning followed by human preference alignment (DPO).
-
-## Supervised finetuning
-### Training Data
-1.2 million diverse, high-quality samples from Tulu-3, Open-Hermes, Numina, and Apigen.
-
-| Data type                           | Ratio |
-|-------------------------------------|-------|
-| Conversations                       | 32%   |
-| STEM                                | 32%   |
-| Code                                | 12%   |
-| Safety                              | 9.1%  |
-| Multilingual                        | 8.3%  |
-| Function call                       | 3.3%  |
-| NLP (summarization, generation, QA) | 3.2%  |
-
-#### Training Hyperparameters
-
-| Group         | Hyperparameter | Value        |
-|---------------|----------------|--------------|
-| AdamW         | β1             | 0.9          |
-| AdamW         | β2             | 0.999        |
-| AdamW         | weight decay   | 0.01         |
-| Learning rate | type           | linear decay |
-| Learning rate | init lr        | 5e-6         |
-| Learning rate | final lr       | 0            |
-| Learning rate | warmup ratio   | 0.03         |
-| Batch size    |                | 64           |
-| Epochs        |                | 2            |
-
-## Human preference alignment - DPO
-
-### Training Data
-TODO
-
-#### Training Hyperparameters
-TODO
-
-
-# Evaluation
+# Benchmarks
 We report our internal pipeline benchmarks in the following table:
 
 
```
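The removed SFT hyperparameter table maps directly onto `transformers` `TrainingArguments`. The sketch below shows that mapping; the output directory and the split of the effective batch size of 64 across devices and gradient accumulation are assumptions, not the team's actual training script.

```python
# Sketch: the removed SFT hyperparameters expressed as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="falcon3-7b-sft",     # assumed; not specified in the card
    learning_rate=5e-6,              # init lr, decayed linearly to 0
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    adam_beta1=0.9,
    adam_beta2=0.999,
    weight_decay=0.01,
    per_device_train_batch_size=8,   # assumed split: 8 x 8 accumulation = 64 effective
    gradient_accumulation_steps=8,
    num_train_epochs=2,
)
```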
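For the DPO stage, the removed text named only the method; its data and hyperparameters were still placeholders. The sketch below shows the generic shape of such a run with `trl`'s `DPOTrainer`; the dataset, `beta`, and argument names (which shift between `trl` releases) are all assumptions rather than the card's setup.

```python
# Sketch: a generic DPO run with trl; none of this reflects the team's setup.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("tiiuae/Falcon3-7B-Base")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon3-7B-Base")

# Any preference dataset with "prompt"/"chosen"/"rejected" columns works here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="falcon3-7b-dpo", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl releases
)
trainer.train()
```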