Commit 0ab4626 (parent: 4ad5a83): Update README.md

README.md (changed)
**Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) and [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/). It is the instruction-tuned version of [Fanar-1-9B](). Built on top of `google/gemma-2-9b`, Fanar is further pretrained on 1T Arabic and English tokens. Fanar pays particular attention to the richness of the Arabic language, supporting a diverse set of Arabic dialects including Modern Standard Arabic (MSA), Levantine, and Egyptian. Through meticulous curation of the pretraining and instruction-tuning data, Fanar is aligned with Arab cultural values.

We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding Fanar.
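As a minimal sketch of how a single user turn might be formatted for the instruct model: since Fanar is built on `google/gemma-2-9b`, the code below *assumes* it inherits the Gemma-2 chat template; verify against the tokenizer's `chat_template` on the Hub, and in practice prefer `tokenizer.apply_chat_template` from `transformers`.

```python
# Hedged sketch: manual prompt formatting, assuming the Gemma-2 chat template
# (inherited from the google/gemma-2-9b base). Check the actual tokenizer's
# chat_template before relying on this exact string layout.
def format_turn(user_message: str) -> str:
    """Wrap one user message in Gemma-2-style turn markers."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_turn("ما هي عاصمة قطر؟")  # "What is the capital of Qatar?"
```

The trailing `<start_of_turn>model\n` leaves the prompt open for the model to generate its reply.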
---
## Model Details
## Model Training
### Pretraining
Fanar was continually pretrained on 1T tokens with a balanced focus on Arabic and English: 450B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset; 450B Arabic tokens that we collected, parsed, and filtered from a variety of sources; and 100B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our training codebase is built on the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.
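The 1T-token budget above can be sanity-checked with quick arithmetic (the dict keys are shorthand labels for this sketch, not official split names):

```python
# Continual-pretraining token budget, with the counts stated in the text.
TOKENS = {
    "english (Dolma subset)": 450_000_000_000,
    "arabic (in-house corpus)": 450_000_000_000,
    "code (The Stack)": 100_000_000_000,
}

total = sum(TOKENS.values())  # should come to 1T tokens overall
shares = {name: count / total for name, count in TOKENS.items()}
```

This confirms the stated mix: 45% English, 45% Arabic, and 10% code.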
### Post-training
Fanar underwent a two-phase post-training pipeline:
| Phase | Method | Size |
| --- | --- | --- |