---
license: apache-2.0
language:
- ar
- en
pipeline_tag: text-generation
tags:
- pytorch
library_name: transformers
---

# Fanar-1-9B-Instruct

**Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by the [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) and [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/). It is the instruction-tuned version of Fanar-1-9B. Built on top of `google/gemma-2-9b`, Fanar is further pretrained on 1T Arabic and English tokens. Fanar pays particular attention to the richness of the Arabic language by supporting a diverse set of Arabic dialects, including Modern Standard Arabic (MSA), Levantine, and Egyptian. Through meticulous curation of the pretraining and instruction-tuning data, Fanar is aligned with Arab cultural values.

---

## Model Details

| Attribute | Value |
|---------------------------|------------------------------------|
| Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) and [HBKU](https://www.hbku.edu.qa/) |
| Model Type | Autoregressive Transformer |
| Parameter Count | 8.7 billion |
| Context Length | 4,096 tokens |
| Precision | bfloat16 |
| Input | Text only |
| Output | Text only |
| Training Framework | [LitGPT](https://github.com/Lightning-AI/litgpt) |
| Pretraining Token Count | 1 trillion (Arabic + English) |
| SFT Instructions | 4.5M |
| DPO Preference Pairs | 250K |
| Languages | Arabic, English |
| License | Apache 2.0 |

---

## Model Training

Additional dataset and training details can be found in our [report](https://arxiv.org/pdf/2501.13944).

### Pretraining

Fanar was continually pretrained on 1T tokens with a balanced focus on Arabic and English: 450B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 450B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 100B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our training codebase is built on the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.

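As a quick sanity check of the data mix described above, the per-source shares of the 1T-token budget work out as follows:

```python
# Pretraining token budget from the section above, in billions of tokens.
budget = {"english": 450, "arabic": 450, "code": 100}

total = sum(budget.values())  # 1000B, i.e. 1T tokens
shares = {source: count / total for source, count in budget.items()}

print(total)   # 1000
print(shares)  # {'english': 0.45, 'arabic': 0.45, 'code': 0.1}
```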
### Post-training

Fanar underwent a two-phase post-training pipeline:

| Phase | Size |
|-------|------|
| SFT (Supervised Fine-tuning) | 4.5M instructions |
| DPO (Direct Preference Optimization) | 250K preference pairs |

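DPO optimizes the policy directly on preference pairs against a frozen reference model. As an illustrative sketch (not Fanar's actual training code), the per-pair loss can be written in plain Python as:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy or the frozen reference model.
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy prefers the chosen response more than the reference does,
# the loss drops below log(2), its value at a zero margin.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # < math.log(2)
```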
---

## Getting Started

Fanar is compatible with the Hugging Face `transformers` library (≥ v4.40.0). Here's how to load and use the model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "your-org/Fanar-1-9B-Instruct"  # replace with actual HF repo

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},  # "What is the capital of Qatar?"
]

# Apply the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Intended Use

Fanar-1-9B-Instruct is built for:

- Conversational agents (Arabic-only or bilingual)
- Cultural and dialectal question answering in Arabic
- Educational, governmental, and civic NLP applications in the Arab world
- Research on Arabic instruction tuning and alignment

Fanar is intended to be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content**.

*A version of this model is currently deployed as part of a real-world system at [chat.fanar.qa](https://chat.fanar.qa).*

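What counts as a proper safeguard is deployment-specific. Purely as a minimal, hedged illustration (the function and blocklist below are hypothetical and not part of Fanar), a pre-generation input screen might look like:

```python
# Hypothetical, minimal input screen. Real deployments should use dedicated
# moderation models or services, not a keyword blocklist.
BLOCKLIST = {"build a bomb", "credit card dump"}  # illustrative entries only

def is_request_allowed(user_message: str) -> bool:
    """Return False if the message matches any blocked phrase."""
    text = user_message.lower()
    return not any(phrase in text for phrase in BLOCKLIST)

print(is_request_allowed("ما هي عاصمة قطر؟"))     # True
print(is_request_allowed("how to build a bomb"))  # False
```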
---

## Ethical Considerations & Limitations

Fanar is capable of generating fluent and contextually appropriate responses, but as with any generative model there are uncertainties. The model may produce **biased, offensive, or incorrect outputs**, and it is **not suitable for high-stakes decision-making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar and attempted to mitigate these issues, we cannot redress every possible scenario. We therefore advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases.

Output generated by this model is not a statement of QCRI, HBKU, or any other organization or individual.

---

## Evaluation

Evaluation results for Fanar-1-9B-Instruct will be released soon across Arabic and English benchmarks, including:

- Arabic MMLU
- Ar-IFEval
- TruthfulQA
- HellaSwag
- ARC (Easy/Challenge)
- OpenBookQA

Evaluation is conducted using a modified version of the LM Evaluation Harness and internal cultural alignment benchmarks.

---

## Citation

If you use Fanar in your research or applications, please cite:

```bibtex
@misc{fanarllm2025,
  title={Fanar: An Arabic-Centric Multimodal Generative AI Platform},
  author={Fanar Team},
  year={2025},
  url={https://arxiv.org/abs/2501.13944}
}
```

---

## Acknowledgements

This project is a collaboration between the [Qatar Computing Research Institute (QCRI)](https://qcri.org) and [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa). We thank our engineering and research teams for their efforts in advancing Arabic-centric large language models.

---

## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).