base_model:
- QCRI/Fanar-1-9B
---


# Fanar-1-9B-Instruct

**Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. It is the instruction-tuned version of [Fanar-1-9B](https://huggingface.co/QCRI/Fanar-1-9B). We continually pretrain the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects, including Gulf, Levantine, and Egyptian. Fanar models, through meticulous curation of the pretraining and instruction-tuning data, are aligned with Islamic values and Arab cultures.

**Fanar-1-9B-Instruct** is a core component of the [Fanar GenAI platform](https://fanar.qa/), which offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS), automatic speech recognition (ASR), attribution and fact-checking, and Islamic RAG, among several other features.

We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding our Fanar GenAI platform. We also provide an API to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).
| Attribute                 | Value                              |
|---------------------------|------------------------------------|
| Developed by              | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
| Sponsored by              | [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) |
| Model Type                | Autoregressive Transformer         |
| Parameter Count           | 8.7 Billion                        |
| Context Length            | 4096 Tokens                        |
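The model can be run with the standard Hugging Face `transformers` chat-template API. The snippet below is a minimal sketch, not an official quick-start from this card: the generation settings and the example prompt are illustrative choices.

```python
# Minimal inference sketch for Fanar-1-9B-Instruct with Hugging Face
# `transformers`. Generation settings and the prompt are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "QCRI/Fanar-1-9B-Instruct"

def build_messages(user_prompt: str) -> list[dict]:
    """Wrap a single user turn in the chat-message format expected by
    `tokenizer.apply_chat_template`."""
    return [{"role": "user", "content": user_prompt}]

def main() -> None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = build_messages("ما هي عاصمة قطر؟")  # "What is the capital of Qatar?"
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Keep prompt + completion inside the model's 4096-token context window.
    output_ids = model.generate(input_ids, max_new_tokens=256)
    # Decode only the newly generated tokens, not the echoed prompt.
    print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```

`device_map="auto"` assumes the `accelerate` package is installed; on CPU-only machines, drop it and expect slow generation for a 9B model.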
## Model Training

#### Pretraining

Fanar-1-9B-Instruct was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.
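The stated token counts work out to roughly a 50/40/10 English/Arabic/code split; a quick sanity check of the arithmetic:

```python
# Approximate pretraining mixture from the token counts stated above,
# in billions of tokens ("1T" is the rounded total).
mixture_b = {"english": 515, "arabic": 410, "code": 102}

total_b = sum(mixture_b.values())  # 1027B, i.e. roughly the stated 1T
shares = {name: count / total_b for name, count in mixture_b.items()}

print(total_b)
print({name: round(share, 2) for name, share in shares.items()})
# → {'english': 0.5, 'arabic': 0.4, 'code': 0.1}
```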
#### Post-training

Fanar-1-9B-Instruct underwent a two-phase post-training pipeline:
- Conversational agents (Arabic only or bilingual)
- Cultural and dialectal question answering in Arabic
- Educational, governmental, and civic NLP applications focused on the Arab world or Arabic-speaking audiences
- Research on Arabic natural language generation and understanding

Fanar-1-9B-Instruct can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content.**

A version of this model can be accessed through [Fanar Chat](https://chat.fanar.qa). We are continuously improving Fanar's models and capabilities, so its answers may differ from those of Fanar-1-9B-Instruct.
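One lightweight pattern for the safeguards encouraged above is to screen requests before they ever reach the model. The sketch below is purely hypothetical: the blocklist and the `generate_fn` hook are illustrative stand-ins, not part of Fanar or its API, and a production deployment would use a proper moderation service instead of substring matching.

```python
# Hypothetical pre-generation safeguard: screen a user request against a
# blocklist before calling the model. The terms and the generate_fn hook
# are illustrative stand-ins, not part of Fanar or its API.
from typing import Callable

BLOCKED_TERMS = {"make a weapon", "credit card dump"}  # stand-in policy

REFUSAL = "This request cannot be processed under the content policy."

def moderated_generate(prompt: str, generate_fn: Callable[[str], str]) -> str:
    """Call generate_fn only if the prompt passes the blocklist screen."""
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return REFUSAL
    return generate_fn(prompt)
```

The same wrapper shape can be applied on the output side, passing the model's response through a classifier before returning it to the user.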
## Ethical Considerations & Limitations

Fanar-1-9B-Instruct is capable of generating fluent and contextually appropriate responses. However, as with any generative model, its outputs carry inherent uncertainty: the model may produce **biased, offensive, or incorrect outputs**, and it is **not suitable for high-stakes decision-making** (e.g., legal, medical, or financial advice). Although we have extensively tested Fanar-1-9B-Instruct and attempted to mitigate these issues, we cannot address every possible scenario. We therefore advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).

The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT or any other organization or individual.
## Acknowledgements

This project is from [Qatar Computing Research Institute (QCRI)](https://qcri.org) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.

Special thanks to the [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support by providing the compute infrastructure through the Google Cloud Platform.

---