shamz15531 committed on
Commit
6e5d11e
·
verified ·
1 Parent(s): e7018d4

Update README.md

Files changed (1)
  1. README.md +11 -8
README.md CHANGED
@@ -10,11 +10,14 @@ library_name: transformers
  base_model:
  - QCRI/Fanar-1-9B
  ---

  # Fanar-1-9B-Instruct

- **Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. It is the instruction-tuned version of [Fanar-1-9B](https://huggingface.co/QCRI/Fanar-1-9B). We continually pretrains the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects including Levantine and Egyptian. Fanar models, through meticulous curation of the pretraining and instruction-tuning data, are aligned with Islamic values and Arab cultures.

- **Fanar-1-9B-Instruct** is a core component within the [Fanar GenAI platform](https://fanar.qa/) that offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS) and automatic-speech-recognition (ASR), attribution and fact-checking, Islamic RAG, among several other features.

  We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding our Fanar GenAI platform. We also provide an API to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).
 
@@ -25,7 +28,7 @@ We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) wit
  | Attribute | Value |
  |---------------------------|------------------------------------|
  | Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
- | Sponsored by | [Minisitry of Communications and Technology, State of Qatar](https://www.mcit.gov.qa/en/)
  | Model Type | Autoregressive Transformer |
  | Parameter Count | 8.7 Billion |
  | Context Length | 4096 Tokens |
@@ -45,7 +48,7 @@ We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) wit
  ## Model Training

  #### Pretraining
- Fanar-1-9B-Instruct was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and flitered from a variety of sources, 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.

  #### Post-training
  Fanar-1-9B-Instruct underwent a two-phase post-training pipeline:
@@ -108,10 +111,10 @@ Fanar-1-9B-Instruct is built for:
  - Conversational agents (Arabic only or bilingual)
  - Cultural and dialectal question answering in Arabic
- - Educational, governmental, and civic NLP applications focussed on the Arab world or Arabic-speaking audiences
  - Research on Arabic natural language generation and understanding

- Fanar-1-9B-Instruct can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content**

  A version of this model can be accessed through [Fanar Chat](https://chat.fanar.qa). We are continuously improving Fanar's models and capabilities, and answers can differ from what you get from Fanar-1-9B-Instruct.
 
@@ -119,7 +122,7 @@ A version of this model can be accessed through [Fanar Chat](chat.fanar.qa). We
  ## Ethical Considerations & Limitations

- Fanar-1-9B-Instruct is capable of generating fluent and contextually appropriate responses, but as with any generative model there are uncertainities. The model may produce **biased, offensive, or incorrect outputs**. The model is **not suitable for high-stakes decision making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar-1-9B-Instruct and attempted to mitigate these issues, we cannot redress every possible scenario. Thus, we advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service]( https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).

  The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.
 
@@ -167,7 +170,7 @@ If you use Fanar-1-9B-Instruct or the Fanar GenAI system in your research or app
  ## Acknowledgements

  This project is from [Qatar Computing Research Institute (QCRI)](https://qcri.org) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.
- Special thanks to the [Minisitry of Communications and Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support by providing the compute infrastructure through the Google Cloud Platform.

  ---
 
  base_model:
  - QCRI/Fanar-1-9B
  ---
+
+ ![Model Logo](./fanar_logo.jpg)
+
  # Fanar-1-9B-Instruct

+ **Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. It is the instruction-tuned version of [Fanar-1-9B](https://huggingface.co/QCRI/Fanar-1-9B). We continually pretrain the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects, including Gulf, Levantine, and Egyptian. Through meticulous curation of the pretraining and instruction-tuning data, Fanar models are aligned with Islamic values and Arab cultures.

+ **Fanar-1-9B-Instruct** is a core component of the [Fanar GenAI platform](https://fanar.qa/), which offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS), automatic speech recognition (ASR), attribution and fact-checking, and Islamic RAG, among several other features.
  We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding our Fanar GenAI platform. We also provide an API to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).
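Since the card declares `library_name: transformers`, the model should load through the standard `transformers` chat interface. The sketch below is illustrative only: the repo id `QCRI/Fanar-1-9B-Instruct` is assumed from the base model's naming, and the prompt and generation settings are examples, not part of this card.

```python
# Minimal usage sketch for Fanar-1-9B-Instruct via Hugging Face transformers.
# The repo id, prompt, and generation settings are illustrative assumptions.

MODEL_ID = "QCRI/Fanar-1-9B-Instruct"

# A chat is a list of {"role", "content"} dicts, as expected by chat templates.
messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},  # "What is the capital of Qatar?"
]

def generate_reply(messages, max_new_tokens=256):
    # Heavy imports deferred so the sketch can be read without the model installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    # Apply the model's chat template, then generate a continuation.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens (skip the echoed prompt).
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Note the 4096-token context length from the table below when sizing prompts plus `max_new_tokens`.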
 
 
  | Attribute | Value |
  |---------------------------|------------------------------------|
  | Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
+ | Sponsored by | [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) |
  | Model Type | Autoregressive Transformer |
  | Parameter Count | 8.7 Billion |
  | Context Length | 4096 Tokens |
 
  ## Model Training

  #### Pretraining
+ Fanar-1-9B-Instruct was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.

  #### Post-training
  Fanar-1-9B-Instruct underwent a two-phase post-training pipeline:
 
  - Conversational agents (Arabic only or bilingual)
  - Cultural and dialectal question answering in Arabic
+ - Educational, governmental, and civic NLP applications focused on the Arab world or Arabic-speaking audiences
  - Research on Arabic natural language generation and understanding

+ Fanar-1-9B-Instruct can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content.**

  A version of this model can be accessed through [Fanar Chat](https://chat.fanar.qa). We are continuously improving Fanar's models and capabilities, and answers can differ from what you get from Fanar-1-9B-Instruct.
 
 
  ## Ethical Considerations & Limitations

+ Fanar-1-9B-Instruct is capable of generating fluent and contextually appropriate responses. However, as with any generative model, there are uncertainties: the model may produce **biased, offensive, or incorrect outputs**. It is **not suitable for high-stakes decision-making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar-1-9B-Instruct and attempted to mitigate these issues, we cannot redress every possible scenario, so we advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).

  The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.
 
 
  ## Acknowledgements

  This project is from [Qatar Computing Research Institute (QCRI)](https://qcri.org) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.
+ Special thanks to the [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support in providing the compute infrastructure through the Google Cloud Platform.

  ---