shamz15531 committed on
Commit a3ddb25 · verified · 1 Parent(s): fc4d957

Update README.md

Files changed (1)
  1. README.md +21 -30
README.md CHANGED
@@ -5,16 +5,18 @@ language:
  - en
  pipeline_tag: text-generation
  tags:
- - pytorch
+ - pytorch
  library_name: transformers
+ base_model:
+ - QCRI/Fanar-1-9B
  ---
  # Fanar-1-9B-Instruct

- **Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. It is the instruction-tuned version of [Fanar-1-9B](). Fanar continually pretrains the `google/gemma-2-9b` model on 1T Arabic and English tokens. Fanar pays particular attention to the richness of the Arabic language by supporting a diverse set of Arabic dialects including Modern Standard Arabic (MSA), Levantine, and Egyptian. Fanar, through meticulous curation of the pretraining and instruction-tuning data, is aligned with Arab cultural values.
+ **Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. It is the instruction-tuned version of [Fanar-1-9B](https://huggingface.co/QCRI/Fanar-1-9B), built by continually pretraining the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects, including Levantine and Egyptian. Through meticulous curation of the pretraining and instruction-tuning data, Fanar models are aligned with Islamic values and Arab cultures.

  **Fanar-1-9B-Instruct** is a core component within the [Fanar GenAI platform](https://fanar.qa/) that offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS) and automatic-speech-recognition (ASR), attribution and fact-checking, Islamic RAG, among several other features.

- We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding FANAR. We also provide an API to the model and our GenAI platform (request access [here](https://api.fanar.qa/request/en)).
+ We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding our Fanar GenAI platform. We also provide an API to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).

  ---

@@ -22,13 +24,13 @@ We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) wit

  | Attribute | Value |
  |---------------------------|------------------------------------|
- | Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) and [HBKU](https://www.hbku.edu.qa/) |
- | Sponsored by | [MCIT](https://www.mcit.gov.qa/en/)
+ | Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
+ | Sponsored by | [Ministry of Communications and Information Technology (MCIT), State of Qatar](https://www.mcit.gov.qa/en/) |
  | Model Type | Autoregressive Transformer |
  | Parameter Count | 8.7 Billion |
  | Context Length | 4096 Tokens |
- | Input | text only |
- | Output | text only |
+ | Input | Text only |
+ | Output | Text only |
  | Training Framework | [LitGPT](https://github.com/Lightning-AI/litgpt) |
  | Pretraining Token Count | 1 Trillion (ar + en) |
  | SFT Instructions | 4.5M |
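
Given the `transformers` library tag and the 4096-token context above, a minimal generation sketch follows. It assumes the instruct checkpoint is published under the repo id `QCRI/Fanar-1-9B-Instruct` and ships a chat template, as is typical for gemma-2 derivatives; neither detail is confirmed by the card.

```python
# Minimal generation sketch (assumptions: repo id QCRI/Fanar-1-9B-Instruct
# and a bundled chat template; both inferred from the card, not confirmed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "QCRI/Fanar-1-9B-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 halves memory for the 8.7B weights
    device_map="auto",
)

# Instruction-tuned model: format the prompt with the chat template.
messages = [{"role": "user", "content": "ما هي عاصمة قطر؟"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Keep prompt plus generation within the 4096-token context listed above.
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
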
@@ -87,56 +89,45 @@ Fanar-1-9B-Instruct is built for:

  - Conversational agents (Arabic only or bilingual)
  - Cultural and dialectal question answering in Arabic
- - Educational, governmental, and civic NLP applications in the Arab world
+ - Educational, governmental, and civic NLP applications focused on the Arab world or Arabic-speaking audiences
  - Research on Arabic natural language generation and understanding

  Fanar is intended to be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content**.

- A version of this model is currently deployed as part of a real-world system at [chat.fanar.qa](chat.fanar.qa).
+ A version of this model can be accessed through [Fanar Chat](https://chat.fanar.qa). We are continuously improving Fanar's models and capabilities, so answers there may differ from those of Fanar-1-9B-Instruct.

  ---

  ## Ethical Considerations & Limitations

- Fanar is capable of generating fluent and contextually appropriate responses, but as with any generative model there are uncertainities. The model may produce **biased, offensive, or incorrect outputs**. The model should It is **not suitable for high-stakes decision making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar and attempted to mitigate these issues, we cannot redress every possible scenario. Thus, we advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service]( https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).
+ Fanar is capable of generating fluent and contextually appropriate responses, but as with any generative model there are uncertainties. The model may produce **biased, offensive, or incorrect outputs**. The model is **not suitable for high-stakes decision making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar and attempted to mitigate these issues, we cannot redress every possible scenario. Thus, we advise developers to implement safety checks (a minimal pattern is sketched below) and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).

- The output generated by this model is not considered a statement of QCRI, HBKU, or any other organization or individual.
+ The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.

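
One way to act on the safety-check advice above is to screen generations before returning them. The sketch below is only an illustrative pattern: `unitary/toxic-bert` is an arbitrary public English classifier standing in for whatever Arabic-capable, domain-specific moderation a real deployment would need.

```python
# Illustrative output-screening wrapper; the moderation model here is a
# placeholder (English-only) and is not part of the Fanar release.
from transformers import pipeline

moderator = pipeline("text-classification", model="unitary/toxic-bert")

def screen(response: str, threshold: float = 0.5) -> str:
    """Return the response, or a refusal if the screen flags it as toxic."""
    verdict = moderator(response[:512])[0]  # truncate to the classifier's window
    if verdict["label"] == "toxic" and verdict["score"] > threshold:
        return "[response withheld by safety filter]"
    return response
```
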
  ---

  ## Evaluation

- Evaluation results for Fanar-1-9B-Instruct will be released soon across Arabic and English benchmarks including:
-
- - Arabic MMLU
- - Ar-IFEval
- - TruthfulQA
- - HellaSwag
- - ARC (Easy/Challenge)
- - OpenBookQA
-
  Evaluation was conducted using a modified version of the LM Evaluation Harness and internal cultural alignment benchmarks.

  <div style="overflow-x: auto;">

  | Model | MMLU (5-shot) | MMMLU (Arabic) (0-shot) | ArabicMMLU (3-shot) | HellaSwag (0-shot) | PIQA (0-shot) | ARC Challenge (0-shot) | Belebele (Arabic) (3-shot) | ACVA (5-shot) | GSM8k | OALL (0-shot) | OALL v2 (0-shot) | Almieyar Arabic (3-shot) | Arab Cultural MCQ (3-shot) | AraDiCE PIQA (MSA) (0-shot) | AraDiCE PIQA (Egy) (0-shot) | AraDiCE PIQA (Lev) (0-shot) | AraDiCE ArabicMMLU (Egy) (0-shot) | AraDiCE ArabicMMLU (Lev) (0-shot) |
  |-------|----------------|--------------------------|----------------------|--------------------|---------------|-------------------------|------------------------------|---------------|--------|----------------|------------------|---------------------------|-----------------------------|-------------------------------|------------------------------|------------------------------|-----------------------------------|-----------------------------------|
- | Fanar-1-9B-it | 71.53% | **58.89%** | **67.69%** | **83.16%** | 82.54% | **67.15%** | **83.22%** | 80.02% | **74.60%** | **68.32%** | **66.29%** | **78.68%** | 72.40% | **67.68%** | **63.66%** | 59.03% | 59.63% | 60.62% |
- | ALLaM-7B-Instruct-preview | 60.72% | 54.89% | 68.59% | 76.35% | 80.52% | 51.62% | 75.80% | 74.52% | 46.63% | 57.31% | 63.66% | 76.31% | **74.20%** | 67.52% | 63.44% | **60.88%** | **62.50%** | **64.17%** |
- | aya-expanse-8b | 62.85% | 47.14% | 60.10% | 78.54% | **81.18%** | 56.40% | 70.78% | 77.11% | 8.26% | 53.18% | 59.74% | 70.20% | 67.30% | 63.00% | 59.41% | 56.53% | 53.52% | 53.71% |
+ | Fanar-1-9B-it | 71.53% | **58.89%** | 67.69% | **83.16%** | **82.54%** | **67.15%** | **83.22%** | 80.02% | **74.60%** | **68.32%** | **66.29%** | **78.68%** | 72.40% | **67.68%** | **63.66%** | 59.03% | 59.63% | 60.62% |
+ | ALLaM-7B-Instruct-preview | 60.72% | 54.89% | **68.59%** | 76.35% | 80.52% | 51.62% | 75.80% | 74.52% | 46.63% | 57.31% | 63.66% | 76.31% | **74.20%** | 67.52% | 63.44% | **60.88%** | **62.50%** | **64.17%** |
+ | aya-expanse-8b | 62.85% | 47.14% | 60.10% | 78.54% | 81.18% | 56.40% | 70.78% | 77.11% | 8.26% | 53.18% | 59.74% | 70.20% | 67.30% | 63.00% | 59.41% | 56.53% | 53.52% | 53.71% |
  | c4ai-command-r7b-arabic-02-2025 | 66.91% | 49.54% | 63.06% | 74.67% | 78.02% | 49.15% | 72.78% | 79.80% | 30.33% | 49.38% | 64.44% | 73.82% | 69.20% | 62.30% | 60.99% | 56.69% | 54.78% | 56.06% |
  | AceGPT-v2-8B-Chat | 66.45% | 51.16% | 62.61% | 79.21% | 80.58% | 53.50% | 74.56% | 77.66% | 41.77% | 50.16% | 60.40% | 74.31% | 68.90% | 64.58% | 61.32% | 56.91% | 54.53% | 53.91% |
  | gemma-2-9b-it | 71.65% | 57.93% | 64.16% | 79.06% | 79.38% | 63.99% | 78.31% | **80.67%** | 60.95% | 56.11% | 64.21% | 73.69% | 68.60% | 61.26% | 59.96% | 57.24% | 57.95% | 59.25% |
  | jais-adapted-13b-chat | 56.64% | 44.45% | 58.97% | 80.86% | 80.47% | 54.27% | 67.52% | 75.24% | 44.05% | 46.41% | 56.56% | 65.46% | 65.30% | 61.10% | 58.05% | 55.77% | 52.87% | 53.59% |
  | jais-family-6p7b-chat | 49.42% | 41.59% | 55.80% | 72.04% | 74.05% | 44.62% | 65.11% | 72.04% | 53.68% | 48.20% | 54.73% | 61.72% | 64.10% | 62.51% | 60.12% | 57.24% | 49.11% | 47.49% |
- | Llama-3.1-8B-Instruct | 68.04% | 47.58% | 59.05% | 79.22% | 80.74% | 55.29% | 66.72% | 76.67% | 29.26% | 47.81% | TBD | 69.70% | 66.10% | 58.11% | 55.39% | 54.24% | 46.86% | 47.52% |
+ | Llama-3.1-8B-Instruct | 68.04% | 47.58% | 59.05% | 79.22% | 80.74% | 55.29% | 66.72% | 76.67% | 29.26% | 47.81% | 55.97% | 69.70% | 66.10% | 58.11% | 55.39% | 54.24% | 46.86% | 47.52% |
  | Qwen2.5-7B-Instruct | **74.21%** | 55.63% | 63.96% | 80.44% | 79.92% | 55.03% | 74.61% | 78.09% | 71.34% | 54.19% | 62.69% | 75.69% | 68.10% | 60.55% | 58.65% | 56.04% | 48.74% | 53.42% |

  </div>
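
The scores above come from a modified LM Evaluation Harness plus internal benchmarks, so they are not exactly reproducible from this card. For a rough comparison on the public 0-shot tasks, the stock harness can be driven as below (repo id assumed, as before).

```python
# Rough reproduction with the stock harness (pip install lm-eval); expect
# deviations from the table, which used a modified harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=QCRI/Fanar-1-9B-Instruct,dtype=bfloat16",  # assumed repo id
    tasks=["hellaswag", "piqa", "arc_challenge"],  # public 0-shot tasks from the table
    num_fewshot=0,
)
print(results["results"])
```
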


-
-
  ---

  ## Citation
@@ -146,7 +137,7 @@ If you use Fanar in your research or applications, please cite:
  ```bibtex
  @misc{fanarllm2025,
  title={Fanar: An Arabic-Centric Multimodal Generative AI Platform},
- author={Fanar Team},
+ author={Fanar Team and Ummar Abbas and Mohammad Shahmeer Ahmad and Firoj Alam and Enes Altinisik and Ehsannedin Asgari and Yazan Boshmaf and Sabri Boughorbel and Sanjay Chawla and Shammur Chowdhury and Fahim Dalvi and Kareem Darwish and Nadir Durrani and Mohamed Elfeky and Ahmed Elmagarmid and Mohamed Eltabakh and Masoomali Fatehkia and Anastasios Fragkopoulos and Maram Hasanain and Majd Hawasly and Mus'ab Husaini and Soon-Gyo Jung and Ji Kim Lucas and Walid Magdy and Safa Messaoud and Abubakr Mohamed and Tasnim Mohiuddin and Basel Mousi and Hamdy Mubarak and Ahmad Musleh and Zan Naeem and Mourad Ouzzani and Dorde Popovic and Amin Sadeghi and Husrev Taha Sencar and Mohammed Shinoy and Omar Sinan and Yifan Zhang and Ahmed Ali and Yassine El Kheir and Xiaosong Ma and Chaoyi Ruan},
  year={2025},
  url={https://arxiv.org/abs/2501.13944},
  }
@@ -156,12 +147,12 @@ If you use Fanar in your research or applications, please cite:

  ## Acknowledgements

- This project is an effort by the [Qatar Computing Research Institute (QCRI)](https://qcri.org), part of [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa). We thank our engineering and research teams for their efforts in advancing Arabic-centric large language models.
- We thank the [MCIT](https://www.mcit.gov.qa/en/) for sponsoring this project and their continued support by providing the compute infrastructure.
+ This project is an effort of the [Qatar Computing Research Institute (QCRI)](https://qcri.org) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.
+ Special thanks to the [Ministry of Communications and Information Technology (MCIT), State of Qatar](https://www.mcit.gov.qa/en/) for their continued support by providing the compute infrastructure through the Google Cloud Platform.


  ---

  ## License

- This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
+ This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
