---
license: apache-2.0
language:
- ar
- en
pipeline_tag: text-generation
tags:
- pytorch
library_name: transformers
---
# Fanar-1-9B-Instruct

**Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) and [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/). It is the instruction-tuned version of Fanar-1-9B. Built on top of `google/gemma-2-9b`, Fanar is further pretrained on 1T Arabic and English tokens. Fanar pays particular attention to the richness of the Arabic language by supporting a diverse set of Arabic dialects, including Modern Standard Arabic (MSA), Levantine, and Egyptian. Through meticulous curation of the pretraining and instruction-tuning data, Fanar is aligned with Arab cultural values.

---

## Model Details

| Attribute | Value |
|---------------------------|------------------------------------|
| Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) and [HBKU](https://www.hbku.edu.qa/) |
| Model Type | Autoregressive Transformer |
| Parameter Count | 8.7 Billion |
| Context Length | 4096 Tokens |
| Precision | bfloat16 |
| Input | Text only |
| Output | Text only |
| Training Framework | [LitGPT](https://github.com/Lightning-AI/litgpt) |
| Pretraining Token Count | 1 Trillion (ar + en) |
| SFT Instructions | 4.5M |
| DPO Preference Pairs | 250K |
| Languages | Arabic, English |
| License | Apache 2.0 |

---

## Model Training

Additional dataset and training details can be found in our [report](https://arxiv.org/pdf/2501.13944).

### Pretraining
Fanar was continually pretrained on 1T tokens with a balanced focus on Arabic and English: 450B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 450B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 100B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.
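
The stated token budget implies a simple sampling mix over the three sources. A back-of-the-envelope illustration (not training code; counts are the figures quoted above, in billions):

```python
# Pretraining data mix implied by the stated 1T-token budget (counts in billions).
token_counts = {"arabic": 450, "english": 450, "code": 100}

total = sum(token_counts.values())  # 1000B = 1T tokens
weights = {src: n / total for src, n in token_counts.items()}

for src, w in weights.items():
    print(f"{src}: {w:.0%}")
# arabic and english each account for 45% of the mix, code for 10%
```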

### Post-training
Fanar underwent a two-phase post-training pipeline:

| Phase | Method | Size |
|-------|--------|------|
| SFT | Supervised Fine-tuning | 4.5M Instructions |
| DPO | Direct Preference Optimization | 250K Preference Pairs |
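
The DPO phase trains the model to prefer the chosen response over the rejected one relative to a frozen reference model. A minimal sketch of the standard per-pair DPO objective (a generic illustration, not Fanar's actual training code; the log-probability arguments are hypothetical inputs):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid of the scaled log-ratio margin.

    pi_* / ref_*: summed log-probabilities of the chosen/rejected response
    under the policy being trained and the frozen reference model.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss falls below log(2) (the zero-margin value) once the policy
# favors the chosen response more strongly than the reference does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # positive margin -> loss < log(2)
```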

---


## Getting Started

Fanar is compatible with the Hugging Face `transformers` library (≥ v4.40.0). Here's how to load and use the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "your-org/Fanar-1-9B-Instruct"  # replace with actual HF repo

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},
]

# Render the chat into a prompt string, then tokenize it.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
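
Under the hood, `apply_chat_template` renders the messages into the base model's turn markup. A minimal sketch of a Gemma-2-style template (an assumption based on the `google/gemma-2-9b` lineage; the authoritative template ships with the released tokenizer config):

```python
# Illustration only: Gemma-2-style turn markup, assumed from the base model.
# The actual template is defined by the tokenizer's chat_template field.
def build_prompt(messages):
    prompt = "<bos>"
    for m in messages:
        role = "model" if m["role"] == "assistant" else "user"
        prompt += f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n"
    prompt += "<start_of_turn>model\n"  # generation prompt: cue the model to answer
    return prompt

print(build_prompt([{"role": "user", "content": "ما هي عاصمة قطر؟"}]))
```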

---

## Intended Use

Fanar-1-9B-Instruct is built for:

- Conversational agents (Arabic only or bilingual)
- Cultural and dialectal question answering in Arabic
- Educational, governmental, and civic NLP applications in the Arab world
- Research on Arabic instruction tuning and alignment

Fanar is intended to be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content**.

*A version of this model is currently deployed as part of a real-world system at [chat.fanar.qa](https://chat.fanar.qa).*

---

## Ethical Considerations & Limitations

Fanar is capable of generating fluent and contextually appropriate responses, but as with any generative model there are uncertainties. The model may produce **biased, offensive, or incorrect outputs**. It is **not suitable for high-stakes decision making** (e.g., legal, medical, or financial advice). Though we have extensively tested Fanar and attempted to mitigate these issues, we cannot redress every possible scenario. Thus, we advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases.

The output generated by this model is not considered a statement of QCRI, HBKU, or any other organization or individual.

---

## Evaluation

Evaluation results for Fanar-1-9B-Instruct will be released soon across Arabic and English benchmarks including:

- Arabic MMLU
- Ar-IFEval
- TruthfulQA
- HellaSwag
- ARC (Easy/Challenge)
- OpenBookQA

Evaluation was conducted using a modified version of the LM Evaluation Harness and internal cultural alignment benchmarks.

---

## Citation

If you use Fanar in your research or applications, please cite:

```bibtex
@misc{fanarllm2025,
  title={Fanar: An Arabic-Centric Multimodal Generative AI Platform},
  author={Fanar Team},
  year={2025},
  url={https://arxiv.org/abs/2501.13944},
}
```

---

## Acknowledgements

This project is a collaboration between [Qatar Computing Research Institute (QCRI)](https://qcri.org) and [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa). We thank our engineering and research teams for their efforts in advancing Arabic-centric large language models.

---

## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).