fdalvi committed on
Commit
7b00a70
·
verified ·
1 Parent(s): e829a15

Update README.md

  1. README.md +355 -3
README.md CHANGED
@@ -1,3 +1,355 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - ar
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - pytorch
8
+ - arabic-poetry
9
+ - classical-arabic
10
+ - prosody
11
+ library_name: transformers
12
+ base_model: aubmindlab/aragpt2-mega
13
+ ---
14
+
15
+ <p align="center">
16
+ <img src="./assets/fanar_logo.jpg" width="200"/>
17
+ </p>
18
+
19
+ # Fanar-2-Diwan (Arabic Poetry Generation)
20
+
21
+ **Fanar-2-Diwan** is a specialized Arabic poetry generation model developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. It is part of the **Fanar 2.0 release**, a comprehensive Arabic-centric multimodal generative AI platform that includes specialized models for [text generation](https://huggingface.co/QCRI/Fanar-2-27B-Instruct), [image generation](https://huggingface.co/QCRI/Fanar-2-Oryx-IG) and [image understanding](https://huggingface.co/QCRI/Fanar-2-Oryx-IVU).
22
+
23
+ Fanar-2-Diwan specializes in generating classical Arabic poetry adhering to the strict metrical patterns (*Buhur*), rhyme schemes (*Qafiyah*), and prosodic rules (*Arud*) established by al-Farahidi. Trained on **118K high-quality poems** with complete metadata (meter, rhyme, topic, era, poet), Fanar-2-Diwan achieves the **highest fluency score (6.88/10)** among all evaluated models and ranks second only to GPT-4o on poeticness, meaning, and coherence in human evaluation.
24
+
25
+ We have published a [report](https://arxiv.org/abs/2603.16397) with full details on the Fanar 2.0 GenAI platform. We also provide a [chat interface](https://chat.fanar.qa), mobile apps for [iOS](https://apps.apple.com/jo/app/fanar-فنار/id6741857943) and [Android](https://play.google.com/store/apps/details?id=com.fanarmobile), and [API access](https://api.fanar.qa/docs) to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).
26
+
27
+ ---
28
+
29
+ ## Model Details
30
+
31
+ | Attribute | Value |
32
+ |---------------------------|------------------------------------|
33
+ | Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
34
+ | Sponsored by | [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) |
35
+ | Model Type | Decoder-only Transformer |
36
+ | Base Model | AraGPT2-Mega |
37
+ | Parameter Count | 1.46 Billion |
38
+ | Input | Text (metadata-driven prompts) |
39
+ | Output | Classical Arabic poetry |
40
+ | Continual Pretraining | 47M words (Arabic literature) |
41
+ | Fine-tuning Data | 118K poems with metadata |
42
+ | Language | Arabic |
43
+ | License | Apache 2.0 |
44
+
45
+ ---
46
+
47
+ ## Classical Arabic Poetry
48
+
49
+ Arabic poetry represents a culturally significant literary tradition governed by well-defined structural constraints:
50
+
51
+ **Metrical Patterns (البحور - Buhur):**
52
+ Categorized by al-Farahidi, 16 classical meters define rhythmic structure through specific patterns of long and short syllables. Common meters include:
53
+
54
+ - **الطويل** (Al-Taweel) - The Long
55
+ - **البسيط** (Al-Baseet) - The Simple
56
+ - **الكامل** (Al-Kamel) - The Complete
57
+ - **الوافر** (Al-Wafir) - The Abundant
58
+ - **الرمل** (Al-Ramal) - The ''Ramal''
59
+
60
+ **Prosodic Rules (عروض - Arud):**
61
+ Governs syllable patterns, ensuring adherence to metrical feet across hemistichs (half-verses).
62
+
63
+ **Rhyme Scheme (قافية - Qafiyah):**
64
+ Strict rhyme requirements where all verses typically end with the same rhyme letter and pattern.
65
+
66
+ ---
67
+
68
+ ## Data
69
+
70
+ ### Continual Pretraining
71
+
72
+ The base model of Fanar-2-Diwan was continually pretrained on **47 million words** of Arabic literary texts to specialize it in classical Arabic prose and poetic language and to strengthen its grasp of classical Arabic vocabulary, literary expressions, and rhetorical devices.
73
+
74
+ **Data Sources:**
75
+
76
+ - adab.com
77
+ - adabworld.com
78
+ - Other literature-focused websites
79
+
80
+ ### Fine-tuning
81
+
82
+ **Data Sources:**
83
+ - 126K poems crawled from AlDiwan.net
84
+ - Only 76% had complete metadata initially
85
+ - Only poets with more than 10 poems on AlDiwan are included in this version
86
+
87
+ **Data Quality Enhancement:**
88
+
89
+ 1. **Metadata Completion** (using GPT-4o + manual validation):
90
+ - **High accuracy**: Rhyme letter, era prediction
91
+ - **Moderate accuracy**: Topic classification
92
+ - **Lower accuracy**: Meter identification
93
+ - Final dataset: **118K poems** (94%) with complete metadata
94
+
95
+ 2. **Era Mapping**:
96
+ - Fine-grained labels → 5 canonical literary periods
97
+ - Reduces era skew in training data
98
+ - The following mapping was used to convert the era from the dataset to one of the [five canonical periods](https://ar.wikipedia.org/wiki/%D8%B9%D8%B5%D9%88%D8%B1_%D8%A7%D9%84%D8%A3%D8%AF%D8%A8_%D8%A7%D9%84%D8%B9%D8%B1%D8%A8%D9%8A).
99
+
100
+ - العصر الجاهلي: ما قبل الإسلام
101
+ - المخضرمون: ما قبل الإسلام
102
+ - العصر الأموي: الإسلامي
103
+ - العصر الاسلامي: الإسلامي
104
+ - العصر العباسي: العباسي
105
+ - العصر الأيوبي: العباسي
106
+ - العصر الفاطمي: العباسي
107
+ - العصر الأندلسي: الدول
108
+ - العصر المملوكي: الدول
109
+ - العصر العثماني: الدول
110
+ - العصر الحديث: الحديث
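The mapping above can be expressed as a simple lookup table. The following is a minimal sketch transcribed from the list; the names `ERA_TO_CANONICAL` and `canonical_era` are illustrative and are not taken from the Fanar code base.

```python
# Fine-grained AlDiwan era label -> canonical literary period,
# transcribed from the list above. Names are illustrative only.
ERA_TO_CANONICAL = {
    "العصر الجاهلي": "ما قبل الإسلام",
    "المخضرمون": "ما قبل الإسلام",
    "العصر الأموي": "الإسلامي",
    "العصر الاسلامي": "الإسلامي",
    "العصر العباسي": "العباسي",
    "العصر الأيوبي": "العباسي",
    "العصر الفاطمي": "العباسي",
    "العصر الأندلسي": "الدول",
    "العصر المملوكي": "الدول",
    "العصر العثماني": "الدول",
    "العصر الحديث": "الحديث",
}


def canonical_era(label: str) -> str:
    """Return the canonical period for a fine-grained era label."""
    try:
        return ERA_TO_CANONICAL[label.strip()]
    except KeyError:
        # Fail loudly on labels that are not in the published mapping.
        raise ValueError(f"Unknown era label: {label!r}") from None
```

Raising on unknown labels, rather than guessing a default period, keeps unmapped eras from silently skewing the era distribution.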
111
+
112
+ ## Model Usage
113
+ Fanar-2-Diwan uses a structured representation to enable **controllable generation** by specifying desired attributes.
114
+
115
+ Poems are encoded with explicit metadata control:
116
+
117
+ `Hemistich1 [meter][topic][era][poet_id] Hemistich2 [rhyme_letter]`
118
+
119
+ Usage Notes:
120
+
121
+ 1. The expected values for `meter`, `topic`, `era`, `poet_id`, and `rhyme_letter` are found in the file [`poem_generation_metadata.py`](https://huggingface.co/QCRI/Fanar-2-Diwan/blob/main/poem_generation_metadata.py)
122
+
123
+ 2. The input text of `Hemistich1` and `Hemistich2` should be provided without diacritics.
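One way to satisfy the second note is to strip diacritics before building the prompt. The sketch below is an assumption about what "without diacritics" means in practice: it removes the Unicode Arabic combining marks (U+064B to U+065F), the superscript alef (U+0670), and, as an extra assumption, the tatweel elongation character (U+0640). The normalization actually used to prepare the training data is not documented here, and `strip_diacritics` is a hypothetical helper name.

```python
import re

# Arabic combining marks (tanwin, harakat, shadda, sukun and related signs),
# the superscript alef, and the tatweel elongation character.
_DIACRITICS = re.compile("[\u064B-\u065F\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics and tatweel, leaving base letters untouched."""
    return _DIACRITICS.sub("", text)
```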
124
+
125
+
126
+ **Example:**
127
+ ```
128
+ فيا ليت الشباب يعود يوما [الوافر][دين][العباسي][poet_273] فأخبره بما صنع المشيب [ب]
129
+ ```
130
+
131
+ `poet_273` refers to `ابو العتاهية` in this example.
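The encoding above can be assembled and taken apart with a couple of small helpers. This is a minimal sketch that assumes single spaces around the tag group, as in the example; `build_verse` and `parse_verse` are hypothetical names and are not part of the released code. Valid tag values should still come from `poem_generation_metadata.py`.

```python
import re


def build_verse(hemistich1, hemistich2, meter, topic, era, poet_id, rhyme_letter):
    """Encode one verse as: Hemistich1 [meter][topic][era][poet_id] Hemistich2 [rhyme_letter]."""
    tags = f"[{meter}][{topic}][{era}][{poet_id}]"
    return f"{hemistich1.strip()} {tags} {hemistich2.strip()} [{rhyme_letter}]"


# Four adjacent bracketed tags between the hemistichs, one trailing rhyme tag.
_VERSE = re.compile(
    r"^(.*?)\s*\[([^\]]*)\]\[([^\]]*)\]\[([^\]]*)\]\[([^\]]*)\]\s*(.*?)\s*\[([^\]]*)\]\s*$"
)


def parse_verse(line):
    """Split an encoded verse back into its fields; raise ValueError if malformed."""
    match = _VERSE.match(line)
    if match is None:
        raise ValueError(f"Not a valid encoded verse: {line!r}")
    keys = ("hemistich1", "meter", "topic", "era", "poet_id", "hemistich2", "rhyme_letter")
    return dict(zip(keys, match.groups()))
```

Parsing with an anchored pattern rejects lines with a missing or extra tag instead of silently misassigning fields.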
132
+
133
+ ---
134
+
135
+ ## Poetry Example
136
+
137
+ Below is an example of a poem generated by Fanar-2-Diwan given the following metadata and the first verse:
138
+
139
+ - العصر: العثماني
140
+ - الشاعر: الأمير الصنعاني
141
+ - الموضوع: رومانسية
142
+ - البحر: الكامل
143
+ - القافية: ع
144
+
145
+ ```
146
+ لا والغرام وما تجن الأضلع .. من مهجة بيد الهوى تتقطع
147
+ ودموع عيني في هواك غزيرة .. وعلى جفوني من جفونك تدمع
148
+ يا أيها الرشأ الذي قلبي له .. في كل جارحة متيم مصرع
149
+ إن لم تكن لي منك يا رشأ الحمى .. عين فأنت لكل صب موجع
150
+ لولاك ما ذاب الفؤاد ولا غدا .. للعاشقين جميعهم يتضعضع
151
+ ما كنت أعلم أن حبك قاتلي .. حتى رميت بسهم لحاظك فأصعصع
152
+ إني لأعجب كيف تصبر عاشقا .. عن حبه وهو الكئيب الموجع
153
+ كيف التصبر عن حبيب قلبه .. بفؤادي أبدا عليه موزع
154
+ لم أنس ليلة زارني متلثما .. متبسما والدمع منه ينبعق
155
+ وبكى فظننته بدرا طالعا .. أو غصن بان نابتا يتفتقع
156
+ ```
157
+
158
+ Human Evaluation Scores for the example: Poeticness=6, Meaning=5, Coherence=7, Fluency=6.
159
+
160
+ ---
161
+
162
+ ## Getting Started
163
+
164
+ (Tested on Python 3.11.2 and 3.13.5, `transformers` v4.51.3 and `arabert` v1.0.1)
165
+
166
+ ```python
167
+ from transformers import AutoTokenizer, AutoModelForCausalLM
168
+
169
+ def process_poem_output(tokenizer, outputs):
170
+     generated_poem = tokenizer.decode(outputs[0], skip_special_tokens=False)
170
+     first_verse = generated_poem.splitlines()[0]
171
+     qafiya = first_verse[first_verse.rfind('[') + 1:first_verse.rfind(']')].strip()
172
+     trimmed_string = generated_poem[:generated_poem.rfind(f'[{qafiya}]') + len(f'[{qafiya}]')]
173
+     cleaned_lines = [line.split('[')[0].strip() + '\t' + line.split('[')[-2].split(']')[-1].strip() for line in trimmed_string.splitlines()]
174
+
175
+     return '\n'.join(cleaned_lines)
177
+
178
+ model_name = "QCRI/Fanar-2-Diwan"
179
+
180
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
181
+ model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
182
+
183
+ # Metadata-driven prompt
184
+ # Format: Hemistich1 [meter][topic][era][poet] Hemistich2 [rhyme_letter]
185
+ prompt = "على أي حال لليالي أعاتب[الطويل][حزن][الدول][poet_1196]وأي صروف للزمان أغالب[ب]"
186
+ # poet_1196 is ابن خلدون
187
+ # Mamluk era (العصر المملوكي) is mapped to Countries era (عصر الدول والممالك)
188
+
189
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
190
+ outputs = model.generate(
191
+     **inputs,
192
+     max_length=256,
193
+     num_beams=5,
194
+     top_p=0.92,
195
+     do_sample=True,
196
+     repetition_penalty=3.0,
197
+     no_repeat_ngram_size=6,
198
+     early_stopping=True)
199
+
200
+ print(process_poem_output(tokenizer, outputs))
201
+ ```
202
+
203
+ Sample generated poem:
204
+
205
+ ```
206
+ على أي حال لليالي أعاتب .. وأي صروف للزمان أغالب
207
+ إذا لم يكن لي في الخطوب معول .. فإن خطوب الدهر لا شك غالب
208
+ أعلل نفسي بالمنى وهي ضلة .. وأكتمها ما قد جنته النوائب
209
+ وما أنا إلا كالزناد مضاءه .. له بين أحناء الضلوع مضارب
210
+ ولولا الهوى ما ذقت طعم مرارة .. ولا سقيت كأسا من الحب شارب
211
+ سقى الله أكناف الحمى كل غادية .. من المزن أو صوب الحيا وهو ساكب
212
+ وعهدي بهاتيك المعاهد جنة .. تظل بها الأرواح فيها ترائب
213
+ فيا حبذا تلك المعاهد والحمى .. ويا حبذا تلك المنازل والربى
214
+ وحبذا تلك الطلول وإن نأت .. بها غربة الدار التي هي آيب
215
+ وماذا عسى يجدي التذكر والهوى .. يعبر عما في الضمير الكاتب
216
+ لقد كنت أخشى أن يطول بعادنا .. فأصبحت أخشى أن يطول النوى
217
+ ```
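Because individual verses can drift from the requested rhyme, a quick check of the final letter of each verse helps when reviewing output. The sketch below is a deliberate simplification: a full Qafiyah involves more than the last letter, and the `..` separator is assumed from the display format used in this README. `rhyme_mismatches` is a hypothetical helper name.

```python
def rhyme_mismatches(poem: str, rhyme_letter: str, sep: str = "..") -> list[int]:
    """Return 1-based indices of verses whose last letter is not rhyme_letter."""
    mismatches = []
    # Ignore blank lines so indices match the visible verse order.
    verses = [line for line in poem.splitlines() if line.strip()]
    for index, verse in enumerate(verses, start=1):
        # The rhyme falls at the end of the second hemistich.
        second_hemistich = verse.split(sep)[-1].strip()
        if not second_hemistich.endswith(rhyme_letter):
            mismatches.append(index)
    return mismatches
```

Applied to the sample above with rhyme letter `ب`, this check would flag verses 8 and 11, which end in `ى`.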
218
+
219
+ ---
220
+
221
+ ## Evaluation
222
+
223
+ ### Human Evaluation (Expert Linguists)
224
+
225
+ Two expert Arabic linguists independently rated **50 poems** (5 eras) on a 0-10 scale across four criteria:
226
+
227
+ | Model | Poeticness | Meaning | Coherence | Fluency |
228
+ |-------|-----------|---------|-----------|---------|
229
+ | **Fanar-Diwan** | 4.81 ± 1.57 | 4.83 ± 1.14 | 5.52 ± 1.27 | **6.88 ± 1.12** |
230
+ | GPT-4o | **5.73 ± 1.81** | **5.52 ± 1.30** | **6.31 ± 1.39** | 6.25 ± 1.06 |
231
+ | GPT-4o-mini | 1.70 ± 0.76 | 3.70 ± 1.15 | 3.92 ± 1.08 | 3.82 ± 1.16 |
232
+ | Fanar-27B | 0.77 ± 1.13 | 2.74 ± 1.22 | 3.45 ± 1.70 | 5.06 ± 1.41 |
233
+ | Fanar-9B | 0.08 ± 0.35 | 2.21 ± 1.17 | 2.83 ± 1.43 | 4.12 ± 1.65 |
234
+ | ALLaM-7B | 0.50 ± 0.97 | 0.50 ± 0.74 | 0.54 ± 0.85 | 1.44 ± 1.44 |
235
+ | Jais-13B | 0.06 ± 0.24 | 0.52 ± 0.82 | 0.67 ± 1.08 | 1.88 ± 2.13 |
236
+
237
+ **Key Findings:**
238
+
239
+ - **Fanar-Diwan achieves highest fluency** (6.88), even surpassing GPT-4o (6.25)
240
+ - Second only to GPT-4o on poeticness (4.81 vs 5.73), meaning (4.83 vs 5.52), and coherence (5.52 vs 6.31)
241
+ - Significantly outperforms general-purpose Arabic models (Fanar-27B, ALLaM, Jais)
242
+ - Specialized training on poetry corpus yields substantial improvements
243
+
244
+ **Benchmark Details:**
245
+
246
+ - **Evaluation set**: 50 poems from 5 core literary eras
247
+ - **Evaluators**: 2 expert Arabic linguists
248
+ - **Inter-annotator agreement**: High (Cohen's kappa > 0.7)
249
+ - **Scoring range**: 0-10 scale
250
+ - **Metrics**: Poeticness, Meaning, Coherence, Fluency
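For readers who want to reproduce an agreement figure on their own annotations, the following is a minimal sketch of unweighted Cohen's kappa. The source does not state whether a weighted variant was used or how the 0-10 scores were treated, so this is illustrative rather than the exact procedure behind the reported value. `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For ordinal scores such as a 0-10 scale, a weighted kappa that gives partial credit to near misses is often preferred; the unweighted form here treats any difference as full disagreement.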
251
+
252
+ ---
253
+
254
+ ## Intended Use, Limitations & Ethical Considerations
255
+
256
+ Fanar-2-Diwan is built for classical Arabic poetry generation adhering to traditional prosodic rules, educational applications teaching Arabic prosody (Arud) and metrics (Buhur), cultural preservation of classical Arabic poetic traditions, literary research on computational poetry generation, creative writing assistance, and poetry completion and variation generation.
257
+
258
+ - **Strengths:** The model adheres to classical metrical patterns (Buhur) and rhyme schemes (Qafiyah), generates culturally appropriate content, and achieves the best fluency among evaluated models (6.88/10).
259
+ - **Limitations:** Poetry quality varies (averaging ~5/10 on poeticness), and the model may produce semantically weak verses. Meter detection in training data had lower accuracy, and the model cannot fully match human poet creativity or produce modern/non-classical forms.
260
+ - **Ethical Considerations:** Always disclose AI-generated content and provide attribution when publishing. Poetry generated by this model may touch on sensitive cultural and religious themes and should be treated with appropriate respect. The model is excellent for learning and creative assistance, but is not a replacement for human poets. Individual poem quality varies, and human review is recommended before publication. This model should not be used for plagiarism, misrepresenting cultural or religious content, commercial purposes without proper attribution, high-stakes literary competitions without disclosure, professional literary publications without review, or religious and ceremonial contexts requiring perfect metrical accuracy.
261
+
262
+ Kindly refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).
263
+
264
+ The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.
265
+
266
+ <!---
267
+ ## Future Improvements
268
+
269
+ Planned enhancements for future versions:
270
+
271
+ 1. **Larger training corpus**:
272
+ - The Poetry Encyclopedia (poetry.dct.gov.ae)
273
+ - The Poets Gate (poetsgate.com)
274
+ - Additional verified sources
275
+
276
+ 2. **Improved metadata quality**:
277
+ - Better meter classification
278
+ - Enhanced topic taxonomy
279
+ - More granular era labels
280
+
281
+ 3. **Better diacritization**:
282
+ - Improved Bi-LSTM models
283
+ - Hybrid cascaded-joint approaches
284
+ - Context-aware diacritization
285
+
286
+ 4. **Model scaling**:
287
+ - Larger base models (2.7B, 7B parameters)
288
+ - More training data
289
+ - Longer training iterations
290
+
291
+ 5. **Enhanced controllability**:
292
+ - Theme and emotion control
293
+ - Length specification
294
+ - Rhetorical device selection
295
+ -->
296
+ ---
297
+
298
+
299
+ ## Fanar Platform
300
+
301
+ While Fanar-2-Diwan is a powerful standalone model, it is part of the broader **Fanar Platform**, an integrated Arabic-centric multimodal AI ecosystem that provides enhanced capabilities and continuous updates. The platform includes:
302
+
303
+ **Core Capabilities:**
304
+
305
+ - **Text Generation**: Multiple conversational models optimized for different tasks
306
+ - **Speech (Aura)**: Speech-to-text (short-form and long-form) and text-to-speech synthesis with Arabic dialect support and bilingual Arabic-English capabilities
307
+ - **Image Understanding (Oryx-IVU)**: Vision-language model for culturally-grounded image and video understanding including Arabic calligraphy recognition
308
+ - **Image Generation (Oryx-IG)**: Culturally-aligned text-to-image generation trained on taxonomy-driven data across 23,000+ cultural search terms
309
+ - **Machine Translation (FanarShaheen)**: High-quality bilingual Arabic↔English translation across diverse domains (e.g., news, STEM, and medical)
310
+ - **Poetry Generation (Diwan)**: Classical Arabic poetry generation respecting prosodic meters (Buhur) and maintaining diacritization accuracy
311
+
312
+ **Specialized Systems:**
313
+
314
+ - **Fanar-Sadiq**: Multi-agent Islamic question-answering system with 9 specialized tools (Fiqh reasoning, Quran/Hadith retrieval, zakat/inheritance calculation, prayer times, and Hijri calendar). Deployed in production on [IslamWeb](https://islamweb.net) and [IslamOnline](https://islamonline.net) platforms.
315
+ - **Safety & Moderation**: Fanar-Guard and culturally-informed content filtering trained on 468K annotated Arabic-English safety examples
316
+
317
+ **Access Points:**
318
+
319
+ - **[Fanar Chat](https://chat.fanar.qa)**: Web conversational interface integrating all modalities
320
+ - **[iOS](https://apps.apple.com/jo/app/fanar-فنار/id6741857943) and [Android](https://play.google.com/store/apps/details?id=com.fanarmobile) apps**: Mobile apps for on-the-go access to the Fanar Platform
321
+ - **[Fanar API](https://api.fanar.qa)**: Programmatic access to models and specialized capabilities
322
+
323
+ The Fanar Platform continuously evolves with model updates, new capabilities, and improved safety mechanisms. For production deployments requiring the latest features, multimodal integration, cross-model orchestration, and ongoing support, we recommend using the [Fanar Platform](https://fanar.qa) rather than the standalone models published here.
324
+
325
+ ---
326
+
327
+ ## Citation
328
+
329
+ If you use Fanar-2-Diwan or the Fanar 2.0 GenAI platform in your research or applications, please cite:
330
+
331
+ ```bibtex
332
+ @misc{fanarteam2026fanar20arabicgenerative,
333
+ title={Fanar 2.0: Arabic Generative AI Stack},
334
+ author={FANAR TEAM and Ummar Abbas and Mohammad Shahmeer Ahmad and Minhaj Ahmad and Abdulaziz Al-Homaid and Anas Al-Nuaimi and Enes Altinisik and Ehsaneddin Asgari and Sanjay Chawla and Shammur Chowdhury and Fahim Dalvi and Kareem Darwish and Nadir Durrani and Mohamed Elfeky and Ahmed Elmagarmid and Mohamed Eltabakh and Asim Ersoy and Masoomali Fatehkia and Mohammed Qusay Hashim and Majd Hawasly and Mohamed Hefeeda and Mus'ab Husaini and Keivin Isufaj and Soon-Gyo Jung and Houssam Lachemat and Ji Kim Lucas and Abubakr Mohamed and Tasnim Mohiuddin and Basel Mousi and Hamdy Mubarak and Ahmad Musleh and Mourad Ouzzani and Amin Sadeghi and Husrev Taha Sencar and Mohammed Shinoy and Omar Sinan and Yifan Zhang},
335
+ year={2026},
336
+ eprint={2603.16397},
337
+ archivePrefix={arXiv},
338
+ primaryClass={cs.CL},
339
+ url={https://arxiv.org/abs/2603.16397},
340
+ }
341
+ ```
342
+
343
+ ---
344
+
345
+ ## Acknowledgements
346
+
347
+ This project is from [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.
348
+
349
+ Special thanks to the [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support by providing the compute infrastructure needed to develop and serve the platform through the Google Cloud Platform.
350
+
351
+ ---
352
+
353
+ ## License
354
+
355
+ This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).