---
license: apache-2.0
language:
- ar
- en
pipeline_tag: image-text-to-text
tags:
- pytorch
- vision-language
- multimodal
- cultural-understanding
library_name: transformers
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---

<p align="center">
  <img src="./assets/fanar_logo.jpg" width="200"/>
</p>

# Fanar-2-Oryx-IVU (Image & Video Understanding)

**Fanar-2-Oryx-IVU** is an Arabic-first vision-language model for culturally-aware image and video understanding, developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. It is part of the **Fanar 2.0 release**, a comprehensive Arabic-centric multimodal generative AI platform that also includes [text generation](https://huggingface.co/QCRI/Fanar-2-27B-Instruct), [image generation](https://huggingface.co/QCRI/Fanar-2-Oryx-IG) and [poetry generation](https://huggingface.co/QCRI/Fanar-2-Diwan).

Fanar-2-Oryx-IVU specializes in understanding images and videos with strong Arabic language support, cultural awareness, and Arabic calligraphy recognition. Trained on **62M bilingual examples** (50/50 Arabic/English), the model outperforms its base model (Qwen2.5-VL-7B) on culturally relevant content while achieving **70% user satisfaction** and significantly reduced code-switching in Arabic responses.

We have published a [report](https://arxiv.org/abs/2603.16397) with all the details regarding the Fanar 2.0 GenAI platform. We also provide a [chat interface](https://chat.fanar.qa), mobile apps for [iOS](https://apps.apple.com/jo/app/fanar-فنار/id6741857943) and [Android](https://play.google.com/store/apps/details?id=com.fanarmobile), and [API access](https://api.fanar.qa/docs) to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).

---

## Model Details

| Attribute | Value |
|---------------------------|------------------------------------|
| Developed by | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/) |
| Sponsored by | [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) |
| Model Type | Vision-Language Model (VLM) |
| Base Model | Qwen2.5-VL-Instruct (7B) |
| Parameter Count | 7 Billion |
| Architecture | Dynamic-resolution ViT + LLM |
| Fine-tuning Method | LoRA (rank 128) on attention layers |
| Vision Encoder | Frozen during training |
| Input Modalities | Images, Videos, Text |
| Output | Text (Arabic/English) |
| Training Framework | LLaMAFactory |
| Training Data | 62M multimodal examples |
| Languages | Arabic, English |
| License | Apache 2.0 |

---

## Model Training

### Training Data (62M Examples)

Fanar-2-Oryx-IVU was trained on a comprehensive multimodal dataset with **balanced Arabic-English representation** (approximately 50/50):

#### 1. Cultural Content (24M VQA pairs)

- **240K internally collected images** from taxonomy-driven crawling
- Coverage: 22 Arab countries across cultural categories
- **Dense supervision**: up to 63 QA pairs per image
- **Bilingual VQA synthesis**: English + Modern Standard Arabic
- **Null-field supervision**: explicit "absence" questions to reduce hallucinations
- Generated via Gemini 2.5 Flash with structured metadata

#### 2. Arabic Fonts & Calligraphy (54K pairs)

- **20K calligraphy images** featuring Qur'anic verses
- **5 major Arabic scripts**: Thuluth (الثلث), Naskh (النسخ), Ruq'ah (الرقعة), Kufi (الكوفي), Diwani (الديواني)
- Dual objectives:
  - Content identification (transcribing Arabic text)
  - Script classification (recognizing calligraphic style)
- All prompts and responses are in Arabic

#### 3. Object Detection & Localization (1.6M pairs)

- Based on AllenAI public datasets
- Enhanced with instance-level bounding boxes
- **WordNet-style taxonomic expansion** for robust semantic coverage
- **Point-based grounding**: (x,y) coordinate lists for spatial reasoning
- **Bilingual**: 800K English + 800K Arabic (translated)

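Grounding answers embed these point lists inline in the response text, e.g. `[(0.65, 0.42), (0.71, 0.45)]`. As a minimal sketch, a hypothetical helper (not shipped with the model) could pull the normalized coordinates out of an Arabic or English answer:

```python
import re

# Hypothetical helper for illustration only: extracts the normalized
# (x, y) point lists that point-based grounding answers embed inline.
POINT_RE = re.compile(r"\(\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*\)")

def extract_points(answer: str) -> list[tuple[float, float]]:
    """Return all (x, y) pairs found in a grounding answer string."""
    return [(float(x), float(y)) for x, y in POINT_RE.findall(answer)]

points = extract_points("شخصان على اليسار [(0.23, 0.38), (0.28, 0.41)]")
print(points)  # [(0.23, 0.38), (0.28, 0.41)]
```

The exact output format can vary between prompts, so production code should validate that extracted values fall in the expected [0, 1] range.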
#### 4. General Image Captioning (34M pairs)

- **566K source images** from the Pixmo dataset
- Detailed audio-transcribed captions (high quality)
- **27 paraphrased templates per language** to increase diversity
- **17M English + 17M Arabic** caption pairs

#### 5. Text-only Instruction (1.9M pairs)

- UltraChat in English + Arabic translation
- Maintains dialogue capability independent of visual input
- Supports mixed text-visual interactions in realistic deployments

### Training Methodology

- **Parameter-efficient fine-tuning**: LoRA (rank 128) on attention layers
- **Vision encoder frozen**: preserves pretrained visual representations
- **Multi-run training with TIES merging**: combines complementary strengths of separate runs
- **Training scale**: 16 nodes, approximately 2 weeks
- **Language balance**: strict 50/50 Arabic-English distribution maintained

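For intuition, the TIES merging step named above can be sketched on toy 1-D "task vectors" (fine-tuned weights minus base weights). This is a simplified illustration of the published TIES-Merging procedure (trim, elect sign, disjoint mean), not our actual training code; the density value and arrays are made up for the example:

```python
import numpy as np

def ties_merge(task_vectors: list[np.ndarray], density: float = 0.5) -> np.ndarray:
    """Simplified TIES merge: trim small deltas, elect a per-parameter
    sign, then average only the deltas that agree with that sign."""
    trimmed = []
    for tv in task_vectors:
        k = max(1, int(density * tv.size))
        thresh = np.sort(np.abs(tv))[-k]          # magnitude of the k-th largest entry
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    stacked = np.stack(trimmed)
    sign = np.sign(stacked.sum(axis=0))           # elected sign per parameter
    agree = (np.sign(stacked) == sign) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)     # avoid division by zero
    return np.where(agree, stacked, 0.0).sum(axis=0) / counts

# Two toy "runs": they agree on the first parameter, conflict elsewhere.
merged = ties_merge([np.array([0.8, -0.1, 0.3]), np.array([0.6, 0.5, -0.4])])
```

Here the two runs' agreeing deltas on the first parameter are averaged (0.7), while the conflicting, trimmed-away entries contribute nothing.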
### Key Innovations

- **Taxonomy-guided cultural crawling** across 22 Arab countries
- **WordNet-style augmentation** for synonyms, hypernyms, long-tail concepts
- **Faithfulness by design**: null-field supervision for hallucination reduction
- **Point-based grounding**: (x,y) coordinates for spatial understanding
- **Native Arabic calligraphy recognition**: 5 major script styles

---

<!--
## Examples

Below are real examples demonstrating Fanar-2-Oryx-IVU's capabilities across different tasks:

### Example 1: Calligraphy Recognition

**Image**: *[Arabic calligraphy in Thuluth script]*

**Question (Arabic)**: ما نوع الخط المستخدم في هذه الصورة؟

**Fanar-2-Oryx-IVU Response**:
> الخط المستخدم هو خط الثلث، وهو من أجمل وأشهر الخطوط العربية. يتميز بحروفه الممدودة وأشكاله المتناسقة، وغالباً ما يُستخدم في كتابة الآيات القرآنية والزخارف الإسلامية.

*Translation: "The script used is Thuluth, one of the most beautiful and famous Arabic scripts. It is characterized by elongated letters and harmonious shapes, and is often used in writing Qur'anic verses and Islamic decorations."*

---

### Example 2: Cultural Understanding

**Image**: *[Traditional Gulf wedding ceremony with guests in cultural attire]*

**Question (English)**: What ceremony is shown in this image?

**Fanar-2-Oryx-IVU Response**:
> This appears to be a traditional Gulf wedding ceremony. The guests are wearing traditional attire - men in white thobes and ghutras, women in elegant abayas and hijabs. The setting shows the characteristic majlis-style seating arrangement and decorative elements typical of Qatari and Gulf wedding celebrations. The formal yet celebratory atmosphere is evident from the gathering and the traditional dress code being observed.

---

### Example 3: Object Detection & Counting

**Image**: *[Market scene with multiple people and objects]*

**Question (Arabic)**: كم عدد الأشخاص في الصورة؟ وأين يقفون؟

**Fanar-2-Oryx-IVU Response**:
> في الصورة خمسة أشخاص. ثلاثة منهم يقفون بالقرب من العربة على اليمين [(0.65, 0.42), (0.71, 0.45), (0.68, 0.52)]، وشخصان آخران على اليسار بالقرب من المحل [(0.23, 0.38), (0.28, 0.41)]. يبدو أنهم في سوق تقليدي أو سوق شعبي.

*Translation: "There are five people in the image. Three of them are standing near the cart on the right [coordinates], and two others on the left near the shop [coordinates]. They appear to be in a traditional or popular market."*

---

### Example 4: Landmark Recognition

**Image**: *[Museum of Islamic Art, Doha]*

**Question**: What is this building?

**Fanar-2-Oryx-IVU Response**:
> This is the Museum of Islamic Art (MIA) in Doha, Qatar. Designed by renowned architect I.M. Pei and opened in 2008, it is one of the world's most comprehensive collections of Islamic art. The distinctive modernist architecture features geometric Islamic patterns and sits on an artificial peninsula overlooking the Doha Corniche. The museum houses artifacts spanning 1,400 years from three continents and is a major cultural landmark in Qatar.

---

### Example 5: Dialect Understanding

**Image**: *[Street sign in Arabic]*

**Question (Egyptian dialect)**: إيه المكتوب في اللافتة دي؟

**Fanar-2-Oryx-IVU Response**:
> اللافتة مكتوب عليها "شارع الاستقلال" بالعربي. دي لافتة شارع واضحة بخط كبير، ومكتوبة بالعربية الفصحى.

*Translation: "The sign says 'Independence Street' in Arabic. It's a clear street sign in large font, written in Modern Standard Arabic."*

---

### Example 6: General Captioning

**Image**: *[Traditional Qatari family gathering]*

**Question (Arabic)**: اوصف ما تراه في هذه الصورة

**Fanar-2-Oryx-IVU Response**:
> الصورة تُظهر عائلة قطرية مجتمعة في مجلس تقليدي. الرجال يرتدون الثوب الأبيض والغترة، بينما النساء يرتدين العباءات والحجاب. الجلسة مُرتبة بأسلوب المجلس العربي التقليدي مع الوسائد والسجاد. في الوسط يوجد دلة القهوة العربية وأكواب صغيرة، مما يُشير إلى جلسة ضيافة تقليدية. الأجواء العائلية الدافئة واضحة من خلال التجمع والترتيب التقليدي للمكان.

*Translation: "The image shows a Qatari family gathered in a traditional majlis. Men wear white thobes and ghutras, while women wear abayas and hijabs. The seating is arranged in traditional Arab majlis style with cushions and carpets. In the center is an Arabic coffee dallah and small cups, indicating a traditional hospitality session. The warm family atmosphere is evident through the gathering and traditional arrangement of the space."*

---
-->

## Custom Evaluation Benchmarks

Fanar-2-Oryx-IVU was evaluated on multiple custom benchmarks designed specifically for Arabic cultural and linguistic assessment:

### 1. Oryx-Almieyar (12K questions)

- **200 images** (10 per country, 20 Arab countries)
- **30 dialect experts** for manual annotation
- **Three language variants**: English, MSA, country-specific dialects
- **Country-level diagnostic analysis** for geographic coverage

### 2. Oryx-BloomBench (7,747 pairs)

- **Bilingual** (English/Arabic)
- **6 Bloom's taxonomy levels**:
  - Remember (2,948)
  - Understand (1,592)
  - Analyze (1,431)
  - Create (685)
  - Evaluate (592)
  - Apply (499)
- Tests reasoning depth beyond surface perception

### 3. TaskGalaxy Subset (12K samples)

- Broad regression test for general capabilities
- Drawn from **19,227 hierarchical vision task types**
- Bilingual Arabic/English
- Prevents capability degradation during Arabic optimization

---

## Getting Started

Oryx-IVU is compatible with the Hugging Face `transformers` library. Here's how to load and use the model:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_name = "QCRI/Fanar-2-Oryx-IVU"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load image
image = Image.open("path/to/image.jpg")

# Prepare conversation (supports Arabic or English)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "ما الذي تراه في هذه الصورة؟"}
        ]
    }
]

# Process and generate
text_prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Trim the prompt tokens so only the newly generated answer is decoded
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
generated_text = processor.batch_decode(
    trimmed_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)

print(generated_text[0])
```

### Multi-turn Conversation

```python
# First turn
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is this landmark?"}
        ]
    }
]

# ... generate a response as shown above ...

# Second turn (building on context)
messages.append({"role": "assistant", "content": generated_text[0]})
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "Tell me more about its history"}]
})

# ... re-apply the chat template and generate again ...
```

---

## Evaluation

### Multiple-Choice Benchmarks (Arabic)

| Model | Arabic Culture | CamelBench | BloomBench | TaskGalaxy |
|-------|----------------|------------|------------|------------|
| **Fanar-2-Oryx-IVU** | **48.0%** | **45.0%** | **58.0%** | **74.0%** |
| Qwen2.5-VL (base) | 48.0% | 45.0% | 58.0% | 74.0% |
| Gemma-3-12B | 40.0% | 50.0% | 48.0% | 20.0% |
| Qwen2-VL-7B | 30.0% | 41.0% | 37.0% | 51.0% |
| AIN-7B | 33.0% | 45.0% | 45.0% | 61.0% |

*Note: Oryx-IVU matches its base model on multiple-choice accuracy but excels in generation quality and Arabic coherence.*

### Generative Evaluation (LLM-as-a-Judge, 1-5 scale)

Evaluated on **3,300 real user queries** with Gemini 2.5 Flash as judge:

| Model | Average Score | Comments |
|-------|---------------|----------|
| GPT-4o | 4.51 | Strongest overall |
| **Fanar-2-Oryx-IVU** | **3.03** | **Best among similar-sized models** |
| Qwen3-VL | 2.96 | Newer release, lower score |
| Qwen2.5-VL (base) | 2.76 | Our base model |
| AIN-7B | 2.23 | Similar-sized competitor |
| Qwen2-VL | 2.21 | Older version |

**Key Achievements:**

- Outperforms its base model by **+0.27 points** (a 10% relative improvement)
- Outperforms the newer Qwen3-VL despite being built on the older Qwen2.5-VL
- Best among all tested 7B-class models

### Language Consistency Improvements

| Metric | Base Model (Qwen2.5-VL) | Fanar-2-Oryx-IVU | Improvement |
|--------|-------------------------|------------------|-------------|
| Arabic-English code-switching | 11% | 6% | **45% reduction** |
| Arabic-Chinese mixing | 3% | 1.5% | **50% reduction** |

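Code-switching here means a response in one script drifting into another. A stdlib-only sketch of the kind of script check such a metric can be built on (this is a hypothetical illustration, not our evaluation pipeline, and it only covers the basic Unicode blocks):

```python
import re

# Hypothetical detector: flags responses that mix Arabic script with
# Latin or CJK characters (basic Unicode blocks only).
SCRIPTS = {
    "arabic": re.compile(r"[\u0600-\u06FF]"),
    "latin": re.compile(r"[A-Za-z]"),
    "cjk": re.compile(r"[\u4E00-\u9FFF]"),
}

def detected_scripts(text: str) -> set[str]:
    """Return the set of scripts present in `text`."""
    return {name for name, pat in SCRIPTS.items() if pat.search(text)}

def is_code_switched(text: str) -> bool:
    """True when a response mixes two or more scripts."""
    return len(detected_scripts(text)) > 1

print(is_code_switched("هذا مثال with English"))  # True
print(is_code_switched("نص عربي فقط"))            # False
```

A real metric would also need to whitelist legitimate Latin-script content such as URLs, code identifiers, and proper nouns.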
### User Satisfaction (3,300 queries)

| Rating | Percentage |
|--------|-----------|
| Like | **70%** |
| Dislike | 25% |
| No Reaction | 5% |

### Cultural Domain Excellence

Fanar-2-Oryx-IVU achieves leading performance in culturally sensitive categories:

- **Food & Drink**: top performer
- **Islamic Culture**: top performer
- **Landmarks**: top performer
- **Country-specific content**: best for Algeria, Jordan, Palestine, Qatar, Sudan

---

## Intended Use, Limitations & Ethical Considerations

Fanar-2-Oryx-IVU is built for:

- **Cultural heritage documentation** and preservation
- **Educational applications** teaching Arabic culture and history
- **Accessibility tools** for Arabic-speaking visually impaired users
- **Content moderation** for Arabic social media platforms
- **E-commerce** product description generation in Arabic
- **Museum and tourism** applications with multilingual support
- **Calligraphy and document analysis** for historical texts
- **Research** on Arabic vision-language understanding

**Limitations:**

- May produce hallucinations despite mitigation strategies
- Arabic text recognition in images remains challenging
- Performance varies across Arabic dialects
- May reflect biases present in the training data
- Cannot capture all cultural nuances

**Recommendations:**

- Verify critical information in generated responses
- Use human review for sensitive applications
- Provide user feedback mechanisms
- Monitor for cultural appropriateness, hallucinations, and errors
- Consider fine-tuning for domain-specific needs
- Implement fallback mechanisms for uncertain responses

**Not Suitable For:**

- Medical diagnosis or legal advice
- High-stakes decision-making
- Situations requiring perfect accuracy
- Replacing human judgment in cultural matters
- Surveillance applications

Please refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).

Output generated by this model does not represent the views of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.

---

## Fanar Platform

While Fanar-2-Oryx-IVU is a powerful standalone model, it is part of the broader **Fanar Platform**, an integrated Arabic-centric multimodal AI ecosystem that provides enhanced capabilities and continuous updates. The platform includes:

**Core Capabilities:**

- **Text Generation**: Multiple conversational models optimized for different tasks
- **Speech (Aura)**: Speech-to-text (short-form and long-form) and text-to-speech synthesis with Arabic dialect support and bilingual Arabic-English capabilities
- **Image Understanding (Oryx-IVU)**: Vision-language model for culturally grounded image and video understanding, including Arabic calligraphy recognition
- **Image Generation (Oryx-IG)**: Culturally aligned text-to-image generation trained on taxonomy-driven data across 23,000+ cultural search terms
- **Machine Translation (FanarShaheen)**: High-quality bilingual Arabic↔English translation across diverse domains (e.g., news, STEM, and medical)
- **Poetry Generation (Diwan)**: Classical Arabic poetry generation respecting prosodic meters (Buhur) and maintaining diacritization accuracy

**Specialized Systems:**

- **Fanar-Sadiq**: Multi-agent Islamic question-answering system with 9 specialized tools (Fiqh reasoning, Quran/Hadith retrieval, zakat/inheritance calculation, prayer times, and Hijri calendar). Deployed in production on the [IslamWeb](https://islamweb.net) and [IslamOnline](https://islamonline.net) platforms.
- **Safety & Moderation**: Fanar-Guard and culturally informed content filtering trained on 468K annotated Arabic-English safety examples

**Access Points:**

- **[Fanar Chat](https://chat.fanar.qa)**: Web conversational interface integrating all modalities
- **[iOS](https://apps.apple.com/jo/app/fanar-فنار/id6741857943) and [Android](https://play.google.com/store/apps/details?id=com.fanarmobile) apps**: Mobile apps for on-the-go access to the Fanar Platform
- **[Fanar API](https://api.fanar.qa)**: Programmatic access to models and specialized capabilities

The Fanar Platform continuously evolves with model updates, new capabilities, and improved safety mechanisms. For production deployments requiring the latest features, multimodal integration, cross-model orchestration, and ongoing support, we recommend the [Fanar Platform](https://fanar.qa) over the standalone models published here.

---

## Citation

If you use Fanar-2-Oryx-IVU or the Fanar 2.0 GenAI platform in your research or applications, please cite:

```bibtex
@misc{fanarteam2026fanar20arabicgenerative,
      title={Fanar 2.0: Arabic Generative AI Stack},
      author={FANAR TEAM and Ummar Abbas and Mohammad Shahmeer Ahmad and Minhaj Ahmad and Abdulaziz Al-Homaid and Anas Al-Nuaimi and Enes Altinisik and Ehsaneddin Asgari and Sanjay Chawla and Shammur Chowdhury and Fahim Dalvi and Kareem Darwish and Nadir Durrani and Mohamed Elfeky and Ahmed Elmagarmid and Mohamed Eltabakh and Asim Ersoy and Masoomali Fatehkia and Mohammed Qusay Hashim and Majd Hawasly and Mohamed Hefeeda and Mus'ab Husaini and Keivin Isufaj and Soon-Gyo Jung and Houssam Lachemat and Ji Kim Lucas and Abubakr Mohamed and Tasnim Mohiuddin and Basel Mousi and Hamdy Mubarak and Ahmad Musleh and Mourad Ouzzani and Amin Sadeghi and Husrev Taha Sencar and Mohammed Shinoy and Omar Sinan and Yifan Zhang},
      year={2026},
      eprint={2603.16397},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.16397},
}
```

---

## Acknowledgements

This project is from [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.

Special thanks to the [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support by providing the compute infrastructure needed to develop and serve the platform through the Google Cloud Platform.

---

## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).