Fanar-2-Oryx-IVU (Image & Video Understanding)
Fanar-2-Oryx-IVU is an Arabic-first vision-language model for culturally aware image and video understanding, developed by Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU), a member of Qatar Foundation for Education, Science, and Community Development. It is part of the Fanar 2.0 release, a comprehensive Arabic-centric multimodal generative AI platform that also includes text generation, image generation, and poetry generation.
Fanar-2-Oryx-IVU specializes in understanding images and videos with strong Arabic language support, cultural awareness, and Arabic calligraphy recognition. Trained on 62M bilingual examples (approximately 50/50 Arabic/English), the model outperforms its base model (Qwen2.5-VL-7B) on culturally relevant content, achieves 70% user satisfaction, and significantly reduces code-switching in Arabic responses.
We have published a report with all the details regarding the Fanar 2.0 GenAI platform. We also provide a chat interface, mobile apps for iOS and Android, and API access to our models and the GenAI platform (request access here).
Model Details
| Attribute | Value |
|---|---|
| Developed by | QCRI at HBKU |
| Sponsored by | Ministry of Communications and Information Technology, State of Qatar |
| Model Type | Vision-Language Model (VLM) |
| Base Model | Qwen2.5-VL-Instruct (7B) |
| Parameter Count | 7 Billion |
| Architecture | Dynamic-resolution ViT + LLM |
| Fine-tuning Method | LoRA (rank 128) on attention layers |
| Vision Encoder | Frozen during training |
| Input Modalities | Images, Videos, Text |
| Output | Text (Arabic/English) |
| Training Framework | LLaMAFactory |
| Training Data | 62M multimodal examples |
| Languages | Arabic, English |
| License | Apache 2.0 |
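To make the fine-tuning row above concrete: LoRA (rank 128) trains two low-rank factors per adapted attention matrix instead of updating the full weight. The NumPy sketch below uses illustrative dimensions and scaling factor, not the model's actual hidden sizes or hyperparameters:

```python
import numpy as np

# LoRA sketch: rather than updating a full weight W (d_out x d_in), train
# two low-rank factors B (d_out x r) and A (r x d_in) with r = 128, and
# apply W_eff = W + (alpha / r) * B @ A at inference time.
d_out, d_in, r, alpha = 512, 512, 128, 256  # illustrative, not the real sizes

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
A = rng.standard_normal((r, d_in)).astype(np.float32) * 0.01
B = np.zeros((d_out, r), dtype=np.float32)  # B starts at zero: no initial drift

W_eff = W + (alpha / r) * B @ A

# Trainable parameters per adapted matrix vs. full fine-tuning:
lora_params = A.size + B.size   # r * (d_in + d_out)
full_params = W.size            # d_out * d_in
print(lora_params, full_params)
```

At these toy sizes LoRA trains half as many parameters per matrix; at realistic hidden sizes (thousands) the ratio is far smaller, which is what makes rank-128 adapters cheap relative to full fine-tuning.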
Model Training
Training Data (62M Examples)
Fanar-2-Oryx-IVU was trained on a comprehensive multimodal dataset with balanced Arabic-English representation (approximately 50/50):
1. Cultural Content (24M VQA pairs)
- 240K internally collected images from taxonomy-driven crawling
- Coverage: 22 Arab countries across cultural categories
- Dense supervision: Up to 63 QA pairs per image
- Bilingual VQA synthesis: English + Modern Standard Arabic
- Null-field supervision: Explicit "absence" questions to reduce hallucinations
- Generated via Gemini 2.5 Flash with structured metadata
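As an illustration of the null-field idea, a training record can pair ordinary questions with an explicit absence question so the model learns to state that an attribute is not present instead of hallucinating it. The schema and filename below are hypothetical, not the actual dataset format:

```python
# Hypothetical record schema illustrating null-field supervision.
record = {
    "image": "majlis_example.jpg",  # hypothetical filename
    "qa_pairs": [
        {"q": "What traditional seating is shown?",
         "a": "A majlis with floor cushions along the walls."},
        # Null-field question: the asked-about attribute is absent,
        # and the reference answer says so explicitly.
        {"q": "What text appears on the wall?",
         "a": "No text is visible in the image."},
    ],
}

# Count the explicit-absence answers in this record.
absence_answers = [p["a"] for p in record["qa_pairs"] if p["a"].startswith("No ")]
print(len(absence_answers))
```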
2. Arabic Fonts & Calligraphy (54K pairs)
- 20K calligraphy images featuring Qur'anic verses
- 5 major Arabic scripts: Thuluth (الثلث), Naskh (النسخ), Ruq'ah (الرقعة), Kufi (الكوفي), Diwani (الديواني)
- Dual objectives:
- Content identification (transcribing Arabic text)
- Script classification (recognizing calligraphic style)
- All prompts and responses are in Arabic
3. Object Detection & Localization (1.6M pairs)
- Based on AllenAI public datasets
- Enhanced with instance-level bounding boxes
- WordNet-style taxonomic expansion for robust semantic coverage
- Point-based grounding: (x,y) coordinate lists for spatial reasoning
- Bilingual: 800K English + 800K Arabic (translated)
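A sketch of consuming the point-based grounding output: the exact serialization used in training is not documented here, so the textual `(x, y)` convention parsed below is an assumption:

```python
import re

def parse_points(text: str) -> list[tuple[int, int]]:
    """Extract integer (x, y) coordinate pairs from an annotation/model string."""
    return [(int(x), int(y))
            for x, y in re.findall(r"\((\d+)\s*,\s*(\d+)\)", text)]

# Hypothetical grounding string in the assumed format:
points = parse_points("camels: (120, 340), (415, 298); tent: (88, 102)")
print(points)
```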
4. General Image Captioning (34M pairs)
- 566K source images from Pixmo dataset
- Detailed audio-transcribed captions (high quality)
- 27 paraphrased templates per language to increase diversity
- 17M English + 17M Arabic caption pairs
5. Text-only Instruction (1.9M)
- UltraChat in English + Arabic translation
- Maintains dialogue capability independent of visual input
- Supports mixed text-visual interactions in realistic deployments
Training Methodology
- Parameter-efficient fine-tuning: LoRA (rank 128) on attention layers
- Vision encoder frozen: Preserves pretrained visual representations
- Multi-run training with TIES merging: Combines complementary strengths
- Training scale: 16 nodes, approximately 2 weeks
- Language balance: Strict 50/50 Arabic-English distribution maintained
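The TIES merging step above combines task vectors from multiple runs by trimming low-magnitude updates, electing a per-parameter sign, and averaging only the values that agree with it. A minimal sketch on flat NumPy vectors (the trim fraction and toy values are illustrative, not the training configuration):

```python
import numpy as np

def ties_merge(base: np.ndarray, finetuned: list[np.ndarray],
               keep: float = 0.2) -> np.ndarray:
    """Merge several fine-tuned variants of `base` via TIES (sketch)."""
    task_vectors = [ft - base for ft in finetuned]
    trimmed = []
    for tv in task_vectors:
        # 1) Trim: zero out all but the top-`keep` fraction by magnitude.
        k = max(1, int(keep * tv.size))
        thresh = np.sort(np.abs(tv))[-k]
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    stacked = np.stack(trimmed)
    # 2) Elect sign: per-parameter sign of the summed trimmed updates.
    elected = np.sign(stacked.sum(axis=0))
    # 3) Disjoint mean: average only values whose sign matches the elected one.
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    num = np.where(agree, stacked, 0.0).sum(axis=0)
    den = np.maximum(agree.sum(axis=0), 1)
    return base + num / den

base = np.zeros(4)
merged = ties_merge(base, [np.array([1.0, -2.0, 0.1, 0.0]),
                           np.array([1.0,  2.0, 0.0, 3.0])])
print(merged)
```

Sign election is what resolves the conflicting second parameter here: the run with the larger-magnitude update wins rather than the two cancelling out.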
Key Innovations
- Taxonomy-guided cultural crawling across 22 Arab countries
- WordNet-style augmentation for synonyms, hypernyms, long-tail concepts
- Faithfulness by design: Null-field supervision for hallucination reduction
- Point-based grounding: (x,y) coordinates for spatial understanding
- Native Arabic calligraphy recognition: 5 major script styles
Custom Evaluation Benchmarks
Fanar-2-Oryx-IVU was evaluated on multiple custom benchmarks designed specifically for Arabic cultural and linguistic assessment:
1. Oryx-Almieyar (12K questions)
- 200 images (10 per country, 20 Arab countries)
- 30 dialect experts for manual annotation
- Three language variants: English, MSA, country-specific dialects
- Country-level diagnostic analysis for geographic coverage
2. Oryx-BloomBench (7,747 pairs)
- Bilingual (English/Arabic)
- 6 Bloom's taxonomy levels:
- Remember (2,948)
- Understand (1,592)
- Analyze (1,431)
- Create (685)
- Evaluate (592)
- Apply (499)
- Tests reasoning depth beyond surface perception
3. TaskGalaxy Subset (12K samples)
- Broad regression test for general capabilities
- Drawn from a benchmark spanning 19,227 hierarchical vision task types
- Bilingual Arabic/English
- Prevents capability degradation during Arabic optimization
Getting Started
Oryx-IVU is compatible with the Hugging Face transformers library. Here's how to load and use the model:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_name = "QCRI/Fanar-2-Oryx-IVU"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load image
image = Image.open("path/to/image.jpg")

# Prepare conversation (supports Arabic or English)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "ما الذي تراه في هذه الصورة؟"}  # "What do you see in this image?"
        ]
    }
]

# Process and generate
text_prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated answer is decoded
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
generated_text = processor.batch_decode(
    trimmed_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)
print(generated_text[0])
```
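The model also accepts video input. A common approach is to sample a fixed number of frames uniformly from the decoded clip and hand them to the processor; the frame count below and the note about the `videos=` argument are application-level choices, not documented requirements:

```python
# Uniform frame sampling for video input (sketch). `frames` stands in for a
# list of decoded video frames (e.g. PIL images); integers are used here only
# so the sampling logic is easy to verify.
def sample_frames(frames: list, num_frames: int = 8) -> list:
    """Uniformly sample `num_frames` frames (or fewer, if the clip is short)."""
    if len(frames) <= num_frames:
        return list(frames)
    step = (len(frames) - 1) / (num_frames - 1)
    return [frames[round(i * step)] for i in range(num_frames)]

sampled = sample_frames(list(range(100)), num_frames=8)
print(sampled)
```

The sampled frames can then be passed to the processor (e.g. via its video input argument) with a `{"type": "video"}` entry in the message content, mirroring the image example above.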
Multi-turn Conversation
```python
# First turn
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is this landmark?"}
        ]
    }
]
# Run the same apply_chat_template / processor / generate steps as above
# to obtain generated_text for this turn.

# Second turn (building on context)
messages.append({"role": "assistant", "content": generated_text[0]})
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "Tell me more about its history"}]
})
# Re-apply the chat template to the full message list and generate again;
# the image from the first turn must be passed to the processor each time.
```
Evaluation
Multiple-Choice Benchmarks (Arabic)
| Model | Arabic Culture | CamelBench | BloomBench | TaskGalaxy |
|---|---|---|---|---|
| Fanar-2-Oryx-IVU | 48.0% | 45.0% | 58.0% | 74.0% |
| Qwen2.5-VL (base) | 48.0% | 45.0% | 58.0% | 74.0% |
| Gemma-3-12B | 40.0% | 50.0% | 48.0% | 20.0% |
| Qwen2-VL-7B | 30.0% | 41.0% | 37.0% | 51.0% |
| AIN-7B | 33.0% | 45.0% | 45.0% | 61.0% |
Note: Oryx-IVU matches its base model on multiple-choice benchmarks but excels in generation quality and Arabic coherence.
Generative Evaluation (LLM-as-a-Judge, 1-5 scale)
Evaluated on 3,300 real user queries with Gemini 2.5 Flash as judge:
| Model | Average Score | Comments |
|---|---|---|
| GPT-4o | 4.51 | Strongest overall |
| Fanar-2-Oryx-IVU | 3.03 | Best among similar-sized models |
| Qwen3-VL | 2.96 | Newer but lower quality |
| Qwen2.5-VL (base) | 2.76 | Our base model |
| AIN-7B | 2.23 | Similar size competitor |
| Qwen2-VL | 2.21 | Older version |
Key Achievements:
- Outperforms base model by +0.27 points (10% relative improvement)
- Outperforms newer Qwen3-VL despite being based on older Qwen2.5
- Best among all tested 7B-class models
Language Consistency Improvements
| Metric | Base Model (Qwen2.5-VL) | Fanar-2-Oryx-IVU | Improvement |
|---|---|---|---|
| Arabic-English code-switching | 11% | 6% | 45% relative reduction |
| Arabic-Chinese mixing | 3% | 1.5% | 50% relative reduction |
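A rough way to flag code-switching of the kind measured above is to count per-script alphabetic characters in a response. The heuristic below is a sketch for illustration, not the evaluation protocol behind the table:

```python
def script_mix(text: str) -> dict:
    """Count alphabetic characters per script (crude code-switching signal)."""
    counts = {"arabic": 0, "latin": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        if "\u0600" <= ch <= "\u06FF":   # core Arabic Unicode block
            counts["arabic"] += 1
        elif ch.isascii():
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts

# An Arabic sentence with an embedded English word ("This is a picture of a
# mosque in Doha") registers as mixed-script:
mixed = script_mix("هذه صورة لـ mosque في الدوحة")
print(mixed)
```

A response would count as code-switched when both the Arabic and Latin tallies are nonzero; a production metric would also need to whitelist legitimate Latin content such as proper nouns, URLs, and code.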
User Satisfaction (3,300 queries)
| Rating | Percentage |
|---|---|
| Like | 70% |
| Dislike | 25% |
| No Reaction | 5% |
Cultural Domain Excellence
Fanar-2-Oryx-IVU achieves leading performance in culturally-sensitive categories:
- Food & Drink: Top performer
- Islamic Culture: Top performer
- Landmarks: Top performer
- Country-specific content: Best for Algeria, Jordan, Palestine, Qatar, Sudan
Intended Use, Limitations & Ethical Considerations
Fanar-2-Oryx-IVU is built for:
- Cultural heritage documentation and preservation
- Educational applications teaching Arabic culture and history
- Accessibility tools for Arabic-speaking visually impaired users
- Content moderation for Arabic social media platforms
- E-commerce product description generation in Arabic
- Museum and tourism applications with multilingual support
- Calligraphy and document analysis for historical texts
- Research on Arabic vision-language understanding
Limitations:
- May produce hallucinations despite mitigation strategies
- Arabic text recognition in images remains challenging
- Performance varies across different Arabic dialects
- May reflect biases present in training data
- Cannot perfectly understand all cultural nuances
Recommendations:
- Verify critical information from generated responses
- Use human review for sensitive applications
- Provide user feedback mechanisms
- Monitor for cultural appropriateness, hallucinations and errors
- Consider fine-tuning for domain-specific needs
- Implement fallback mechanisms for uncertain responses
Not Suitable For:
- Medical diagnosis or legal advice
- High-stakes decision-making
- Situations requiring perfect accuracy
- Replacing human judgment in cultural matters
- Surveillance applications
Kindly refer to our Terms of Service and Privacy Policy.
The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.
Fanar Platform
While Fanar-2-Oryx-IVU is a powerful standalone model, it is part of the broader Fanar Platform—an integrated Arabic-centric multimodal AI ecosystem that provides enhanced capabilities and continuous updates. The platform includes:
Core Capabilities:
- Text Generation: Multiple conversational models optimized for different tasks
- Speech (Aura): Speech-to-text (short-form and long-form) and text-to-speech synthesis with Arabic dialect support and bilingual Arabic-English capabilities
- Image Understanding (Oryx-IVU): Vision-language model for culturally-grounded image and video understanding including Arabic calligraphy recognition
- Image Generation (Oryx-IG): Culturally-aligned text-to-image generation trained on taxonomy-driven data across 23,000+ cultural search terms
- Machine Translation (FanarShaheen): High-quality bilingual Arabic↔English translation across diverse domains (e.g., news, STEM, and medical)
- Poetry Generation (Diwan): Classical Arabic poetry generation respecting prosodic meters (Buhur) and maintaining diacritization accuracy
Specialized Systems:
- Fanar-Sadiq: Multi-agent Islamic question-answering system with 9 specialized tools (Fiqh reasoning, Quran/Hadith retrieval, zakat/inheritance calculation, prayer times, and Hijri calendar). Deployed in production on IslamWeb and IslamOnline platforms.
- Safety & Moderation: Fanar-Guard and culturally-informed content filtering trained on 468K annotated Arabic-English safety examples
Access Points:
- Fanar Chat: Web conversational interface integrating all modalities
- iOS and Android apps: Mobile apps for on-the-go access to the Fanar Platform
- Fanar API: Programmatic access to models and specialized capabilities
The Fanar Platform continuously evolves with model updates, new capabilities, and improved safety mechanisms. For production deployments requiring the latest features, multimodal integration, cross-model orchestration, and ongoing support, we recommend using the Fanar Platform rather than the standalone models published here.
Citation
If you use Fanar-2-Oryx-IVU or the Fanar 2.0 GenAI platform in your research or applications, please cite:
```bibtex
@misc{fanarteam2026fanar20arabicgenerative,
  title={Fanar 2.0: Arabic Generative AI Stack},
  author={FANAR TEAM and Ummar Abbas and Mohammad Shahmeer Ahmad and Minhaj Ahmad and Abdulaziz Al-Homaid and Anas Al-Nuaimi and Enes Altinisik and Ehsaneddin Asgari and Sanjay Chawla and Shammur Chowdhury and Fahim Dalvi and Kareem Darwish and Nadir Durrani and Mohamed Elfeky and Ahmed Elmagarmid and Mohamed Eltabakh and Asim Ersoy and Masoomali Fatehkia and Mohammed Qusay Hashim and Majd Hawasly and Mohamed Hefeeda and Mus'ab Husaini and Keivin Isufaj and Soon-Gyo Jung and Houssam Lachemat and Ji Kim Lucas and Abubakr Mohamed and Tasnim Mohiuddin and Basel Mousi and Hamdy Mubarak and Ahmad Musleh and Mourad Ouzzani and Amin Sadeghi and Husrev Taha Sencar and Mohammed Shinoy and Omar Sinan and Yifan Zhang},
  year={2026},
  eprint={2603.16397},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.16397},
}
```
Acknowledgements
This project is from Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.
Special thanks to the Ministry of Communications and Information Technology, State of Qatar for their continued support by providing the compute infrastructure needed to develop and serve the platform through the Google Cloud Platform.
License
This model is licensed under the Apache 2.0 License.