Fanar-2-Oryx-IVU (Image & Video Understanding)

Fanar-2-Oryx-IVU is an Arabic-first vision-language model for culturally-aware image and video understanding, developed by Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU), a member of Qatar Foundation for Education, Science, and Community Development. It is part of the Fanar 2.0 release, a comprehensive Arabic-centric multimodal generative AI platform that also spans text generation, speech, image generation, machine translation, and poetry generation.

Fanar-2-Oryx-IVU specializes in understanding images and videos with strong Arabic language support, cultural awareness, and Arabic calligraphy recognition. Trained on 62M bilingual examples (50/50 Arabic/English), the model outperforms its base model (Qwen2.5-VL-7B) on culturally-relevant content while achieving 70% user satisfaction and significantly reduced code-switching in Arabic responses.

We have published a report with full details of the Fanar 2.0 GenAI platform. We also provide a chat interface, mobile apps for iOS and Android, and API access to our models and the GenAI platform (request access here).


Model Details

| Attribute | Value |
|---|---|
| Developed by | QCRI at HBKU |
| Sponsored by | Ministry of Communications and Information Technology, State of Qatar |
| Model Type | Vision-Language Model (VLM) |
| Base Model | Qwen2.5-VL-Instruct (7B) |
| Parameter Count | 7 Billion |
| Architecture | Dynamic-resolution ViT + LLM |
| Fine-tuning Method | LoRA (rank 128) on attention layers |
| Vision Encoder | Frozen during training |
| Input Modalities | Images, Videos, Text |
| Output | Text (Arabic/English) |
| Training Framework | LLaMAFactory |
| Training Data | 62M multimodal examples |
| Languages | Arabic, English |
| License | Apache 2.0 |

Model Training

Training Data (62M Examples)

Fanar-2-Oryx-IVU was trained on a comprehensive multimodal dataset with balanced Arabic-English representation (approximately 50/50):

1. Cultural Content (24M VQA pairs)

  • 240K internally collected images from taxonomy-driven crawling
  • Coverage: 22 Arab countries across cultural categories
  • Dense supervision: Up to 63 QA pairs per image
  • Bilingual VQA synthesis: English + Modern Standard Arabic
  • Null-field supervision: Explicit "absence" questions to reduce hallucinations
  • Generated via Gemini 2.5 Flash with structured metadata
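
To make null-field supervision concrete, here is a hypothetical record sketch; the field names and filename are ours for illustration, not the dataset's actual schema:

```python
# Illustrative (hypothetical) record showing null-field supervision:
# alongside ordinary VQA pairs, the image gets an explicit "absence"
# question whose ground-truth answer is a refusal, teaching the model
# not to invent details missing from the structured metadata.
record = {
    "image": "souq_doha_0142.jpg",      # hypothetical filename
    "country": "Qatar",
    "category": "Traditional Markets",
    "qa_pairs": [
        {"question": "What type of market is shown?",
         "answer": "A traditional souq."},
        # Null-field question: the capture date is absent from the
        # metadata, so the target answer states it cannot be determined.
        {"question": "In what year was this photo taken?",
         "answer": "The image does not indicate when it was taken."},
    ],
}

def has_null_field_supervision(rec):
    """True if at least one QA pair trains an explicit 'absence' answer."""
    return any("does not indicate" in qa["answer"] for qa in rec["qa_pairs"])
```

The point of the second pair is that the supervised target actively denies unavailable information instead of leaving the model free to hallucinate a plausible year.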

2. Arabic Fonts & Calligraphy (54K pairs)

  • 20K calligraphy images featuring Qur'anic verses
  • 5 major Arabic scripts: Thuluth (الثلث), Naskh (النسخ), Ruq'ah (الرقعة), Kufi (الكوفي), Diwani (الديواني)
  • Dual objectives:
    • Content identification (transcribing Arabic text)
    • Script classification (recognizing calligraphic style)
  • All prompts and responses are in Arabic

3. Object Detection & Localization (1.6M pairs)

  • Based on AllenAI public datasets
  • Enhanced with instance-level bounding boxes
  • WordNet-style taxonomic expansion for robust semantic coverage
  • Point-based grounding: (x,y) coordinate lists for spatial reasoning
  • Bilingual: 800K English + 800K Arabic (translated)
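
As an illustration of point-based grounding, the sketch below parses "(x, y)" pairs out of a response string. The actual serialization used during training is not documented here, so this format is an assumption:

```python
import re

def parse_points(answer: str):
    """Extract (x, y) coordinate pairs from a grounding answer.

    Assumes points are serialized as "(x, y)" pairs in the response
    text, e.g. "The dates are at (412, 300) and (520, 311)." -- the
    real training format may differ.
    """
    return [(int(x), int(y))
            for x, y in re.findall(r"\((\d+)\s*,\s*(\d+)\)", answer)]
```

A parser like this turns free-text grounding answers back into coordinates for evaluation or downstream spatial reasoning.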

4. General Image Captioning (34M pairs)

  • 566K source images from Pixmo dataset
  • Detailed audio-transcribed captions (high quality)
  • 27 paraphrased templates per language to increase diversity
  • 17M English + 17M Arabic caption pairs
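
The template multiplication works roughly like the sketch below (the template strings are illustrative, not the 27 actually used): each transcribed caption is paired with every prompt template, turning 566K source images into millions of training pairs per language.

```python
# Hypothetical prompt templates (the dataset uses 27 per language).
EN_TEMPLATES = [
    "Describe this image in detail.",
    "What does this picture show?",
    "Provide a thorough caption for this image.",
]

def expand_caption(image_id, caption, templates):
    """Pair one audio-transcribed caption with every prompt template."""
    return [{"image": image_id, "prompt": t, "response": caption}
            for t in templates]
```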

5. Text-only Instruction (1.9M)

  • UltraChat in English + Arabic translation
  • Maintains dialogue capability independent of visual input
  • Supports mixed text-visual interactions in realistic deployments

Training Methodology

  • Parameter-efficient fine-tuning: LoRA (rank 128) on attention layers
  • Vision encoder frozen: Preserves pretrained visual representations
  • Multi-run training with TIES merging: Combines complementary strengths
  • Training scale: 16 nodes, approximately 2 weeks
  • Language balance: Strict 50/50 Arabic-English distribution maintained
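
The multi-run TIES merge can be illustrated with a toy, flat-parameter sketch of the three TIES steps (trim, elect sign, disjoint merge). The density value and the merge configuration actually used for Fanar-2-Oryx-IVU are not specified here, so everything below is an assumption:

```python
def ties_merge(base, checkpoints, density=0.5):
    """Toy TIES merge over flat parameter lists.

    Steps: (1) trim each task vector to its largest-magnitude `density`
    fraction, (2) elect a sign per parameter from the trimmed values,
    (3) average only the values that agree with the elected sign.
    """
    # Task vectors: each checkpoint's delta from the shared base.
    deltas = [[c - b for c, b in zip(ckpt, base)] for ckpt in checkpoints]
    # Trim: zero out all but the top-|density| fraction by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, int(len(d) * density))
        thresh = sorted((abs(v) for v in d), reverse=True)[k - 1]
        trimmed.append([v if abs(v) >= thresh else 0.0 for v in d])
    # Elect sign and disjoint-merge per parameter.
    merged = []
    for i, b in enumerate(base):
        vals = [t[i] for t in trimmed]
        sign = 1.0 if sum(vals) >= 0 else -1.0
        agree = [v for v in vals if v != 0.0 and (v > 0) == (sign > 0)]
        merged.append(b + (sum(agree) / len(agree) if agree else 0.0))
    return merged
```

The payoff of the sign election is that runs which pull a parameter in opposite directions no longer cancel each other out; only the winning direction contributes to the merge.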

Key Innovations

  • Taxonomy-guided cultural crawling across 22 Arab countries
  • WordNet-style augmentation for synonyms, hypernyms, long-tail concepts
  • Faithfulness by design: Null-field supervision for hallucination reduction
  • Point-based grounding: (x,y) coordinates for spatial understanding
  • Native Arabic calligraphy recognition: 5 major script styles

Custom Evaluation Benchmarks

Fanar-2-Oryx-IVU was evaluated on multiple custom benchmarks designed specifically for Arabic cultural and linguistic assessment:

1. Oryx-Almieyar (12K questions)

  • 200 images (10 per country, 20 Arab countries)
  • 30 dialect experts for manual annotation
  • Three language variants: English, MSA, country-specific dialects
  • Country-level diagnostic analysis for geographic coverage

2. Oryx-BloomBench (7,747 pairs)

  • Bilingual (English/Arabic)
  • 6 Bloom's taxonomy levels:
    • Remember (2,948)
    • Understand (1,592)
    • Analyze (1,431)
    • Create (685)
    • Evaluate (592)
    • Apply (499)
  • Tests reasoning depth beyond surface perception

3. TaskGalaxy Subset (12K samples)

  • Broad regression test for general capabilities
  • 19,227 hierarchical vision task types
  • Bilingual Arabic/English
  • Prevents capability degradation during Arabic optimization

Getting Started

Oryx-IVU is compatible with the Hugging Face transformers library. Here's how to load and use the model:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_name = "QCRI/Fanar-2-Oryx-IVU"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load image
image = Image.open("path/to/image.jpg")

# Prepare conversation (supports Arabic or English)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "ما الذي تراه في هذه الصورة؟"}  # "What do you see in this image?"
        ]
    }
]

# Process and generate
text_prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated answer is decoded
trimmed_ids = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)
]
generated_text = processor.batch_decode(
    trimmed_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)

print(generated_text[0])
```

Multi-turn Conversation

```python
# First turn
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is this landmark?"}
        ]
    }
]

# ... generate response ...

# Second turn (building on context)
messages.append({"role": "assistant", "content": generated_text[0]})
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "Tell me more about its history"}]
})

# ... generate response ...
```
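
The append pattern above can be wrapped in a small helper; this is a convenience of ours, not part of the model's API, since the chat template simply receives the growing message list on each turn:

```python
def add_turn(messages, assistant_reply, next_user_text):
    """Append the model's reply and the next user turn to the history.

    Returns a new list so the caller's original history is untouched.
    """
    messages = list(messages)  # shallow copy; don't mutate the caller's list
    messages.append({"role": "assistant", "content": assistant_reply})
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": next_user_text}]})
    return messages
```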

Evaluation

Multiple-Choice Benchmarks (Arabic)

| Model | Arabic Culture | CamelBench | BloomBench | TaskGalaxy |
|---|---|---|---|---|
| Fanar-2-Oryx-IVU | 48.0% | 45.0% | 58.0% | 74.0% |
| Qwen2.5-VL (base) | 48.0% | 45.0% | 58.0% | 74.0% |
| Gemma-3-12B | 40.0% | 50.0% | 48.0% | 20.0% |
| Qwen2-VL-7B | 30.0% | 41.0% | 37.0% | 51.0% |
| AIN-7B | 33.0% | 45.0% | 45.0% | 61.0% |

Note: Oryx-IVU matches the base model on multiple-choice accuracy but excels in generation quality and Arabic coherence.

Generative Evaluation (LLM-as-a-Judge, 1-5 scale)

Evaluated on 3,300 real user queries with Gemini 2.5 Flash as judge:

| Model | Average Score | Comments |
|---|---|---|
| GPT-4o | 4.51 | Strongest overall |
| Fanar-2-Oryx-IVU | 3.03 | Best among similar-sized models |
| Qwen3-VL | 2.96 | Newer generation, lower judged quality |
| Qwen2.5-VL (base) | 2.76 | Our base model |
| AIN-7B | 2.23 | Similar-size competitor |
| Qwen2-VL | 2.21 | Older generation |

Key Achievements:

  • Outperforms base model by +0.27 points (10% relative improvement)
  • Outperforms newer Qwen3-VL despite being based on older Qwen2.5
  • Best among all tested 7B-class models

Language Consistency Improvements

| Metric | Base Model (Qwen2.5-VL) | Fanar-2-Oryx-IVU | Relative Reduction |
|---|---|---|---|
| Arabic-English code-switching | 11% | 6% | 45% |
| Arabic-Chinese mixing | 3% | 1.5% | 50% |
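
A simple way to approximate such a code-switching metric (our illustration; the report's exact measurement may differ) is the share of Latin letters among all letters in a response that is meant to be fully Arabic:

```python
import re

def latin_mix_ratio(text):
    """Fraction of letters that are Latin rather than Arabic script.

    A crude proxy for code-switching: a response meant to be entirely
    Arabic should score at or near zero.
    """
    arabic = len(re.findall(r"[\u0600-\u06FF]", text))  # core Arabic block
    latin = len(re.findall(r"[A-Za-z]", text))
    total = arabic + latin
    return latin / total if total else 0.0
```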

User Satisfaction (3,300 queries)

| Rating | Percentage |
|---|---|
| Like | 70% |
| Dislike | 25% |
| No Reaction | 5% |

Cultural Domain Excellence

Fanar-2-Oryx-IVU achieves leading performance in culturally-sensitive categories:

  • Food & Drink: Top performer
  • Islamic Culture: Top performer
  • Landmarks: Top performer
  • Country-specific content: Best for Algeria, Jordan, Palestine, Qatar, Sudan

Intended Use, Limitations & Ethical Considerations

Fanar-2-Oryx-IVU is built for:

  • Cultural heritage documentation and preservation
  • Educational applications teaching Arabic culture and history
  • Accessibility tools for Arabic-speaking visually impaired users
  • Content moderation for Arabic social media platforms
  • E-commerce product description generation in Arabic
  • Museum and tourism applications with multilingual support
  • Calligraphy and document analysis for historical texts
  • Research on Arabic vision-language understanding

Limitations:

  • May produce hallucinations despite mitigation strategies
  • Arabic text recognition in images remains challenging
  • Performance varies across different Arabic dialects
  • May reflect biases present in training data
  • Cannot perfectly understand all cultural nuances

Recommendations:

  • Verify critical information from generated responses
  • Use human review for sensitive applications
  • Provide user feedback mechanisms
  • Monitor for cultural appropriateness, hallucinations and errors
  • Consider fine-tuning for domain-specific needs
  • Implement fallback mechanisms for uncertain responses

Not Suitable For:

  • Medical diagnosis or legal advice
  • High-stakes decision-making
  • Situations requiring perfect accuracy
  • Replacing human judgment in cultural matters
  • Surveillance applications

Kindly refer to our Terms of Service and Privacy Policy.

The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.


Fanar Platform

While Fanar-2-Oryx-IVU is a powerful standalone model, it is part of the broader Fanar Platform—an integrated Arabic-centric multimodal AI ecosystem that provides enhanced capabilities and continuous updates. The platform includes:

Core Capabilities:

  • Text Generation: Multiple conversational models optimized for different tasks
  • Speech (Aura): Speech-to-text (short-form and long-form) and text-to-speech synthesis with Arabic dialect support and bilingual Arabic-English capabilities
  • Image Understanding (Oryx-IVU): Vision-language model for culturally-grounded image and video understanding including Arabic calligraphy recognition
  • Image Generation (Oryx-IG): Culturally-aligned text-to-image generation trained on taxonomy-driven data across 23,000+ cultural search terms
  • Machine Translation (FanarShaheen): High-quality bilingual Arabic↔English translation across diverse domains (e.g., news, STEM, and medical)
  • Poetry Generation (Diwan): Classical Arabic poetry generation respecting prosodic meters (Buhur) and maintaining diacritization accuracy

Specialized Systems:

  • Fanar-Sadiq: Multi-agent Islamic question-answering system with 9 specialized tools (Fiqh reasoning, Quran/Hadith retrieval, zakat/inheritance calculation, prayer times, and Hijri calendar). Deployed in production on IslamWeb and IslamOnline platforms.
  • Safety & Moderation: Fanar-Guard and culturally-informed content filtering trained on 468K annotated Arabic-English safety examples

Access Points:

  • Fanar Chat: Web conversational interface integrating all modalities
  • iOS and Android apps: Mobile apps for on-the-go access to the Fanar Platform
  • Fanar API: Programmatic access to models and specialized capabilities

The Fanar Platform continuously evolves with model updates, new capabilities, and improved safety mechanisms. For production deployments requiring the latest features, multimodal integration, cross-model orchestration, and ongoing support, we recommend using the Fanar Platform rather than the standalone models published here.


Citation

If you use Fanar-2-Oryx-IVU or the Fanar 2.0 GenAI platform in your research or applications, please cite:

@misc{fanarteam2026fanar20arabicgenerative,
      title={Fanar 2.0: Arabic Generative AI Stack}, 
      author={FANAR TEAM and Ummar Abbas and Mohammad Shahmeer Ahmad and Minhaj Ahmad and Abdulaziz Al-Homaid and Anas Al-Nuaimi and Enes Altinisik and Ehsaneddin Asgari and Sanjay Chawla and Shammur Chowdhury and Fahim Dalvi and Kareem Darwish and Nadir Durrani and Mohamed Elfeky and Ahmed Elmagarmid and Mohamed Eltabakh and Asim Ersoy and Masoomali Fatehkia and Mohammed Qusay Hashim and Majd Hawasly and Mohamed Hefeeda and Mus'ab Husaini and Keivin Isufaj and Soon-Gyo Jung and Houssam Lachemat and Ji Kim Lucas and Abubakr Mohamed and Tasnim Mohiuddin and Basel Mousi and Hamdy Mubarak and Ahmad Musleh and Mourad Ouzzani and Amin Sadeghi and Husrev Taha Sencar and Mohammed Shinoy and Omar Sinan and Yifan Zhang},
      year={2026},
      eprint={2603.16397},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.16397}, 
}

Acknowledgements

This project is from Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.

Special thanks to the Ministry of Communications and Information Technology, State of Qatar for their continued support by providing the compute infrastructure needed to develop and serve the platform through the Google Cloud Platform.


License

This model is licensed under the Apache 2.0 License.
