Fanar-2-Oryx-IVU (Image & Video Understanding)
Fanar-2-Oryx-IVU is an Arabic-first vision-language model for culturally aware image and video understanding, developed by Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU), a member of Qatar Foundation for Education, Science, and Community Development. It is part of the Fanar 2.0 release, a comprehensive Arabic-centric multimodal generative AI platform that also includes text generation, image generation, and poetry generation.
Fanar-2-Oryx-IVU specializes in understanding images and videos with strong Arabic language support, cultural awareness, and Arabic calligraphy recognition. Trained on 62M bilingual examples (approximately 50/50 Arabic/English), the model outperforms its base model (Qwen2.5-VL-7B) on culturally relevant content, achieves 70% user satisfaction, and significantly reduces code-switching in Arabic responses.
We have published a report with all the details regarding the Fanar 2.0 GenAI platform. We also provide a chat interface, mobile apps for iOS and Android, and API access to our models and the GenAI platform (request access here).
Model Details
| Attribute | Value |
|---|---|
| Developed by | QCRI at HBKU |
| Sponsored by | Ministry of Communications and Information Technology, State of Qatar |
| Model Type | Vision-Language Model (VLM) |
| Base Model | Qwen2.5-VL-Instruct (7B) |
| Parameter Count | 7 Billion |
| Architecture | Dynamic-resolution ViT + LLM |
| Fine-tuning Method | LoRA (rank 128) on attention layers |
| Vision Encoder | Frozen during training |
| Input Modalities | Images, Videos, Text |
| Output | Text (Arabic/English) |
| Training Framework | LLaMAFactory |
| Training Data | 62M multimodal examples |
| Languages | Arabic, English |
| License | Apache 2.0 |
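To make the fine-tuning row above concrete: LoRA (rank 128) trains two low-rank factors per adapted attention matrix instead of updating the full weight. The NumPy sketch below uses illustrative dimensions and scaling factor, not the model's actual hidden sizes or hyperparameters:

```python
import numpy as np

# LoRA sketch: rather than updating a full weight W (d_out x d_in), train
# two low-rank factors B (d_out x r) and A (r x d_in) with r = 128, and
# apply W_eff = W + (alpha / r) * B @ A at inference time.
d_out, d_in, r, alpha = 512, 512, 128, 256  # illustrative, not the real sizes

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
A = rng.standard_normal((r, d_in)).astype(np.float32) * 0.01
B = np.zeros((d_out, r), dtype=np.float32)  # B starts at zero: no initial drift

W_eff = W + (alpha / r) * B @ A

# Trainable parameters per adapted matrix vs. full fine-tuning:
lora_params = A.size + B.size   # r * (d_in + d_out)
full_params = W.size            # d_out * d_in
print(lora_params, full_params)
```

At these toy sizes LoRA trains half as many parameters per matrix; at realistic hidden sizes (thousands) the ratio is far smaller, which is what makes rank-128 adapters cheap relative to full fine-tuning.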
Model Training
Training Data (62M Examples)
Fanar-2-Oryx-IVU was trained on a comprehensive multimodal dataset with balanced Arabic-English representation (approximately 50/50):
1. Cultural Content (24M VQA pairs)
- 240K internally collected images from taxonomy-driven crawling
- Coverage: 22 Arab countries across cultural categories
- Dense supervision: Up to 63 QA pairs per image
- Bilingual VQA synthesis: English + Modern Standard Arabic
- Null-field supervision: Explicit "absence" questions to reduce hallucinations
- Generated via Gemini 2.5 Flash with structured metadata
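As an illustration of the null-field idea, a training record can pair ordinary questions with an explicit absence question so the model learns to state that an attribute is not present instead of hallucinating it. The schema and filename below are hypothetical, not the actual dataset format:

```python
# Hypothetical record schema illustrating null-field supervision.
record = {
    "image": "majlis_example.jpg",  # hypothetical filename
    "qa_pairs": [
        {"q": "What traditional seating is shown?",
         "a": "A majlis with floor cushions along the walls."},
        # Null-field question: the asked-about attribute is absent,
        # and the reference answer says so explicitly.
        {"q": "What text appears on the wall?",
         "a": "No text is visible in the image."},
    ],
}

# Count the explicit-absence answers in this record.
absence_answers = [p["a"] for p in record["qa_pairs"] if p["a"].startswith("No ")]
print(len(absence_answers))
```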
2. Arabic Fonts & Calligraphy (54K pairs)
- 20K calligraphy images featuring Qur'anic verses
- 5 major Arabic scripts: Thuluth (الثلث), Naskh (النسخ), Ruq'ah (الرقعة), Kufi (الكوفي), Diwani (الديواني)
- Dual objectives:
- Content identification (transcribing Arabic text)
- Script classification (recognizing calligraphic style)
- All prompts and responses are in Arabic
3. Object Detection & Localization (1.6M pairs)
- Based on AllenAI public datasets
- Enhanced with instance-level bounding boxes
- WordNet-style taxonomic expansion for robust semantic coverage
- Point-based grounding: (x,y) coordinate lists for spatial reasoning
- Bilingual: 800K English + 800K Arabic (translated)
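A sketch of consuming the point-based grounding output: the exact serialization used in training is not documented here, so the textual `(x, y)` convention parsed below is an assumption:

```python
import re

def parse_points(text: str) -> list[tuple[int, int]]:
    """Extract integer (x, y) coordinate pairs from an annotation/model string."""
    return [(int(x), int(y))
            for x, y in re.findall(r"\((\d+)\s*,\s*(\d+)\)", text)]

# Hypothetical grounding string in the assumed format:
points = parse_points("camels: (120, 340), (415, 298); tent: (88, 102)")
print(points)
```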
4. General Image Captioning (34M pairs)
- 566K source images from Pixmo dataset
- Detailed audio-transcribed captions (high quality)
- 27 paraphrased templates per language to increase diversity
- 17M English + 17M Arabic caption pairs
5. Text-only Instruction (1.9M)
- UltraChat in English + Arabic translation
- Maintains dialogue capability independent of visual input
- Supports mixed text-visual interactions in realistic deployments
Training Methodology
- Parameter-efficient fine-tuning: LoRA (rank 128) on attention layers
- Vision encoder frozen: Preserves pretrained visual representations
- Multi-run training with TIES merging: Combines complementary strengths
- Training scale: 16 nodes, approximately 2 weeks
- Language balance: Strict 50/50 Arabic-English distribution maintained
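The TIES merging step above combines task vectors from multiple runs by trimming low-magnitude updates, electing a per-parameter sign, and averaging only the values that agree with it. A minimal sketch on flat NumPy vectors (the trim fraction and toy values are illustrative, not the training configuration):

```python
import numpy as np

def ties_merge(base: np.ndarray, finetuned: list[np.ndarray],
               keep: float = 0.2) -> np.ndarray:
    """Merge several fine-tuned variants of `base` via TIES (sketch)."""
    task_vectors = [ft - base for ft in finetuned]
    trimmed = []
    for tv in task_vectors:
        # 1) Trim: zero out all but the top-`keep` fraction by magnitude.
        k = max(1, int(keep * tv.size))
        thresh = np.sort(np.abs(tv))[-k]
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    stacked = np.stack(trimmed)
    # 2) Elect sign: per-parameter sign of the summed trimmed updates.
    elected = np.sign(stacked.sum(axis=0))
    # 3) Disjoint mean: average only values whose sign matches the elected one.
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    num = np.where(agree, stacked, 0.0).sum(axis=0)
    den = np.maximum(agree.sum(axis=0), 1)
    return base + num / den

base = np.zeros(4)
merged = ties_merge(base, [np.array([1.0, -2.0, 0.1, 0.0]),
                           np.array([1.0,  2.0, 0.0, 3.0])])
print(merged)
```

Sign election is what resolves the conflicting second parameter here: the run with the larger-magnitude update wins rather than the two cancelling out.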
Key Innovations
- Taxonomy-guided cultural crawling across 22 Arab countries
- WordNet-style augmentation for synonyms, hypernyms, long-tail concepts
- Faithfulness by design: Null-field supervision for hallucination reduction
- Point-based grounding: (x,y) coordinates for spatial understanding
- Native Arabic calligraphy recognition: 5 major script styles
Custom Evaluation Benchmarks
Fanar-2-Oryx-IVU was evaluated on multiple custom benchmarks designed specifically for Arabic cultural and linguistic assessment:
1. Oryx-Almieyar (12K questions)
- 200 images (10 per country, 20 Arab countries)
- 30 dialect experts for manual annotation
- Three language variants: English, MSA, country-specific dialects
- Country-level diagnostic analysis for geographic coverage
2. Oryx-BloomBench (7,747 pairs)
- Bilingual (English/Arabic)
- 6 Bloom's taxonomy levels:
- Remember (2,948)
- Understand (1,592)
- Analyze (1,431)
- Create (685)
- Evaluate (592)
- Apply (499)
- Tests reasoning depth beyond surface perception
3. TaskGalaxy Subset (12K samples)
- Broad regression test for general capabilities
- Drawn from a benchmark spanning 19,227 hierarchical vision task types
- Bilingual Arabic/English
- Prevents capability degradation during Arabic optimization
Getting Started
Oryx-IVU is compatible with the Hugging Face transformers library. Here's how to load and use the model:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_name = "QCRI/Fanar-2-Oryx-IVU"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load image
image = Image.open("path/to/image.jpg")

# Prepare conversation (supports Arabic or English)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "ما الذي تراه في هذه الصورة؟"}  # "What do you see in this image?"
        ]
    }
]

# Process and generate
text_prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated answer is decoded
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
generated_text = processor.batch_decode(
    trimmed_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)
print(generated_text[0])
```
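The model also accepts video input. A common approach is to sample a fixed number of frames uniformly from the decoded clip and hand them to the processor; the frame count below and the note about the `videos=` argument are application-level choices, not documented requirements:

```python
# Uniform frame sampling for video input (sketch). `frames` stands in for a
# list of decoded video frames (e.g. PIL images); integers are used here only
# so the sampling logic is easy to verify.
def sample_frames(frames: list, num_frames: int = 8) -> list:
    """Uniformly sample `num_frames` frames (or fewer, if the clip is short)."""
    if len(frames) <= num_frames:
        return list(frames)
    step = (len(frames) - 1) / (num_frames - 1)
    return [frames[round(i * step)] for i in range(num_frames)]

sampled = sample_frames(list(range(100)), num_frames=8)
print(sampled)
```

The sampled frames can then be passed to the processor (e.g. via its video input argument) with a `{"type": "video"}` entry in the message content, mirroring the image example above.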
Multi-turn Conversation
```python
# First turn
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is this landmark?"}
        ]
    }
]
# Run the same apply_chat_template / processor / generate steps as above
# to obtain generated_text for this turn.

# Second turn (building on context)
messages.append({"role": "assistant", "content": generated_text[0]})
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "Tell me more about its history"}]
})
# Re-apply the chat template to the full message list and generate again;
# the image from the first turn must be passed to the processor each time.
```
Evaluation
Multiple-Choice Benchmarks (Arabic)
| Model | Arabic Culture | CamelBench | BloomBench | TaskGalaxy |
|---|---|---|---|---|
| Fanar-2-Oryx-IVU | 48.0% | 45.0% | 58.0% | 74.0% |
| Qwen2.5-VL (base) | 48.0% | 45.0% | 58.0% | 74.0% |
| Gemma-3-12B | 40.0% | 50.0% | 48.0% | 20.0% |
| Qwen2-VL-7B | 30.0% | 41.0% | 37.0% | 51.0% |
| AIN-7B | 33.0% | 45.0% | 45.0% | 61.0% |
Note: Oryx-IVU matches its base model on multiple-choice benchmarks but excels in generation quality and Arabic coherence.
Generative Evaluation (LLM-as-a-Judge, 1-5 scale)
Evaluated on 3,300 real user queries with Gemini 2.5 Flash as judge:
| Model | Average Score | Comments |
|---|---|---|
| GPT-4o | 4.51 | Strongest overall |
| Fanar-2-Oryx-IVU | 3.03 | Best among similar-sized models |
| Qwen3-VL | 2.96 | Newer but lower quality |
| Qwen2.5-VL (base) | 2.76 | Our base model |
| AIN-7B | 2.23 | Similar size competitor |
| Qwen2-VL | 2.21 | Older version |
Key Achievements:
- Outperforms base model by +0.27 points (10% relative improvement)
- Outperforms newer Qwen3-VL despite being based on older Qwen2.5
- Best among all tested 7B-class models
Language Consistency Improvements
| Metric | Base Model (Qwen2.5-VL) | Fanar-2-Oryx-IVU | Improvement |
|---|---|---|---|
| Arabic-English code-switching | 11% | 6% | 45% relative reduction |
| Arabic-Chinese mixing | 3% | 1.5% | 50% relative reduction |
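A rough way to flag code-switching of the kind measured above is to count per-script alphabetic characters in a response. The heuristic below is a sketch for illustration, not the evaluation protocol behind the table:

```python
def script_mix(text: str) -> dict:
    """Count alphabetic characters per script (crude code-switching signal)."""
    counts = {"arabic": 0, "latin": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        if "\u0600" <= ch <= "\u06FF":   # core Arabic Unicode block
            counts["arabic"] += 1
        elif ch.isascii():
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts

# An Arabic sentence with an embedded English word ("This is a picture of a
# mosque in Doha") registers as mixed-script:
mixed = script_mix("هذه صورة لـ mosque في الدوحة")
print(mixed)
```

A response would count as code-switched when both the Arabic and Latin tallies are nonzero; a production metric would also need to whitelist legitimate Latin content such as proper nouns, URLs, and code.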
User Satisfaction (3,300 queries)
| Rating | Percentage |
|---|---|
| Like | 70% |
| Dislike | 25% |
| No Reaction | 5% |
Cultural Domain Excellence
Fanar-2-Oryx-IVU achieves leading performance in culturally-sensitive categories:
- Food & Drink: Top performer
- Islamic Culture: Top performer
- Landmarks: Top performer
- Country-specific content: Best for Algeria, Jordan, Palestine, Qatar, Sudan
Intended Use, Limitations & Ethical Considerations
Fanar-2-Oryx-IVU is built for:
- Cultural heritage documentation and preservation
- Educational applications teaching Arabic culture and history
- Accessibility tools for Arabic-speaking visually impaired users
- Content moderation for Arabic social media platforms
- E-commerce product description generation in Arabic
- Museum and tourism applications with multilingual support
- Calligraphy and document analysis for historical texts
- Research on Arabic vision-language understanding
Limitations:
- May produce hallucinations despite mitigation strategies
- Arabic text recognition in images remains challenging
- Performance varies across different Arabic dialects
- May reflect biases present in training data
- Cannot perfectly understand all cultural nuances
Recommendations:
- Verify critical information from generated responses
- Use human review for sensitive applications
- Provide user feedback mechanisms
- Monitor for cultural appropriateness, hallucinations and errors
- Consider fine-tuning for domain-specific needs
- Implement fallback mechanisms for uncertain responses
Not Suitable For:
- Medical diagnosis or legal advice
- High-stakes decision-making
- Situations requiring perfect accuracy
- Replacing human judgment in cultural matters
- Surveillance applications
Kindly refer to our Terms of Service and Privacy Policy.
The output generated by this model is not considered a statement of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.
Fanar Platform
While Fanar-2-Oryx-IVU is a powerful standalone model, it is part of the broader Fanar Platform—an integrated Arabic-centric multimodal AI ecosystem that provides enhanced capabilities and continuous updates. The platform includes:
Core Capabilities:
- Text Generation: Multiple conversational models optimized for different tasks
- Speech (Aura): Speech-to-text (short-form and long-form) and text-to-speech synthesis with Arabic dialect support and bilingual Arabic-English capabilities
- Image Understanding (Oryx-IVU): Vision-language model for culturally-grounded image and video understanding including Arabic calligraphy recognition
- Image Generation (Oryx-IG): Culturally-aligned text-to-image generation trained on taxonomy-driven data across 23,000+ cultural search terms
- Machine Translation (FanarShaheen): High-quality bilingual Arabic↔English translation across diverse domains (e.g., news, STEM, and medical)
- Poetry Generation (Diwan): Classical Arabic poetry generation respecting prosodic meters (Buhur) and maintaining diacritization accuracy
Specialized Systems:
- Fanar-Sadiq: Multi-agent Islamic question-answering system with 9 specialized tools (Fiqh reasoning, Quran/Hadith retrieval, zakat/inheritance calculation, prayer times, and Hijri calendar). Deployed in production on IslamWeb and IslamOnline platforms.
- Safety & Moderation: Fanar-Guard and culturally-informed content filtering trained on 468K annotated Arabic-English safety examples
Access Points:
- Fanar Chat: Web conversational interface integrating all modalities
- iOS and Android apps: Mobile apps for on-the-go access to the Fanar Platform
- Fanar API: Programmatic access to models and specialized capabilities
The Fanar Platform continuously evolves with model updates, new capabilities, and improved safety mechanisms. For production deployments requiring the latest features, multimodal integration, cross-model orchestration, and ongoing support, we recommend using the Fanar Platform rather than the standalone models published here.
Citation
If you use Fanar-2-Oryx-IVU or the Fanar 2.0 GenAI platform in your research or applications, please cite:
```bibtex
@misc{fanarteam2026fanar20arabicgenerative,
  title={Fanar 2.0: Arabic Generative AI Stack},
  author={FANAR TEAM and Ummar Abbas and Mohammad Shahmeer Ahmad and Minhaj Ahmad and Abdulaziz Al-Homaid and Anas Al-Nuaimi and Enes Altinisik and Ehsaneddin Asgari and Sanjay Chawla and Shammur Chowdhury and Fahim Dalvi and Kareem Darwish and Nadir Durrani and Mohamed Elfeky and Ahmed Elmagarmid and Mohamed Eltabakh and Asim Ersoy and Masoomali Fatehkia and Mohammed Qusay Hashim and Majd Hawasly and Mohamed Hefeeda and Mus'ab Husaini and Keivin Isufaj and Soon-Gyo Jung and Houssam Lachemat and Ji Kim Lucas and Abubakr Mohamed and Tasnim Mohiuddin and Basel Mousi and Hamdy Mubarak and Ahmad Musleh and Mourad Ouzzani and Amin Sadeghi and Husrev Taha Sencar and Mohammed Shinoy and Omar Sinan and Yifan Zhang},
  year={2026},
  eprint={2603.16397},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.16397},
}
```
Acknowledgements
This project is from Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.
Special thanks to the Ministry of Communications and Information Technology, State of Qatar for their continued support by providing the compute infrastructure needed to develop and serve the platform through the Google Cloud Platform.
License
This model is licensed under the Apache 2.0 License.