---
license: apache-2.0
language:
- en
- ar
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- arabic
- ocr
- llava
- qwen2.5
- deepseek
- bilingual
base_model:
- Qwen/Qwen2.5-3B-Instruct
---
# Roa'ya-VL (رؤيا): A Bilingual Arabic–English Vision-Language Model for Multimodal Understanding
BigData@AI Research Team: Yakoub Bazi, Mansour Zuair and Mohamad Mahmoud Al Rahhal
## Overview
Roa'ya-VL (رؤيا) is a bilingual Arabic–English vision-language model that combines the DeepSeek-OCR vision encoder with the Qwen2.5-3B-Instruct language model.
Developed at the College of Computer and Information Sciences, King Saud University.
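At a high level, the model follows the familiar LLaVA-style composition: the vision encoder turns the image into a sequence of patch features, a projector maps those features into the LLM embedding space, and the projected tokens are spliced into the text token sequence before generation. The snippet below is a minimal sketch of that data flow only; the module names and shapes are placeholders, not the actual Roa'ya-VL implementation.

```python
import torch

def vlm_forward_sketch(vision_encoder, projector, llm, pixel_values, text_embeds):
    """LLaVA-style data flow, illustrative only (not the actual Roa'ya-VL code)."""
    # 1) Encode the image into patch features: (batch, num_patches, vision_dim)
    vision_feats = vision_encoder(pixel_values)
    # 2) Project them into the LLM embedding space: (batch, num_patches, llm_dim)
    image_tokens = projector(vision_feats)
    # 3) Splice the image tokens into the text embedding sequence and decode.
    #    (Here they are simply prepended; real implementations place them at
    #    the image-placeholder position in the prompt.)
    inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs_embeds)
```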
## Training Pipeline
| Stage | Description | Samples | Trainable Parts |
|---|---|---|---|
| Stage 1 | Vision-Language Alignment | ~558K | Projector only |
| Stage 2 | Bilingual Instruction Tuning | ~18M (FineVision + Arabic) | Projector + LLM |
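As a rough illustration of the two-stage recipe in the table above, the sketch below shows how the trainable parts would typically be selected per stage. The submodule names (`vision_tower`, `projector`, `language_model`) are assumptions for illustration, not the actual attribute names in this repository.

```python
import torch.nn as nn

def set_stage_trainable(model: nn.Module, stage: int) -> None:
    """Select trainable parts per training stage (illustrative only).

    Stage 1: projector only; Stage 2: projector + LLM. The vision encoder
    stays frozen in both stages, as in the table above. Submodule names are
    assumptions, not the actual attribute names in this repository.
    """
    for p in model.vision_tower.parameters():
        p.requires_grad = False          # vision encoder frozen throughout
    for p in model.projector.parameters():
        p.requires_grad = True           # projector trained in both stages
    for p in model.language_model.parameters():
        p.requires_grad = (stage == 2)   # LLM unfrozen only in Stage 2
```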
## Model Details
| Component | Specification |
|---|---|
| Vision Encoder | DeepSeek-OCR vision encoder (fusion of SAM and CLIP) |
| Language Model | Qwen2.5-3B-Instruct |
| Projector | 2-layer MLP with GELU activation |
| Hidden Size | 2048 |
| Context Length | 32K tokens |
| Parameters | ~3B |
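For reference, a 2-layer GELU MLP projector of the kind listed above can be sketched as follows. The LLM-side width of 2048 follows the hidden size in the table; the vision-feature width is a placeholder, since it depends on the DeepSeek-OCR encoder output.

```python
import torch
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """2-layer MLP projector with GELU (LLaVA-style), illustrative only.

    llm_dim=2048 follows the hidden size in the table; vision_dim is a
    placeholder, since it depends on the DeepSeek-OCR encoder output.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        return self.mlp(vision_features)  # -> (batch, num_patches, llm_dim)
```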
## Quick Start
### Installation
```bash
conda create -n roaya3B python=3.11 -y
conda activate roaya3B
python -m pip install --upgrade pip
git clone https://github.com/yakoubbazi/Roaya-VL.git
cd Roaya-VL
pip install -r requirements.txt
```
### Inference
```python
from inference.Roaya_wrapper import RoayaVLWrapper

# Model location (local path or HF repo id)
model_path = "./checkpoints/Roaya-VL-3B"  # or "BigData-KSU/Roaya-VL-3B"

# Example image
image_path = "examples/Train_Pipeline.png"

# Arabic prompt
prompt = "صف ما يوجد في هذه الصورة"

# Load model
model = RoayaVLWrapper(model_path, device="cuda", verbose=True)

# Generate
response = model.generate(
    prompt,
    images=[image_path],
    max_new_tokens=256,
    temperature=0.1
)

print("+++++++++++++++++++++++++++++++++++++++++++++++++++++")
print(response)
print("+++++++++++++++++++++++++++++++++++++++++++++++++++++")
```
## Evaluation Results
Note: For Roa’ya-VL-3B, some evaluations were conducted on the validation/dev splits because the official test sets are not publicly available. Results marked with a * correspond to evaluations performed on dev/validation sets rather than the official test split.
### General VLM Benchmarks
| Model | MME | MMStar | MMBench_test (en) | SeedBench_image | ScienceQA_val | RealworldQA | AI2D_w/o M | Average (excl. MME) |
|---|---|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 1733 | 37.9 | 69.5 | 70.2 | 68.7 | 55.0 | 61.8 | 60.52 |
| PaliGemma2-3B | 1658 | 52.7 | 60.7 | 71.6 | 94.3 | 58.3 | 72.2 | 68.30 |
| Phi-3.5-Vision-4B | 1846 | 47.5 | 76.0 | 71.2 | 92.2 | 57.9 | 77.8 | 70.43 |
| SmolVLM2-2B | 1764 | 46.0 | 43.0 | 70.9 | 90.0 | 58.4 | 74.9 | 63.87 |
| InternVL2.5-4B | 2338 | 58.3 | 81.1 | 74.1 | 97.0 | 64.3 | 81.4 | 76.03 |
| Qwen2.5VL-3B | 2171 | 54.3 | 78.2 | 73.3 | 81.4 | 65.4 | 81.6 | 72.37 |
| MaTVLM-3B | 1771 | 37.5 | 69.4 | 65.6 | 65.9 | 52.3 | 58.9 | 58.27 |
| Cobra-3B | 1346 | 34.7 | 55.9 | 63.3 | 60.3 | 51.0 | 46.8 | 50.33 |
| InfiniteVL-4B | 2126 | 55.6 | 79.0 | 72.9 | 93.4 | 67.3 | 77.2 | 74.23 |
| Roa’ya-VL-3B (ours) | 1847 | 50.66 | 69.61* | 71.17 | 83.30 | 59.38 | 84.29 | 69.74 |
### OCR/Doc/VQA Benchmarks
| Model | ChartQA_test | TextVQA_val | DocVQA_test | OCRBench | MMMU_val | Average |
|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 21.2 | 55.3 | 34.7 | 36.0 | 36.2 | 36.7 |
| PaliGemma2-3B | 33.6 | 63.0 | 71.6 | 60.1 | 30.3 | 51.7 |
| Phi-3.5-Vision-4B | 81.8 | 72.0 | 69.3 | 59.9 | 43.0 | 65.2 |
| SmolVLM2-2B | 68.8 | 73.2 | 80.0 | 72.9 | 42.0 | 67.4 |
| InternVL2.5-4B | 84.0 | 76.8 | 91.6 | 82.8 | 52.3 | 77.5 |
| Qwen2.5VL-3B | 84.0 | 79.6 | 93.9 | 79.7 | 49.6 | 77.4 |
| MaTVLM-3B | 20.0 | 53.2 | 33.0 | 35.1 | 34.4 | 35.1 |
| Cobra-3B | 17.9 | 47.9 | 24.0 | 30.7 | 31.5 | 30.4 |
| InfiniteVL-4B | 82.0 | 78.5 | 91.7 | 79.8 | 44.0 | 75.2 |
| Roa’ya-VL-3B (ours) | 77.80 | 61.68 | 83.98* | 60.9 | 40.44 | 64.96 |
## Citation
```bibtex
@misc{roaya-vl-2025,
  title={Roa'ya-VL (رؤيا): A Bilingual Arabic–English Vision-Language Model for Multimodal Understanding},
  author={Yakoub Bazi and Mansour Zuair and Mohamad Mahmoud Al Rahhal},
  year={2025},
  url={https://github.com/yakoubbazi/Roaya-VL}
}
```
## Acknowledgements
- LLaVA-NeXT for the base architecture
- Qwen2.5 for the language model
- DeepSeek-OCR for the (SAM + CLIP) vision encoder
## License
This project is licensed under the Apache 2.0 License.