---
license: apache-2.0
language:
  - en
  - ar
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - vision-language
  - multimodal
  - arabic
  - ocr
  - llava
  - qwen2.5
  - deepseek
  - bilingual
base_model:
  - Qwen/Qwen2.5-3B-Instruct
---
# Roa'ya-VL

**Roa'ya-VL (رؤيا): A Bilingual Arabic–English Vision-Language Model for Multimodal Understanding**

BigData@AI Research Team: Yakoub Bazi, Mansour Zuair and Mohamad Mahmoud Al Rahhal


[GitHub](https://github.com/yakoubbazi/Roaya-VL) | Paper (Coming Soon)


## Overview

Roa'ya-VL (رؤيا) is a bilingual Arabic–English vision-language model that pairs the DeepSeek-OCR vision encoder with the Qwen2.5-3B-Instruct language model.

Developed at the College of Computer and Information Sciences, King Saud University.


## Training Pipeline

| Stage | Description | Samples | Trainable Parts |
|---|---|---|---|
| Stage 1 | Vision-Language Alignment | ~558K | Projector only |
| Stage 2 | Bilingual Instruction Tuning | ~18M (FineVision + Arabic) | Projector + LLM |
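
The two stages differ mainly in which parameters receive gradients: the vision encoder stays frozen throughout, the projector is trained in both stages, and the LLM is unfrozen only in Stage 2. A minimal sketch of that freezing scheme in PyTorch (the attribute names `vision_encoder`, `projector`, and `llm` are illustrative assumptions, not the repo's actual module names):

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Apply the per-stage freezing scheme from the table above.

    Stage 1 (alignment): train the projector only.
    Stage 2 (instruction tuning): train projector + LLM.
    The submodule names below are assumptions for illustration.
    """
    for p in model.vision_encoder.parameters():
        p.requires_grad = False           # frozen in both stages
    for p in model.projector.parameters():
        p.requires_grad = True            # trained in both stages
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)    # unfrozen only in Stage 2
```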

## Model Details

| Component | Specification |
|---|---|
| Vision Encoder | DeepSeek-OCR vision encoder (fusion of SAM and CLIP) |
| Language Model | Qwen2.5-3B-Instruct |
| Projector | Two-layer MLP with GELU |
| Hidden Size | 2048 |
| Context Length | 32K tokens |
| Parameters | ~3B |
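
The two-layer MLP projector follows the LLaVA-style design: vision features are mapped into the LLM's 2048-dimensional embedding space through two linear layers with a GELU in between. A minimal sketch, where `vision_dim` is an assumed placeholder (only the 2048 hidden size is stated above):

```python
import torch.nn as nn

class Projector(nn.Module):
    """LLaVA-style two-layer MLP projector with GELU (sketch).

    vision_dim is an assumed placeholder; llm_dim=2048 matches the
    hidden size listed in the table above.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_features)
```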

## Quick Start

### Installation

```bash
conda create -n roaya3B python=3.11 -y
conda activate roaya3B
python -m pip install --upgrade pip
git clone https://github.com/yakoubbazi/Roaya-VL.git
cd Roaya-VL
pip install -r requirements.txt
```
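
To keep a local copy of the weights matching the `./checkpoints/Roaya-VL-3B` path used below, one option is `huggingface_hub` (this download step is a suggestion, not part of the repo's documented setup):

```python
# Optional: download the checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="BigData-KSU/Roaya-VL-3B",
    local_dir="./checkpoints/Roaya-VL-3B",
)
```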

### Inference

```python
from inference.Roaya_wrapper import RoayaVLWrapper

# Model location (local path or HF repo id)
model_path = "./checkpoints/Roaya-VL-3B"  # or "BigData-KSU/Roaya-VL-3B"

# Example image
image_path = "examples/Train_Pipeline.png"

# Arabic prompt ("Describe what is in this image")
prompt = "صف ما يوجد في هذه الصورة"

# Load model
model = RoayaVLWrapper(model_path, device="cuda", verbose=True)

# Generate
response = model.generate(
    prompt,
    images=[image_path],
    max_new_tokens=256,
    temperature=0.1
)

print("+++++++++++++++++++++++++++++++++++++++++++++++++++++")
print(response)
print("+++++++++++++++++++++++++++++++++++++++++++++++++++++")
```
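
Because the model is bilingual, the same call works with an English prompt; the low temperature (0.1) keeps outputs close to deterministic, which suits captioning and OCR-style queries. For example, reusing the `model` and `image_path` from above:

```python
# English prompt with the same wrapper and generation settings.
response_en = model.generate(
    "Describe what is in this image.",
    images=[image_path],
    max_new_tokens=256,
    temperature=0.1,
)
print(response_en)
```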

## Evaluation Results

Note: For Roa'ya-VL-3B, some evaluations were conducted on validation/dev splits because the official test sets are not publicly available. Results marked with * were obtained on dev/validation splits rather than the official test split.

### General VLM Benchmarks

| Model | MME | MMStar | MMBench_test (en) | SeedBench_image | ScienceQA_val | RealworldQA | AI2D_w/o M | Average |
|---|---|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 1733 | 37.9 | 69.5 | 70.2 | 68.7 | 55.0 | 61.8 | 60.52 |
| PaliGemma2-3B | 1658 | 52.7 | 60.7 | 71.6 | 94.3 | 58.3 | 72.2 | 68.30 |
| Phi-3.5-Vision-4B | 1846 | 47.5 | 76.0 | 71.2 | 92.2 | 57.9 | 77.8 | 70.43 |
| SmolVLM2-2B | 1764 | 46.0 | 43.0 | 70.9 | 90.0 | 58.4 | 74.9 | 63.87 |
| InternVL2.5-4B | 2338 | 58.3 | 81.1 | 74.1 | 97.0 | 64.3 | 81.4 | 76.03 |
| Qwen2.5VL-3B | 2171 | 54.3 | 78.2 | 73.3 | 81.4 | 65.4 | 81.6 | 72.37 |
| MaTVLM-3B | 1771 | 37.5 | 69.4 | 65.6 | 65.9 | 52.3 | 58.9 | 58.27 |
| Cobra-3B | 1346 | 34.7 | 55.9 | 63.3 | 60.3 | 51.0 | 46.8 | 50.33 |
| InfiniteVL-4B | 2126 | 55.6 | 79.0 | 72.9 | 93.4 | 67.3 | 77.2 | 74.23 |
| **Roa'ya-VL-3B (ours)** | 1847 | 50.66 | 69.61* | 71.17 | 83.30 | 59.38 | 84.29 | 69.74 |

MME uses a different score scale than the percentage-based benchmarks, so the Average column is computed over the remaining seven benchmarks.

### OCR/Doc/VQA Benchmarks

| Model | ChartQA_test | TextVQA_val | DocVQA_test | OCRBench | MMMU_val | Average |
|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 21.2 | 55.3 | 34.7 | 36.0 | 36.2 | 36.7 |
| PaliGemma2-3B | 33.6 | 63.0 | 71.6 | 60.1 | 30.3 | 51.7 |
| Phi-3.5-Vision-4B | 81.8 | 72.0 | 69.3 | 59.9 | 43.0 | 65.2 |
| SmolVLM2-2B | 68.8 | 73.2 | 80.0 | 72.9 | 42.0 | 67.4 |
| InternVL2.5-4B | 84.0 | 76.8 | 91.6 | 82.8 | 52.3 | 77.5 |
| Qwen2.5VL-3B | 84.0 | 79.6 | 93.9 | 79.7 | 49.6 | 77.4 |
| MaTVLM-3B | 20.0 | 53.2 | 33.0 | 35.1 | 34.4 | 35.1 |
| Cobra-3B | 17.9 | 47.9 | 24.0 | 30.7 | 31.5 | 30.4 |
| InfiniteVL-4B | 82.0 | 78.5 | 91.7 | 79.8 | 44.0 | 75.2 |
| **Roa'ya-VL-3B (ours)** | 77.80 | 61.68 | 83.98* | 60.9 | 40.44 | 64.96 |

## Citation

```bibtex
@misc{roaya-vl-2025,
  title={Roa'ya-VL (رؤيا): A Bilingual Arabic–English Vision-Language Model for Multimodal Understanding},
  author={Yakoub Bazi and Mansour Zuair and Mohamad Mahmoud Al Rahhal},
  year={2025},
  url={https://github.com/yakoubbazi/Roaya-VL}
}
```

## Acknowledgements


## License

This project is licensed under the Apache 2.0 License.