---
license: apache-2.0
language:
- en
- ar
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- arabic
- ocr
- llava
- qwen2.5
- deepseek
- bilingual
base_model:
- Qwen/Qwen2.5-3B-Instruct
---
# Roa'ya-VL (رؤيا): A Bilingual Arabic–English Vision-Language Model for Multimodal Understanding
BigData@AI Research Team: Yakoub Bazi, Mansour Zuair and Mohamad Mahmoud Al Rahhal
## Overview
Roa'ya-VL (رؤيا) is a bilingual Arabic–English vision-language model that combines the DeepSeek-OCR vision encoder with the Qwen2.5-3B-Instruct language model.
Developed at the College of Computer and Information Sciences, King Saud University.
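At a high level, the model follows the familiar LLaVA-style composition: the vision encoder turns the image into a sequence of patch features, a projector maps those features into the LLM embedding space, and the projected tokens are spliced into the text token sequence before generation. The snippet below is a minimal sketch of that data flow only; the module names and shapes are placeholders, not the actual Roa'ya-VL implementation.

```python
import torch

def vlm_forward_sketch(vision_encoder, projector, llm, pixel_values, text_embeds):
    """LLaVA-style data flow, illustrative only (not the actual Roa'ya-VL code)."""
    # 1) Encode the image into patch features: (batch, num_patches, vision_dim)
    vision_feats = vision_encoder(pixel_values)
    # 2) Project them into the LLM embedding space: (batch, num_patches, llm_dim)
    image_tokens = projector(vision_feats)
    # 3) Splice the image tokens into the text embedding sequence and decode.
    #    (Here they are simply prepended; real implementations place them at
    #    the image-placeholder position in the prompt.)
    inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs_embeds)
```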
## Training Pipeline
| Stage | Description | Samples | Trainable Parts |
|---|---|---|---|
| Stage 1 | Vision-Language Alignment | ~558K | Projector only |
| Stage 2 | Bilingual Instruction Tuning | ~18M (FineVision + Arabic) | Projector + LLM |
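As a rough illustration of the two-stage recipe in the table above, the sketch below shows how the trainable parts would typically be selected per stage. The submodule names (`vision_tower`, `projector`, `language_model`) are assumptions for illustration, not the actual attribute names in this repository.

```python
import torch.nn as nn

def set_stage_trainable(model: nn.Module, stage: int) -> None:
    """Select trainable parts per training stage (illustrative only).

    Stage 1: projector only; Stage 2: projector + LLM. The vision encoder
    stays frozen in both stages, as in the table above. Submodule names are
    assumptions, not the actual attribute names in this repository.
    """
    for p in model.vision_tower.parameters():
        p.requires_grad = False          # vision encoder frozen throughout
    for p in model.projector.parameters():
        p.requires_grad = True           # projector trained in both stages
    for p in model.language_model.parameters():
        p.requires_grad = (stage == 2)   # LLM unfrozen only in Stage 2
```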
## Model Details
| Component | Specification |
|---|---|
| Vision Encoder | DeepSeek-OCR vision encoder (fusion of SAM and CLIP) |
| Language Model | Qwen2.5-3B-Instruct |
| Projector | 2-layer MLP with GELU activation |
| Hidden Size | 2048 |
| Context Length | 32K tokens |
| Parameters | ~3B |
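For reference, a 2-layer GELU MLP projector of the kind listed above can be sketched as follows. The LLM-side width of 2048 follows the hidden size in the table; the vision-feature width is a placeholder, since it depends on the DeepSeek-OCR encoder output.

```python
import torch
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """2-layer MLP projector with GELU (LLaVA-style), illustrative only.

    llm_dim=2048 follows the hidden size in the table; vision_dim is a
    placeholder, since it depends on the DeepSeek-OCR encoder output.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        return self.mlp(vision_features)  # -> (batch, num_patches, llm_dim)
```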
## Quick Start
### Installation
```bash
conda create -n roaya3B python=3.11 -y
conda activate roaya3B
python -m pip install --upgrade pip
git clone https://github.com/yakoubbazi/Roaya-VL.git
cd Roaya-VL
pip install -r requirements.txt
```
### Inference
```python
from inference.Roaya_wrapper import RoayaVLWrapper

# Model location (local path or HF repo id)
model_path = "./checkpoints/Roaya-VL-3B"  # or "BigData-KSU/Roaya-VL-3B"

# Example image
image_path = "examples/Train_Pipeline.png"

# Arabic prompt
prompt = "صف ما يوجد في هذه الصورة"

# Load model
model = RoayaVLWrapper(model_path, device="cuda", verbose=True)

# Generate
response = model.generate(
    prompt,
    images=[image_path],
    max_new_tokens=256,
    temperature=0.1
)

print("+++++++++++++++++++++++++++++++++++++++++++++++++++++")
print(response)
print("+++++++++++++++++++++++++++++++++++++++++++++++++++++")
```
## Evaluation Results
Note: For Roa’ya-VL-3B, some evaluations were conducted on the validation/dev splits because the official test sets are not publicly available. Results marked with a * correspond to evaluations performed on dev/validation sets rather than the official test split.
### General VLM Benchmarks
| Model | MME | MMStar | MMBench_test (en) | SeedBench_image | ScienceQA_val | RealworldQA | AI2D_w/o M | Average (excl. MME) |
|---|---|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 1733 | 37.9 | 69.5 | 70.2 | 68.7 | 55.0 | 61.8 | 60.52 |
| PaliGemma2-3B | 1658 | 52.7 | 60.7 | 71.6 | 94.3 | 58.3 | 72.2 | 68.30 |
| Phi-3.5-Vision-4B | 1846 | 47.5 | 76.0 | 71.2 | 92.2 | 57.9 | 77.8 | 70.43 |
| SmolVLM2-2B | 1764 | 46.0 | 43.0 | 70.9 | 90.0 | 58.4 | 74.9 | 63.87 |
| InternVL2.5-4B | 2338 | 58.3 | 81.1 | 74.1 | 97.0 | 64.3 | 81.4 | 76.03 |
| Qwen2.5VL-3B | 2171 | 54.3 | 78.2 | 73.3 | 81.4 | 65.4 | 81.6 | 72.37 |
| MaTVLM-3B | 1771 | 37.5 | 69.4 | 65.6 | 65.9 | 52.3 | 58.9 | 58.27 |
| Cobra-3B | 1346 | 34.7 | 55.9 | 63.3 | 60.3 | 51.0 | 46.8 | 50.33 |
| InfiniteVL-4B | 2126 | 55.6 | 79.0 | 72.9 | 93.4 | 67.3 | 77.2 | 74.23 |
| Roa’ya-VL-3B (ours) | 1847 | 50.66 | 69.61* | 71.17 | 83.30 | 59.38 | 84.29 | 69.74 |
### OCR/Doc/VQA Benchmarks
| Model | ChartQA_test | TextVQA_val | DocVQA_test | OCRBench | MMMU_val | Average |
|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 21.2 | 55.3 | 34.7 | 36.0 | 36.2 | 36.7 |
| PaliGemma2-3B | 33.6 | 63.0 | 71.6 | 60.1 | 30.3 | 51.7 |
| Phi-3.5-Vision-4B | 81.8 | 72.0 | 69.3 | 59.9 | 43.0 | 65.2 |
| SmolVLM2-2B | 68.8 | 73.2 | 80.0 | 72.9 | 42.0 | 67.4 |
| InternVL2.5-4B | 84.0 | 76.8 | 91.6 | 82.8 | 52.3 | 77.5 |
| Qwen2.5VL-3B | 84.0 | 79.6 | 93.9 | 79.7 | 49.6 | 77.4 |
| MaTVLM-3B | 20.0 | 53.2 | 33.0 | 35.1 | 34.4 | 35.1 |
| Cobra-3B | 17.9 | 47.9 | 24.0 | 30.7 | 31.5 | 30.4 |
| InfiniteVL-4B | 82.0 | 78.5 | 91.7 | 79.8 | 44.0 | 75.2 |
| Roa’ya-VL-3B (ours) | 77.80 | 61.68 | 83.98* | 60.9 | 40.44 | 64.96 |
## Citation
```bibtex
@misc{roaya-vl-2025,
  title={Roa'ya-VL (رؤيا): A Bilingual Arabic–English Vision-Language Model for Multimodal Understanding},
  author={Yakoub Bazi and Mansour Zuair and Mohamad Mahmoud Al Rahhal},
  year={2025},
  url={https://github.com/yakoubbazi/Roaya-VL}
}
```
## Acknowledgements
- LLaVA-NeXT for the base architecture
- Qwen2.5 for the language model
- DeepSeek-OCR for the (SAM + CLIP) vision encoder
## License
This project is licensed under the Apache 2.0 License.