Roa'ya-VL

Roa'ya-VL-3B (رؤيا): Best Practices for Building Bilingual Arabic-English Vision-Language Models


GitHub | Paper (Coming Soon)


Overview

Roa'ya-VL (رؤيا) is a bilingual Arabic-English vision-language model built on the LLaVA-NeXT architecture, combining the DeepSeek-OCR vision encoder with the Qwen2.5-3B-Instruct language model.

Developed at King Saud University, Saudi Arabia 🇸🇦


Training Pipeline

| Stage | Description | Samples | Trainable Parts |
|---|---|---|---|
| Stage 1 | Vision-Language Alignment | ~558K | Projector only |
| Stage 2 | Bilingual Instruction Tuning | ~18M (FineVision + Arabic) | Projector + LLM |
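
The two-stage schedule can be expressed as a freezing policy over the three components. The sketch below illustrates the idea with toy stand-in modules (the module names and sizes are hypothetical, not the actual Roa'ya-VL code):

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy stand-in for the three trainable components of a LLaVA-style model."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)  # stands in for DeepSeek-OCR
        self.projector = nn.Linear(8, 8)       # stands in for the MLP projector
        self.llm = nn.Linear(8, 8)             # stands in for Qwen2.5-3B-Instruct

def set_stage(model: ToyVLM, stage: int) -> None:
    # Freeze everything, then re-enable the parts trained in each stage.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():     # Stage 1: projector only
        p.requires_grad = True
    if stage == 2:                             # Stage 2: projector + LLM
        for p in model.llm.parameters():
            p.requires_grad = True

m = ToyVLM()
set_stage(m, 1)
print([n for n, p in m.named_parameters() if p.requires_grad])
# ['projector.weight', 'projector.bias']
```

The vision encoder stays frozen in both stages; only Stage 2 unfreezes the language model alongside the projector.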

Model Details

| Component | Specification |
|---|---|
| Vision Encoder | DeepSeek-OCR |
| Language Model | Qwen2.5-3B-Instruct |
| Projector | 2-layer MLP with GELU |
| Hidden Size | 2048 |
| Context Length | 32K tokens |
| Parameters | ~3B |
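
The projector is the only component trained from scratch: a two-layer MLP with a GELU activation that maps vision-encoder features into the LLM embedding space (hidden size 2048). A minimal sketch, assuming an illustrative vision feature width of 1024 (the actual DeepSeek-OCR width may differ):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP with GELU: vision features -> LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)

feats = torch.randn(1, 196, 1024)   # e.g. 196 image patch tokens
print(Projector()(feats).shape)     # torch.Size([1, 196, 2048])
```

Each patch token is projected independently, so the number of visual tokens is preserved while the feature dimension is aligned with the LLM.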

Quick Start

Installation

git clone https://github.com/yakoubbazi/Roaya-VL.git
cd Roaya-VL
pip install -r requirements.txt

Inference

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX
from PIL import Image
import torch

# Load model
model_path = "BigData-KSU/Roaya-VL-3B"
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name="roaya-vl-3b"
)

# Process image
image = Image.open("image.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.bfloat16)

# Arabic prompt
prompt = "<image>\nما هو الموجود في هذه الصورة؟"

# English prompt  
# prompt = "<image>\nWhat is in this image?"

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.unsqueeze(0).to(model.device)

# Generate
with torch.no_grad():
    output = model.generate(input_ids, images=image_tensor, max_new_tokens=512)

response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
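
Under the hood, `tokenizer_image_token` splits the prompt at the `<image>` placeholder, tokenizes each text chunk, and splices in `IMAGE_TOKEN_INDEX` (−200 in LLaVA) where the projected image features are later inserted. A self-contained sketch with a stand-in tokenizer (the real function uses the model's tokenizer):

```python
IMAGE_TOKEN_INDEX = -200  # LLaVA's sentinel id for the image position

def fake_tokenize(text):
    # Hypothetical stand-in tokenizer: one id per whitespace-separated token.
    return [hash(t) % 1000 for t in text.split()]

def tokenizer_image_token_sketch(prompt, tokenize=fake_tokenize):
    ids = []
    for i, chunk in enumerate(prompt.split("<image>")):
        if i > 0:
            ids.append(IMAGE_TOKEN_INDEX)  # sentinel where image features go
        ids.extend(tokenize(chunk))
    return ids

ids = tokenizer_image_token_sketch("<image>\nWhat is in this image?")
print(ids[0])  # -200: the image placeholder comes first
```

At generation time, the model replaces the sentinel with the sequence of projected visual tokens before running the language model.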

Evaluation Results

General VLM Benchmarks

| Model | MME | MMStar | MMBench_test (en) | SeedBench_image | ScienceQA_val | RealworldQA | AI2D_w/o M | Average |
|---|---|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 1733 | 37.9 | 69.5 | 70.2 | 68.7 | 55.0 | 61.8 | 61.8 |
| PaliGemma2-3B | 1658 | 52.7 | 60.7 | 71.6 | 94.3 | 58.3 | 72.2 | 68.0 |
| Phi-3.5-Vision-4B | 1846 | 47.5 | 76.0 | 71.2 | 92.2 | 57.9 | 77.8 | 70.9 |
| SmolVLM2-2B | 1764 | 46.0 | 43.0 | 70.9 | 90.0 | 58.4 | 74.9 | 64.8 |
| InternVL2.5-4B | 2338 | 58.3 | 81.1 | 74.1 | 97.0 | 64.3 | 81.4 | 78.5 |
| Qwen2.5VL-3B | 2171 | 54.3 | 78.2 | 73.3 | 81.4 | 65.4 | 81.6 | 74.4 |
| MaTVLM-3B | 1771 | 37.5 | 69.4 | 65.6 | 65.9 | 52.3 | 58.9 | 60.1 |
| Cobra-3B | 1346 | 34.7 | 55.9 | 63.3 | 60.3 | 51.0 | 46.8 | 52.3 |
| InfiniteVL-4B | 2126 | 55.6 | 79.0 | 72.9 | 93.4 | 67.3 | 77.2 | 75.8 |
| Roa'ya-VL-3B (ours) | | | | | | | | |

OCR/Doc/VQA Benchmarks

| Model | ChartQA_test | TextVQA_val | DocVQA_test | OCRBench | MMMU_val | MathVista_mini | Average |
|---|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 21.2 | 55.3 | 34.7 | 36.0 | 36.2 | 28.3 | 35.3 |
| PaliGemma2-3B | 33.6 | 63.0 | 71.6 | 60.1 | 30.3 | 27.7 | 47.7 |
| Phi-3.5-Vision-4B | 81.8 | 72.0 | 69.3 | 59.9 | 43.0 | 43.9 | 61.7 |
| SmolVLM2-2B | 68.8 | 73.2 | 80.0 | 72.9 | 42.0 | 51.5 | 64.7 |
| InternVL2.5-4B | 84.0 | 76.8 | 91.6 | 82.8 | 52.3 | 60.5 | 74.7 |
| Qwen2.5VL-3B | 84.0 | 79.6 | 93.9 | 79.7 | 49.6 | 62.3 | 74.9 |
| MaTVLM-3B | 20.0 | 53.2 | 33.0 | 35.1 | 34.4 | 28.5 | 34.0 |
| Cobra-3B | 17.9 | 47.9 | 24.0 | 30.7 | 31.5 | 22.3 | 29.1 |
| InfiniteVL-4B | 82.0 | 78.5 | 91.7 | 79.8 | 44.0 | 65.4 | 73.6 |
| Roa'ya-VL-3B (ours) | | | | | | | |

Requirements

  • Python >= 3.10
  • PyTorch >= 2.0
  • Transformers >= 4.40.0
  • Flash Attention 2 (recommended)
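
A quick way to verify the version floors above before running inference (a convenience sketch using only the standard library, not part of the Roa'ya-VL repo):

```python
import sys
from importlib import metadata

def report(pkg: str, minimum: str) -> str:
    """Return a one-line status for an installed package version floor."""
    try:
        return f"{pkg} {metadata.version(pkg)} (need >= {minimum})"
    except metadata.PackageNotFoundError:
        return f"{pkg} MISSING (need >= {minimum})"

print(f"Python {sys.version.split()[0]} (need >= 3.10)")
for pkg, minimum in [("torch", "2.0"), ("transformers", "4.40.0")]:
    print(report(pkg, minimum))
```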

Citation

@misc{roaya-vl-2025,
  title={Roa'ya-VL-3B (رؤيا): Best Practices for Building Bilingual Arabic-English Vision-Language Models},
  author={Yakoub Bazi and Mansour Zuair and Mohamad Mahmoud Al Rahhal},
  year={2025},
  url={https://github.com/yakoubbazi/Roaya-VL}
}

Acknowledgements


License

This project is licensed under the Apache 2.0 License.
