## Overview

Roa'ya-VL (رؤيا) is a bilingual Arabic-English vision-language model built on the LLaVA-NeXT architecture, combining the DeepSeek-OCR vision encoder with the Qwen2.5-3B-Instruct language model.

Developed at King Saud University, Saudi Arabia 🇸🇦
## Training Pipeline
| Stage | Description | Samples | Trainable Parts |
|---|---|---|---|
| Stage 1 | Vision-Language Alignment | ~558K | Projector only |
| Stage 2 | Bilingual Instruction Tuning | ~18M (FineVision + Arabic) | Projector + LLM |
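In Stage 1, only the projector is updated while the vision encoder and LLM stay frozen. A minimal PyTorch sketch of that freezing pattern, using a toy module with illustrative attribute names (`vision_tower`, `mm_projector`, `language_model` are stand-ins, not Roa'ya-VL's actual attributes):

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy stand-in for a LLaVA-style model; dimensions are illustrative."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(64, 32)      # stand-in for DeepSeek-OCR encoder
        self.mm_projector = nn.Sequential(         # stand-in for the MLP projector
            nn.Linear(32, 48), nn.GELU(), nn.Linear(48, 48)
        )
        self.language_model = nn.Linear(48, 10)    # stand-in for Qwen2.5-3B

model = ToyVLM()

# Stage-1 pattern: freeze everything, then unfreeze only the projector.
for p in model.parameters():
    p.requires_grad = False
for p in model.mm_projector.parameters():
    p.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

An optimizer built from `(p for p in model.parameters() if p.requires_grad)` then updates only the projector weights; Stage 2 would additionally unfreeze the language model.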
## Model Details
| Component | Specification |
|---|---|
| Vision Encoder | DeepSeek-OCR |
| Language Model | Qwen2.5-3B-Instruct |
| Projector | 2-layer MLP with GELU activation |
| Hidden Size | 2048 |
| Context Length | 32K tokens |
| Parameters | ~3B |
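The projector above can be sketched as a two-layer GELU MLP that maps vision features into the LLM's 2048-dimensional hidden space. Note this is an assumption-laden sketch: the output size matches the hidden size in the table, but the vision feature dimension (1024 here) and token count are illustrative:

```python
import torch
import torch.nn as nn

vision_dim = 1024   # assumed DeepSeek-OCR feature dimension (illustrative)
hidden_size = 2048  # LLM hidden size, per the table above

# Two-layer MLP projector with GELU, LLaVA-style.
projector = nn.Sequential(
    nn.Linear(vision_dim, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
)

tokens = torch.randn(1, 256, vision_dim)  # e.g. 256 visual tokens from the encoder
projected = projector(tokens)             # now in the LLM embedding space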
## Quick Start

### Installation

```bash
git clone https://github.com/yakoubbazi/Roaya-VL.git
cd Roaya-VL
pip install -r requirements.txt
```
### Inference

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX
from PIL import Image
import torch

# Load model
model_path = "BigData-KSU/Roaya-VL-3B"
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name="roaya-vl-3b"
)

# Process image
image = Image.open("image.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.bfloat16)

# Arabic prompt ("What is in this image?")
prompt = "<image>\nما هو الموجود في هذه الصورة؟"
# English prompt
# prompt = "<image>\nWhat is in this image?"

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.unsqueeze(0).to(model.device)

# Generate
with torch.no_grad():
    output = model.generate(input_ids, images=image_tensor, max_new_tokens=512)

response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
```
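For context, `tokenizer_image_token` splices a placeholder image token id into the text token sequence wherever `<image>` appears, so the model knows where to insert the projected visual tokens. A minimal pure-Python sketch of that idea (the word-hash toy tokenizer is illustrative only; `-200` is the placeholder id LLaVA uses for `IMAGE_TOKEN_INDEX`):

```python
IMAGE_TOKEN_INDEX = -200  # LLaVA's image placeholder id

def toy_tokenize(text):
    # Stand-in tokenizer: one pseudo-id per whitespace-separated word.
    return [hash(word) % 1000 for word in text.split()]

def splice_image_token(prompt, image_token=IMAGE_TOKEN_INDEX):
    """Tokenize the text around '<image>' and insert the placeholder id between chunks."""
    chunks = [toy_tokenize(chunk) for chunk in prompt.split("<image>")]
    ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            ids.append(image_token)
        ids.extend(chunk)
    return ids

ids = splice_image_token("<image> What is in this image?")
```

At inference time, the model replaces each `-200` entry with the projected visual tokens before running the language model.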
## Evaluation Results
### General VLM Benchmarks
| Model | MME | MMStar | MMBench_test (en) | SeedBench_image | ScienceQA_val | RealworldQA | AI2D_w/o M | Average |
|---|---|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 1733 | 37.9 | 69.5 | 70.2 | 68.7 | 55.0 | 61.8 | 61.8 |
| PaliGemma2-3B | 1658 | 52.7 | 60.7 | 71.6 | 94.3 | 58.3 | 72.2 | 68.0 |
| Phi-3.5-Vision-4B | 1846 | 47.5 | 76.0 | 71.2 | 92.2 | 57.9 | 77.8 | 70.9 |
| SmolVLM2-2B | 1764 | 46.0 | 43.0 | 70.9 | 90.0 | 58.4 | 74.9 | 64.8 |
| InternVL2.5-4B | 2338 | 58.3 | 81.1 | 74.1 | 97.0 | 64.3 | 81.4 | 78.5 |
| Qwen2.5VL-3B | 2171 | 54.3 | 78.2 | 73.3 | 81.4 | 65.4 | 81.6 | 74.4 |
| MaTVLM-3B | 1771 | 37.5 | 69.4 | 65.6 | 65.9 | 52.3 | 58.9 | 60.1 |
| Cobra-3B | 1346 | 34.7 | 55.9 | 63.3 | 60.3 | 51.0 | 46.8 | 52.3 |
| InfiniteVL-4B | 2126 | 55.6 | 79.0 | 72.9 | 93.4 | 67.3 | 77.2 | 75.8 |
| Roa’ya-VL-3B (ours) | — | — | — | — | — | — | — | — |
### OCR/Doc/VQA Benchmarks
| Model | ChartQA_test | TextVQA_val | DocVQA_test | OCRBench | MMMU_val | MathVista_mini | Average |
|---|---|---|---|---|---|---|---|
| TinyLLaVA-3B | 21.2 | 55.3 | 34.7 | 36.0 | 36.2 | 28.3 | 35.3 |
| PaliGemma2-3B | 33.6 | 63.0 | 71.6 | 60.1 | 30.3 | 27.7 | 47.7 |
| Phi-3.5-Vision-4B | 81.8 | 72.0 | 69.3 | 59.9 | 43.0 | 43.9 | 61.7 |
| SmolVLM2-2B | 68.8 | 73.2 | 80.0 | 72.9 | 42.0 | 51.5 | 64.7 |
| InternVL2.5-4B | 84.0 | 76.8 | 91.6 | 82.8 | 52.3 | 60.5 | 74.7 |
| Qwen2.5VL-3B | 84.0 | 79.6 | 93.9 | 79.7 | 49.6 | 62.3 | 74.9 |
| MaTVLM-3B | 20.0 | 53.2 | 33.0 | 35.1 | 34.4 | 28.5 | 34.0 |
| Cobra-3B | 17.9 | 47.9 | 24.0 | 30.7 | 31.5 | 22.3 | 29.1 |
| InfiniteVL-4B | 82.0 | 78.5 | 91.7 | 79.8 | 44.0 | 65.4 | 73.6 |
| Roa’ya-VL-3B (ours) | — | — | — | — | — | — | — |
## Requirements
- Python >= 3.10
- PyTorch >= 2.0
- Transformers >= 4.40.0
- Flash Attention 2 (recommended)
## Citation

```bibtex
@misc{roaya-vl-2025,
  title={Roa'ya-VL-3B (رؤيا): Best Practices for Building Bilingual Arabic-English Vision-Language Models},
  author={Bazi, Yakoub and Zuair, Mansour and Al Rahhal, Mohamad Mahmoud},
  year={2025},
  url={https://github.com/yakoubbazi/Roaya-VL}
}
```
## Acknowledgements
- LLaVA-NeXT for the base architecture
- Qwen2.5 for the language model
- DeepSeek-OCR for the vision encoder
## License
This project is licensed under the Apache 2.0 License.