Instructions to use BigData-KSU/Roaya-VL-3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use BigData-KSU/Roaya-VL-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="BigData-KSU/Roaya-VL-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("BigData-KSU/Roaya-VL-3B", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use BigData-KSU/Roaya-VL-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "BigData-KSU/Roaya-VL-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "BigData-KSU/Roaya-VL-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/BigData-KSU/Roaya-VL-3B

SGLang

How to use BigData-KSU/Roaya-VL-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "BigData-KSU/Roaya-VL-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "BigData-KSU/Roaya-VL-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "BigData-KSU/Roaya-VL-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "BigData-KSU/Roaya-VL-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use BigData-KSU/Roaya-VL-3B with Docker Model Runner:
```
docker model run hf.co/BigData-KSU/Roaya-VL-3B
```

Roa'ya-VL (رؤيا): A Bilingual Arabic–English Vision-Language Model for Multimodal Understanding

BigData@AI Research Team: Yakoub Bazi, Mansour Zuair and Mohamad Mahmoud Al Rahhal

GitHub | Paper (Coming Soon)

Overview

Roa'ya-VL (رؤيا ) is a Bilingual Arabic-English vision-language model based on DeepSeek-OCR vision encoder and Qwen2.5-3B-Instruct language model.

Developed at College of Computer and Information Sciences, King Saud University.

Training Pipeline

Stage	Description	Samples	Trainable Parts
Stage 1	Vision-Language Alignment	~558K	Projector only
Stage 2	Bilingual Instruction Tuning	~18M (Finevision+Arabic))	Projector + LLM

Model Details

Component	Specification
Vision Encoder	DeepSeek-OCR Vision Encoder (Fusion on SAM and CLIP)
Language Model	Qwen2.5-3B-Instruct
Projector	MLP 2x GELU
Hidden Size	2048
Context Length	32K tokens
Parameters	~3B

Quick Start

Installation

conda create -n roaya3B python=3.11 -y
conda activate roaya3B
python -m pip install --upgrade pip
git clone https://github.com/yakoubbazi/Roaya-VL.git
cd Roaya-VL
pip install -r requirements.txt

Inference

from inference.Roaya_wrapper import RoayaVLWrapper

# Model location (local path or HF repo id)
model_path = "./checkpoints/Roaya-VL-3B"  # or "BigData-KSU/Roaya-VL-3B"

# Example image
image_path = "examples/Train_Pipeline.png"

# Arabic prompt
prompt = "صف ما يوجد في هذه الصورة"

# Load model
model = RoayaVLWrapper(model_path, device="cuda", verbose=True)

# Generate
response = model.generate(
    prompt,
    images=[image_path],
    max_new_tokens=256,
    temperature=0.1
)

print("+++++++++++++++++++++++++++++++++++++++++++++++++++++")
print(response)
print("+++++++++++++++++++++++++++++++++++++++++++++++++++++")

Evaluation Results

General VLM Benchmarks

|## Evaluation Results

Note: For Roa’ya-VL-3B, some evaluations were conducted on the validation/dev splits because the official test sets are not publicly available. Results marked with a * correspond to evaluations performed on dev/validation sets rather than the official test split.

General VLM Benchmarks

Model	MME	MMStar	MMBench_test (en)	SeedBench_image	ScienceQA_val	RealworldQA	AI2D_w/o M	Average
TinyLLaVA-3B	1733	37.9	69.5	70.2	68.7	55.0	61.8	60.52
PaliGemma2-3B	1658	52.7	60.7	71.6	94.3	58.3	72.2	68.30
Phi-3.5-Vision-4B	1846	47.5	76.0	71.2	92.2	57.9	77.8	70.43
SmolVLM2-2B	1764	46.0	43.0	70.9	90.0	58.4	74.9	63.87
InternVL2.5-4B	2338	58.3	81.1	74.1	97.0	64.3	81.4	76.03
Qwen2.5VL-3B	2171	54.3	78.2	73.3	81.4	65.4	81.6	72.37
MaTVLM-3B	1771	37.5	69.4	65.6	65.9	52.3	58.9	58.27
Cobra-3B	1346	34.7	55.9	63.3	60.3	51.0	46.8	50.33
InfiniteVL-4B	2126	55.6	79.0	72.9	93.4	67.3	77.2	74.23
Roa’ya-VL-3B (ours)	1847	50.66	69.61*	71.17	83.30	59.38	84.29	69.74

OCR/Doc/VQA Benchmarks

Model	ChartQA_test	TextVQA_val	DocVQA_test	OCRBench	MMMU_val	Average
TinyLLaVA-3B	21.2	55.3	34.7	36.0	36.2	36.7
PaliGemma2-3B	33.6	63.0	71.6	60.1	30.3	51.7
Phi-3.5-Vision-4B	81.8	72.0	69.3	59.9	43.0	65.2
SmolVLM2-2B	68.8	73.2	80.0	72.9	42.0	67.4
InternVL2.5-4B	84.0	76.8	91.6	82.8	52.3	77.5
Qwen2.5VL-3B	84.0	79.6	93.9	79.7	49.6	77.4
MaTVLM-3B	20.0	53.2	33.0	35.1	34.4	35.1
Cobra-3B	17.9	47.9	24.0	30.7	31.5	30.4
InfiniteVL-4B	82.0	78.5	91.7	79.8	44.0	75.2
Roa’ya-VL-3B (ours)	77.80	61.68	83.98*	60.9	40.44	64.96

Citation

@misc{roaya-vl-2025,
  title={Roa'ya-VL (رؤيا): A Bilingual Arabic–English Vision-Language Model for Multimodal Understanding},
  author={Yakoub Bazi, Mansour Zuair, Mohamad Mahmoud Al Rahhal},
  year={2025},
  url={https://github.com/yakoubbazi/Roaya-VL}
}

Acknowledgements

LLaVA-NeXT for the base architecture
Qwen2.5 for the language model
DeepSeek-ocr for the extracting (SAM+CLIP) vision encoder

License

This project is licensed under the Apache 2.0 License.

Downloads last month: 10

Safetensors

Model size

3B params

Tensor type

BF16

I64

Model tree for BigData-KSU/Roaya-VL-3B

Base model

Qwen/Qwen2.5-3B

Finetuned

Qwen/Qwen2.5-3B-Instruct

Finetuned

(1276)

this model