Instructions to use NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct with libraries, inference providers, notebooks, and local apps.

Libraries

How to use NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct", dtype="auto")
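
The pipeline above can be called with a chat-style message list that pairs one image with one text prompt. The following is a minimal sketch: the helper names `build_messages` and `run_ocr` and the default `max_new_tokens` value are illustrative assumptions, and the image URL and prompt are supplied by the caller.

```python
from typing import Any

MODEL_ID = "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct"


def build_messages(image_url: str, prompt: str) -> list[dict[str, Any]]:
    """Build a single-turn chat message pairing one image with one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def run_ocr(image_url: str, prompt: str, max_new_tokens: int = 2000) -> Any:
    """Run the pipeline on one image (hypothetical helper; downloads weights on first use)."""
    # Imported lazily so build_messages stays usable without transformers installed.
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model=MODEL_ID)
    return pipe(text=build_messages(image_url, prompt), max_new_tokens=max_new_tokens)
```

Calling `run_ocr(your_image_url, your_prompt)` returns the raw pipeline result; inspect it before relying on a particular output shape.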

Local Apps

vLLM

How to use NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
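
The curl call above sends a text-only completion request, which does not exercise the OCR capability. A request that carries an image would normally go to the chat endpoint of the OpenAI-compatible API instead. The sketch below is one way to do that from Python using only the standard library; the helper names `build_chat_payload` and `transcribe`, the PNG MIME type, and the `temperature` of 0 are assumptions made for illustration.

```python
import base64
import json
import urllib.request

MODEL_ID = "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct"


def build_chat_payload(image_bytes: bytes, prompt: str, max_tokens: int = 2000) -> dict:
    """Build an OpenAI-style chat request embedding the image as a base64 data URL."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumes the image is a PNG; adjust the MIME type for other formats.
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # assumption: deterministic decoding suits transcription
    }


def transcribe(image_path: str, prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to a running server and return the generated text."""
    with open(image_path, "rb") as f:
        payload = build_chat_payload(f.read(), prompt)
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Because the SGLang server described further down also exposes an OpenAI-compatible API, the same payload should work there by passing `base_url="http://localhost:30000"`.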

Use Docker

docker model run hf.co/NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct

SGLang

How to use NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct with Docker Model Runner:
```
docker model run hf.co/NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct
```

QARI-OCR v0.3: Structural Arabic Document Understanding

Model Description

QARI-OCR v0.3 is a specialized vision-language model fine-tuned for Arabic Optical Character Recognition with a focus on structural document understanding.
Built on Qwen2-VL-2B-Instruct, this model excels at preserving document layouts, HTML tags, and formatting while transcribing Arabic text.
It is described in detail in the paper QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation.

Key Features

📐 Layout-Aware Recognition: Preserves document structure with HTML/Markdown tags
🔤 Full Diacritics Support: Accurate recognition of tashkeel (Arabic diacritical marks)
📝 Multi-Font Handling: Trained on 12 diverse Arabic fonts (14px-100px)
🎯 Structure-First Design: Optimized for documents with headers, body text, and complex layouts
⚡ Efficient Training: Only 11 hours on a single GPU with 10k samples
🖼️ Robust Performance: Handles low-resolution and degraded images

Model Performance

| Metric | Score |
|---|---|
| Character Error Rate (CER) | 0.300 |
| Word Error Rate (WER) | 0.485 |
| BLEU Score | 0.545 |
| Training Time | 11 hours |
| CO₂ Emissions | 1.88 kg eq. |
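
For readers unfamiliar with the first two metrics, CER and WER are conventionally defined as the Levenshtein edit distance between the prediction and the reference, divided by the reference length in characters or words. The sketch below implements that standard definition only. The card does not state the exact evaluation script or text normalization (for example, whether HTML tags or diacritics are counted), so this should not be read as the authors' procedure.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition, a score of 0.300 corresponds to roughly three character edits per ten reference characters.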

Comparative Strengths

While QARI v0.2 achieves better raw text accuracy (CER: 0.061), QARI v0.3 excels in:

✅ HTML/Markdown structure preservation
✅ Document layout understanding
✅ Handwritten text recognition (initial capabilities)
✅ 5x faster training than v0.2

How to Use

Try Qari - Google Colab

You can load this model using the transformers and qwen_vl_utils libraries:

!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes

from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
import os
from qwen_vl_utils import process_vision_info



model_name = "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."
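# Placeholder input: replace "page.png" with the path to your own page image.
image = Image.open("page.png")
# Save a temporary copy that the message below references by path.
src = "image.png"
image.save(src)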

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to whichever device the model was placed on.
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
os.remove(src)
print(output_text)
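
Since the model is described as preserving HTML tags in its transcription, downstream code that only needs plain text may want to strip them. The following is a minimal sketch using Python's standard `html.parser`; the helper name `html_to_text` and the set of tags treated as line breaks are assumptions, as the card does not list which tags the model emits.

```python
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries and should become line breaks.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "br", "li", "tr"}


class _TextExtractor(HTMLParser):
    """Collect text content, inserting newlines around block-level tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return the text of an HTML fragment, one non-empty line per block."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

For example, `html_to_text(output_text)` would reduce a fragment such as `<h1>عنوان</h1><p>نص</p>` to two plain lines, one for the heading and one for the paragraph.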

Training Details

Base Model: Qwen2-VL-2B-Instruct
Training Data: 10,000 synthetic Arabic documents with HTML markup
Optimization: 4-bit LoRA adapters (rank=16)
Hardware: Single NVIDIA A6000 GPU (48GB)
Framework: Unsloth + Hugging Face TRL
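
To give a sense of why rank-16 LoRA adapters are cheap to train, the arithmetic below counts the parameters a LoRA pair adds to one dense layer: a rank-r adapter on a layer with `d_in` inputs and `d_out` outputs adds two small matrices with `r * (d_in + d_out)` weights in total. The 1536 x 1536 layer size is an illustrative assumption, not a figure from this card, and the card does not state which layers were adapted.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_params(d_in: int, d_out: int) -> int:
    """Weights in the frozen dense layer (bias ignored)."""
    return d_in * d_out


# Hypothetical layer size chosen only for illustration.
D_IN, D_OUT, RANK = 1536, 1536, 16

added = lora_added_params(D_IN, D_OUT, RANK)
frozen = dense_params(D_IN, D_OUT)
print(f"adapter weights: {added}, frozen weights: {frozen}, ratio: {added / frozen:.4f}")
```

For this hypothetical layer the adapter holds about 2% as many weights as the frozen matrix it modifies.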

BibTeX:

@article{wasfy2025qari,
  title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
  author={Wasfy, Ahmed and Nacar, Omer and Elkhateb, Abdelakreem and Reda, Mahmoud and Elshehy, Omar and Ammar, Adel and Boulila, Wadii},
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
}