# Arabic Menu OCR v2 (Qwen2.5-VL)
This is a specialized Vision-Language Model (VLM) fine-tuned exclusively for extracting food/drink items and prices from Arabic restaurant menus. It processes raw menu images and outputs a clean, structured JSON object, making it well suited to automated data entry, food-delivery apps, and digital menu generation.
This model is a fully merged version of the powerful Qwen2.5-VL-3B-Instruct, fine-tuned using the LLaMA-Factory framework.
## Model Capabilities
- Accurate Arabic OCR: Specialized in reading complex Arabic typography, handwriting, and heavily stylized restaurant menus.
- Structured Output (JSON): Trained to emit a strict JSON format containing exact item names and prices.
- Smart Filtering: Automatically ignores categories, restaurant names, phone numbers, and items without prices.
- Variant Handling: Splits multi-size items natively (e.g., outputs "Small Pizza" and "Large Pizza" as distinct items rather than lumping them together).
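To illustrate the output shape and the variant handling described above, the snippet below parses a hypothetical model response (the item names and prices are illustrative, not real model output):

```python
import json

# Hypothetical model output for a menu listing one pizza in two sizes
# (names and prices are illustrative, not a real model response)
raw_output = """
{
  "items": [
    {"name": "بيتزا مارجريتا صغير", "price": "50"},
    {"name": "بيتزا مارجريتا كبير", "price": "75"}
  ]
}
"""

data = json.loads(raw_output)
for item in data["items"]:
    print(item["name"], "-", item["price"])
```

Note that the two sizes come back as separate entries in the `items` array, each with its own price string.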
## Requirements & Installation
Because this model uses the Qwen2.5-VL architecture, you must install the official Qwen vision utilities alongside transformers to process image inputs correctly.
```shell
pip install -U torch torchvision transformers accelerate optimum vllm
pip install "qwen-vl-utils[decord]==0.0.8"
```
## Usage (Standard Transformers)
The prompt must match the exact formatting used during training. Use the standard pipeline below for accurate generation.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_id = "mohamedashraff22/arabic-menu-ocr-v2"

# 1. Load the model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# 2. Strict training prompt
prompt = """Extract every food or drink item and its price from this menu image.
Keep item names and prices exactly as written in the image.
If an item has multiple sizes, list each size as a separate entry.
Only include items that have a visible price. Skip anything without a price.
Return a JSON object with one key "items" containing an array. Each item has "name" (string) and "price" (string)."""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/menu_image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

# 3. Process & format inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# 4. Generate output
with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=False,
        repetition_penalty=1.15,  # recommended to prevent looping on dense menus
    )
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output)
```
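In practice, generative models sometimes wrap their JSON in a markdown code fence. The helper below is a post-processing sketch (not part of the model or its API) that extracts the `items` array from the raw text either way:

```python
import json
import re

def parse_menu_output(text: str) -> list[dict]:
    """Extract the "items" array from the model's raw text output.

    Handles responses that wrap the JSON in a ```json ... ``` fence.
    Raises ValueError if no JSON object is found.
    """
    # Strip an optional markdown code fence around the JSON object
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    payload = fenced.group(1) if fenced else text
    # Fall back to the outermost braces
    start, end = payload.find("{"), payload.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    return json.loads(payload[start:end + 1])["items"]

# Example with a fenced response
items = parse_menu_output('```json\n{"items": [{"name": "شاي", "price": "15"}]}\n```')
print(items)  # [{'name': 'شاي', 'price': '15'}]
```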
## High-Speed Usage (vLLM)
For production environments and bulk processing, we highly recommend vLLM, which uses PagedAttention for much higher throughput.
```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_id = "mohamedashraff22/arabic-menu-ocr-v2"
prompt = "..."  # (use the exact same prompt string as above)

# Load processor & engine
processor = AutoProcessor.from_pretrained(model_id)
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.60,
)

# Format the chat prompt for vLLM
vllm_messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
formatted_prompt = processor.apply_chat_template(vllm_messages, add_generation_prompt=True, tokenize=False)

# Greedy sampling settings
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=5000,
    repetition_penalty=1.15,
)

# Run inference (vLLM expects a loaded image object, not a file:// URI)
image = Image.open("/path/to/your/menu_image.jpg")
outputs = llm.generate(
    {
        "prompt": formatted_prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
## Model Details
- Base Architecture: Qwen2.5-VL-3B-Instruct
- Parameters: ~3 Billion
- Precision: bfloat16
- Languages: Arabic (primary target), English (secondary)
- Training Method: LoRA fine-tuning (rank 16, all layers targeted), fully merged into the base weights
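Because the prompt keeps prices exactly as written, menus printed with Eastern Arabic numerals return strings like "٥٠" rather than "50". The helper below is an optional post-processing step for your own pipeline (an assumption, not part of the model) that normalizes those digits to ASCII:

```python
# Map Eastern Arabic digits (U+0660–U+0669) to ASCII digits
ARABIC_DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def normalize_price(price: str) -> str:
    """Convert Eastern Arabic numerals in a price string to ASCII digits."""
    return price.translate(ARABIC_DIGITS)

print(normalize_price("٥٠"))    # 50
print(normalize_price("٧٥.٥"))  # 75.5
print(normalize_price("120"))   # 120 (ASCII digits pass through unchanged)
```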