---
library_name: transformers
tags:
- image-to-text
license: apache-2.0
datasets:
- NAMAA-Space/QariOCR-v0.3-markdown-mixed-dataset
language:
- ar
metrics:
- wer
- cer
- bleu
pipeline_tag: image-text-to-text
---

# QARI-OCR v0.3: Structural Arabic Document Understanding

<div align="center">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/Txz_HjVy6NsdmcghXqVH_.png" alt="QARI Logo" width="400">
</div>

## Model Description

QARI-OCR v0.3 is a specialized vision-language model fine-tuned for Arabic Optical Character Recognition with a focus on **structural document understanding**. Built on Qwen2-VL-2B-Instruct, the model preserves document layouts, HTML tags, and formatting while transcribing Arabic text. It is described in detail in the paper [QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation](https://huggingface.co/papers/2506.02295).

### Key Features

- πŸ“ **Layout-Aware Recognition**: Preserves document structure with HTML/Markdown tags
- πŸ”€ **Full Diacritics Support**: Accurate recognition of tashkeel (Arabic diacritical marks)
- πŸ“ **Multi-Font Handling**: Trained on 12 diverse Arabic fonts (14px-100px)
- 🎯 **Structure-First Design**: Optimized for documents with headers, body text, and complex layouts
- ⚑ **Efficient Training**: Only 11 hours on single GPU with 10k samples
- πŸ–ΌοΈ **Robust Performance**: Handles low-resolution and degraded images

## Model Performance

| Metric | Score |
|--------|-------|
| **Character Error Rate (CER)** | 0.300 |
| **Word Error Rate (WER)** | 0.485 |
| **BLEU Score** | 0.545 |
| **Training Time** | 11 hours |
| **CO₂ Emissions** | 1.88 kg eq. |
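
These scores are as reported on this card. To evaluate the model on your own test set, the same metrics can be computed with off-the-shelf packages; the sketch below assumes `jiwer` for CER/WER and `evaluate` for BLEU, which are illustrative choices rather than the authors' tooling:

```python
# Metric sketch; assumes `pip install jiwer evaluate`.
# jiwer / evaluate are illustrative choices, not necessarily what the authors used.
import jiwer
import evaluate

references = ["النص المرجعي للصفحة"]        # ground-truth transcriptions
predictions = ["النص الذي أخرجه النموذج"]   # model outputs for the same pages

cer = jiwer.cer(references, predictions)  # character error rate
wer = jiwer.wer(references, predictions)  # word error rate
bleu = evaluate.load("bleu").compute(
    predictions=predictions,
    references=[[r] for r in references],  # BLEU accepts multiple references per sample
)["bleu"]

print(f"CER={cer:.3f}  WER={wer:.3f}  BLEU={bleu:.3f}")
```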

### Comparative Strengths

While QARI v0.2 achieves better raw text accuracy (CER: 0.061), QARI v0.3 excels in:
- ✅ **HTML/Markdown structure preservation**
- ✅ **Document layout understanding**
- ✅ **Handwritten text recognition** (initial capabilities)
- ✅ **5× faster training** than v0.2

## How to Use

[Try Qari - Google Colab](https://colab.research.google.com/github/NAMAA-ORG/public-notebooks/blob/main/Qari_Free_Colab.ipynb)

First install the dependencies, then load the model with the `transformers` and `qwen_vl_utils` libraries:
```
!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft bitsandbytes
```

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = (
    "Below is the image of one page of a document, as well as some raw textual "
    "content that was previously extracted for it. Just return the plain text "
    "representation of this document as if you were reading it naturally. "
    "Do not hallucinate."
)
src = "image.png"  # path to the document image you want to transcribe

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Build the chat-formatted prompt and preprocess the image.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens so only the transcription remains.
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
```
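
The install step above pulls in `bitsandbytes`, which the snippet does not use. If GPU memory is tight, the same model can likely be loaded in 4-bit via the standard `BitsAndBytesConfig` API in `transformers`; this quantized variant is a sketch, not a configuration published by the authors:

```python
# Optional: 4-bit quantized loading to reduce GPU memory.
# Illustrative variant using the bitsandbytes package installed above;
# output quality in 4-bit is not benchmarked on this card.
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct")
# The message building and generation steps are identical to the snippet above.
```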

You can also try v0.3 directly in this [Google Colab notebook](https://colab.research.google.com/github/NAMAA-ORG/public-notebooks/blob/main/Qari_V0_3_Free_Colab.ipynb).

## Training Details

- Base Model: Qwen2-VL-2B-Instruct
- Training Data: 10,000 synthetic Arabic documents with HTML markup
- Optimization: 4-bit LoRA adapters (rank=16)
- Hardware: Single NVIDIA A6000 GPU (48GB)
- Framework: Unsloth + Hugging Face TRL
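
The card describes the recipe only at this level of detail. As a rough sketch, a comparable 4-bit LoRA setup could be expressed with `peft` as follows; the authors actually trained with Unsloth + TRL, and everything here other than the rank-16 LoRA and 4-bit base is an assumption:

```python
# Rough PEFT equivalent of the stated setup (4-bit base + LoRA rank 16).
# Target modules, alpha, and dropout are assumptions, not from the paper.
import torch
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",          # base model stated in the card
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, LoraConfig(
    r=16,                                  # rank stated in the card
    lora_alpha=16,                         # assumed
    lora_dropout=0.0,                      # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()
```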


## Citation

**BibTeX:**

```
@article{wasfy2025qari,
  title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
  author={Wasfy, Ahmed and Nacar, Omer and Elkhateb, Abdelakreem and Reda, Mahmoud and Elshehy, Omar and Ammar, Adel and Boulila, Wadii},
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
}
```