---
base_model:
- unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit
tags:
- transformers
- unsloth
- qwen2_vl
- trl
- ocr
license: apache-2.0
language:
- ar
metrics:
- bleu
- wer
- cer
pipeline_tag: image-text-to-text
library_name: peft
---

# Qari-OCR-0.1-VL-2B-Instruct Model

## Model Overview

This model is a fine-tuned version of [unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit](https://huggingface.co/unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit), trained on an Arabic OCR dataset and optimized for full-page Arabic Optical Character Recognition (OCR).

The model is described in detail in the paper [QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation](https://huggingface.co/papers/2506.02295).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/HuUcfziXcDT_2kwDoz5qH.png)

## Model Details
- **Base Model**: Qwen2 VL 2B Instruct (`unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit`)
- **Fine-tuning Dataset**: Arabic OCR dataset
- **Objective**: Extract full-page Arabic text with high accuracy
- **Languages**: Arabic
- **Tasks**: OCR (Optical Character Recognition)
- **Dataset size**: 5000 records
- **Epochs**: 1

## Performance Evaluation

The model has been evaluated on standard OCR metrics, including Word Error Rate (WER), Character Error Rate (CER), and BLEU score.
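As a rough illustration of what these error rates measure, the sketch below computes WER and CER from a Levenshtein edit distance, normalised by the length of the reference. This is the common textbook definition, not necessarily the exact evaluation script used for this model; the function names `edit_distance`, `wer`, and `cer` are chosen here for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "السلام عليكم ورحمة الله"
    hypothesis = "السلام عليكم ورحمه الله"  # one character differs in the third word
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because both rates are normalised by the reference length, a system that inserts a lot of extra text can score above 1.0, which is consistent with the base-model row in the table below.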

### Metrics

| Model | WER ↓ | CER ↓ | BLEU ↑ |
|-------|-------|-------|--------|
| Qari v0.1 Model | 0.068 | 0.019 | 0.860 |
| Qwen2 VL 2B | 1.344 | 1.191 | 0.201 |
| EasyOCR | 0.908 | 0.617 | 0.152 |
| Tesseract OCR | 0.428 | 0.226 | 0.410 |


### Key Results

- **WER:** 0.068 (93.2% word accuracy)
- **CER:** 0.019 (98.1% character accuracy)
- **BLEU:** 0.860

### Performance Comparison

Compared with the base model (Qwen2 VL 2B) and the other OCR baselines in the table above, the fine-tuned model achieves:
- 95% reduction in WER compared to the base model
- 98% reduction in CER compared to the base model
- 328% improvement in BLEU score compared to the base model
- 84% lower WER than Tesseract OCR
- 92% lower WER than EasyOCR
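The percentages above can be reproduced from the metrics table. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper names `relative_reduction` and `relative_gain` are illustrative, and the inputs are the rounded values from the table, so the results agree with the listed figures only to within rounding.

```python
# Scores copied from the metrics table (already rounded to three decimals).
results = {
    "Qari v0.1": {"wer": 0.068, "cer": 0.019, "bleu": 0.860},
    "Qwen2 VL 2B": {"wer": 1.344, "cer": 1.191, "bleu": 0.201},
    "EasyOCR": {"wer": 0.908, "cer": 0.617, "bleu": 0.152},
    "Tesseract OCR": {"wer": 0.428, "cer": 0.226, "bleu": 0.410},
}


def relative_reduction(baseline, ours):
    """Percentage by which an error rate drops relative to the baseline."""
    return (baseline - ours) / baseline * 100


def relative_gain(baseline, ours):
    """Percentage by which a score rises relative to the baseline."""
    return (ours - baseline) / baseline * 100


qari = results["Qari v0.1"]
base = results["Qwen2 VL 2B"]

print(f"WER reduction vs base model:  {relative_reduction(base['wer'], qari['wer']):.1f}%")
print(f"CER reduction vs base model:  {relative_reduction(base['cer'], qari['cer']):.1f}%")
print(f"BLEU gain vs base model:      {relative_gain(base['bleu'], qari['bleu']):.1f}%")
for name in ("Tesseract OCR", "EasyOCR"):
    drop = relative_reduction(results[name]["wer"], qari["wer"])
    print(f"WER reduction vs {name}: {drop:.1f}%")
```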

## Performance Comparison Charts

### WER & CER Comparison

<img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/8fk27_Xs_V60WyTLlu31N.png" width="400px"/>

### BLEU Score Comparison


<img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/vFvN7REyy-jfgulwoC6Yy.png" width="400px"/>

## Limitations  

While the Arabic OCR model demonstrates strong performance under specific conditions, it has several limitations:  

1. **Font Dependency**: The model was trained using a limited set of fonts (*Almarai-Regular, Amiri-Regular, Cairo-Regular, Tajawal-Regular, and NotoNaskhArabic-Regular*). As a result, its accuracy may degrade when processing text in other fonts, particularly decorative or stylized typefaces.  

2. **Font Size Restriction**: Training was conducted with a fixed font size of *16*. Variations in font size, especially very small or large text, may reduce recognition accuracy.  

3. **Diacritics Exclusion**: The model does not support Arabic diacritics (*Tashkeel*). Text that relies on diacritics for disambiguation may not be correctly recognized.  

4. **Lack of Handwriting Support**: The model is not trained to recognize handwritten text, limiting its applicability to printed documents only.  

5. **Full-Page Processing**: The model was trained on full-page text recognition, which may impact its performance on segmented text, cropped sections, or text within complex layouts such as tables and multi-column formats.  

These limitations should be considered when deploying the model in real-world applications to ensure optimal performance.
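Because the model does not produce diacritics, one practical option when comparing its output against text that contains *Tashkeel* is to strip the marks from that text first. The sketch below is only an illustration under that assumption, not part of the model's pipeline; `strip_tashkeel` is a hypothetical helper, and the character set covers the common marks U+064B to U+0652 plus the superscript alef U+0670.

```python
import re

# Common Arabic diacritics: fathatan through sukun (U+064B-U+0652)
# and the superscript alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text):
    """Remove Arabic diacritic marks, leaving the base letters untouched."""
    return TASHKEEL.sub("", text)


if __name__ == "__main__":
    # Kaf, teh, beh, each followed by a fatha mark.
    vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(strip_tashkeel(vocalised))
```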



## How to Use

[Try Qari - Google Colab](https://colab.research.google.com/github/NAMAA-ORG/public-notebooks/blob/main/Qari_Free_Colab.ipynb)

You can load this model using the `transformers` and `qwen_vl_utils` libraries. In a notebook, install the dependencies first:
```
!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes
```

```python
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
import os
from qwen_vl_utils import process_vision_info



model_name = "NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

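prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

# Assumption: "page.png" is a placeholder; point it at your own page image.
image = Image.open("page.png")
src = "image.png"  # temporary copy that is passed to the model and deleted afterwards
image.save(src)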

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
os.remove(src)  # delete the temporary image copy
print(output_text)

```
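When transcribing several pages, it can be convenient to build the chat payload in a small helper. The sketch below is one way to do that; `build_messages` is a hypothetical name, and it assumes a POSIX-style path. It resolves the path to an absolute one so the `file://` reference does not depend on the current working directory.

```python
from pathlib import Path


def build_messages(image_path, prompt):
    """Return a single-turn chat payload pairing one page image with the prompt."""
    # Absolute path so the reference works regardless of the working directory.
    image_uri = f"file://{Path(image_path).resolve()}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_uri},
                {"type": "text", "text": prompt},
            ],
        }
    ]


if __name__ == "__main__":
    messages = build_messages("image.png", "Return the plain text of this page.")
    print(messages[0]["content"][0]["image"])
```

The returned list has the same structure as the `messages` variable in the example above, so it can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way.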

## License
This model is released under the Apache 2.0 license, following the licensing terms of the original Qwen2 VL model. Please review the terms before using it commercially.

## Citation

If you use this model in your research, please cite:

```
@article{wasfy2025qari,
  title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
  author={Wasfy, Ahmed and Nacar, Omer and Elkhateb, Abdelakreem and Reda, Mahmoud and Elshehy, Omar and Ammar, Adel and Boulila, Wadii},
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
}
```