Omartificial-Intelligence-Space committed on
Commit c51b483 · verified · 1 Parent(s): f2e3648

Update README.md

Files changed (1): README.md (+109 -10)

README.md CHANGED
@@ -2,25 +2,125 @@
  library_name: transformers
  tags:
  - image-to-text
+ license: apache-2.0
+ datasets:
+ - NAMAA-Space/QariOCR-v0.3-markdown-mixed-dataset
+ language:
+ - ar
+ metrics:
+ - wer
+ - cer
+ - bleu
+ pipeline_tag: image-text-to-text
  ---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
- This model is designed for Arabic Optical Character Recognition (OCR).
-
- ## Model Details
-
- ### Model Description
-
- ### Model Sources [optional]
-
- - **Repository:** [More Information Needed]
- - **Paper:** [QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation](https://huggingface.co/papers/2506.02295)
 
+ # QARI-OCR v0.3: Structural Arabic Document Understanding
+
+ <div align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/Txz_HjVy6NsdmcghXqVH_.png" alt="QARI Logo" width="400">
+ </div>
+
+ ## Model Description
+
+ QARI-OCR v0.3 is a specialized vision-language model fine-tuned for Arabic Optical Character Recognition (OCR) with a focus on **structural document understanding**. Built on Qwen2-VL-2B-Instruct, it excels at preserving document layouts, HTML tags, and formatting while transcribing Arabic text.
+
+ ### Key Features
+
+ - 📄 **Layout-Aware Recognition**: Preserves document structure with HTML/Markdown tags
+ - 🔤 **Full Diacritics Support**: Accurate recognition of tashkeel (Arabic diacritical marks)
+ - 📏 **Multi-Font Handling**: Trained on 12 diverse Arabic fonts (14px-100px)
+ - 🎯 **Structure-First Design**: Optimized for documents with headers, body text, and complex layouts
+ - ⚡ **Efficient Training**: Only 11 hours on a single GPU with 10k samples
+ - 🖼️ **Robust Performance**: Handles low-resolution and degraded images
+ ## Model Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Character Error Rate (CER)** | 0.300 |
+ | **Word Error Rate (WER)** | 0.485 |
+ | **BLEU Score** | 0.545 |
+ | **Training Time** | 11 hours |
+ | **CO₂ Emissions** | 1.88 kg eq. |
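+
+ If you want to score the model on your own data, error rates like those above can be computed with standard text-metric tooling. A minimal sketch, assuming the `evaluate` library (with `jiwer` installed for WER/CER) and placeholder `predictions`/`references` lists; this is not the authors' evaluation script:
+
+ ```python
+ # Illustrative metric computation; not the original evaluation code.
+ import evaluate
+
+ cer = evaluate.load("cer")    # character error rate
+ wer = evaluate.load("wer")    # word error rate
+ bleu = evaluate.load("bleu")  # n-gram precision score
+
+ predictions = ["model transcription here"]  # placeholder OCR outputs
+ references = ["ground truth here"]          # placeholder reference texts
+
+ print("CER :", cer.compute(predictions=predictions, references=references))
+ print("WER :", wer.compute(predictions=predictions, references=references))
+ print("BLEU:", bleu.compute(predictions=predictions,
+                             references=[[r] for r in references])["bleu"])
+ ```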
+
+ ### Comparative Strengths
+
+ While QARI v0.2 achieves better raw text accuracy (CER: 0.061), QARI v0.3 excels in:
+
+ - ✅ **HTML/Markdown structure preservation**
+ - ✅ **Document layout understanding**
+ - ✅ **Handwritten text recognition** (initial capabilities)
+ - ✅ **5x faster training** than v0.2
+
+ ## How to Use
+
+ [Try Qari - Google Colab](https://colab.research.google.com/github/NAMAA-ORG/public-notebooks/blob/main/Qari_Free_Colab.ipynb)
+
+ You can load this model using the `transformers` and `qwen_vl_utils` libraries:
+
+ ```
+ !pip install -U transformers qwen_vl_utils peft "accelerate>=0.26.0"
+ !pip install -U bitsandbytes
+ ```
+
+ ```python
+ from PIL import Image
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+ import torch
+ import os
+
+ model_name = "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct"
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_name)
+ max_tokens = 2000
+
+ prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."
+
+ # Load your page image and save a local copy that can be passed by file:// URI.
+ image = Image.open("path/to/your/page.png")
+ src = "image.png"
+ image.save(src)
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": f"file://{src}"},
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to("cuda")
+
+ # Generate, then strip the prompt tokens from the output.
+ generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )[0]
+ os.remove(src)  # remove the temporary copy
+ print(output_text)
+ ```
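+
+ Since v0.3 is tuned to emit HTML/Markdown structure, `output_text` may contain markup. If you only need plain text, here is a minimal post-processing sketch using the Python standard library; it assumes typical HTML-tagged output and is not an official utility of this repo:
+
+ ```python
+ # Illustrative: flatten HTML-tagged OCR output to plain text.
+ from html.parser import HTMLParser
+
+ class TextExtractor(HTMLParser):
+     def __init__(self):
+         super().__init__()
+         self.chunks = []
+
+     def handle_data(self, data):
+         # Collect the text between tags.
+         self.chunks.append(data)
+
+ extractor = TextExtractor()
+ extractor.feed(output_text)  # `output_text` from the snippet above
+ plain_text = " ".join(c.strip() for c in extractor.chunks if c.strip())
+ print(plain_text)
+ ```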
+
+ ## Training Details
+
+ - Base Model: Qwen2-VL-2B-Instruct
+ - Training Data: 10,000 synthetic Arabic documents with HTML markup
+ - Optimization: 4-bit LoRA adapters (rank=16); see the loading sketch below
+ - Hardware: Single NVIDIA A6000 GPU (48GB)
+ - Framework: Unsloth + Hugging Face TRL
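+
+ The usage snippet above loads full-precision weights. Since the install step pulls in `bitsandbytes` and training used 4-bit adapters, loading the model in 4-bit for inference is a natural variant. A hedged sketch using the standard `transformers` quantization config, not the authors' script:
+
+ ```python
+ # Illustrative 4-bit load via bitsandbytes; these parameters are common
+ # defaults, not values confirmed by this model card.
+ import torch
+ from transformers import (
+     Qwen2VLForConditionalGeneration,
+     AutoProcessor,
+     BitsAndBytesConfig,
+ )
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.float16,
+ )
+
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct")
+ ```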
 
  **BibTeX:**

@@ -32,5 +132,4 @@ This model is designed for Arabic Optical Character Recognition (OCR).
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
  }
- ```
-
+ ```