AhmedZaky1 committed
Commit 18c5acb · verified · 1 Parent(s): 2dea9c9

Initial upload: Fine-tuned Qwen2.5-VL Arabic OCR model

Files changed (2)
  1. README.md +149 -111
  2. adapter_model.safetensors +1 -1
README.md CHANGED
@@ -1,194 +1,232 @@
  ---
- base_model: AhmedZaky1/DIMI-Arabic-OCR
- library_name: peft
  language:
  - ar
- pipeline_tag: image-text-to-text
  tags:
- - vision
  - ocr
  - arabic
  - qwen2.5-vl
- - lora
  - unsloth
- - trl
- - transformers
- license: apache-2.0
  datasets:
  - oddadmix/qari-0.2.2-news-dataset-large
  - oddadmix/qari-0.2.2-diacritics-dataset-large
  metrics:
  - wer
  - cer
  ---

- # DIMI Arabic OCR v2
-
- <div align="center">
-
- <img src="https://cdn-uploads.huggingface.co/production/uploads/65fb3ac20cfe262da2bb0fcc/uOuEn0LNhSVEBbOLwfFUu.jpeg" width="300"/>
-
- *Accurate Arabic OCR model V2 for extracting printed Arabic text from images*
-
- </div>

  ## Model Description

- **DIMI Arabic OCR v2** is a specialized Arabic Optical Character Recognition model fine-tuned on **Qwen2.5-VL-7B-Instruct** using LoRA adapters. This is the **second iteration**, building upon v1 with improved diacritics handling and enhanced accuracy across diverse Arabic text scenarios.
-
- - **Developed by:** Ahmed Zaky
- - **Base Model:** AhmedZaky1/DIMI-Arabic-OCR (v1)
- - **Original Base:** Qwen/Qwen2.5-VL-7B-Instruct
- - **Model Type:** Vision-Language Model (VLM) for Arabic OCR
- - **Language:** Arabic (ar)
  - **License:** Apache 2.0
- - **Fine-tuning Method:** LoRA (Low-Rank Adaptation) with 4-bit quantization
-
- ### Key Improvements Over v1
-
- ✅ **30% reduction in WER** on diacritics-heavy text
- ✅ **Enhanced training dataset** with balanced diacritics representation
- ✅ **Improved generalization** across news articles and formal documents
- ✅ **Better preservation** of text formatting and structure
-
- ## 📊 Performance Metrics
-
- ### Test Set Results (500 samples from 2,600)
-
- | Metric | Score | Description |
- |--------|-------|-------------|
- | **WER** | 0.3049 | Word Error Rate (↓ lower is better) |
- | **CER** | 0.1119 | Character Error Rate (↓ lower is better) |
- | **Perfect Predictions** | 23% | Exact matches with ground truth |
-
- ### Validation Set Results (100 samples)
-
- | Metric | Score |
- |--------|-------|
- | **WER** | 0.2315 |
- | **CER** | 0.0776 |
-
- ### Comparison with v1
-
- | Model | Test WER | Test CER | Val WER | Val CER |
- |-------|----------|----------|---------|---------|
- | **v1** | 0.404 | 0.226 | - | - |
- | **v2** | **0.3049** ↓ | **0.1119** ↓ | **0.2315** | **0.0776** |
-
- **Improvements:**
- - **WER reduced by ~24.5%** (0.404 → 0.3049)
- - **CER reduced by ~50.5%** (0.226 → 0.1119)

- ## 🎯 Intended Use
-
- ### Direct Use
-
- This model is designed for extracting Arabic text from images, including:
- - 📰 News articles and printed documents
- - 📝 Formal Arabic text with diacritics (تشكيل)
- - 🔢 Mixed Arabic text and numbers
- - 📄 Scanned documents and screenshots
-
- ### Example Use Case
  ```python
  from unsloth import FastVisionModel
  from PIL import Image
  import torch

- # Load model
  model, tokenizer = FastVisionModel.from_pretrained(
-     "AhmedZaky1/DIMI-Arabic-OCR-v2",
      load_in_4bit=True,
-     device_map="auto"
  )
  FastVisionModel.for_inference(model)

- # Load image
- image = Image.open("arabic_document.jpg")

- # Prepare prompt
- instruction = "استخرج النص العربي والأرقام الموجودة في هذه الصورة بدقة عالية."

  messages = [
      {
          "role": "user",
          "content": [
-             {"type": "image", "image": image},
-             {"type": "text", "text": instruction},
-         ],
      }
  ]

  # Apply chat template
- text = tokenizer.apply_chat_template(
-     messages, tokenize=False, add_generation_prompt=True
- )

- # Tokenize
  inputs = tokenizer(
-     text=[text],
-     images=[image],
-     padding=True,
      return_tensors="pt",
-     truncation=False
  ).to("cuda")

- # Generate
- with torch.inference_mode():
      outputs = model.generate(
          **inputs,
-         max_new_tokens=2048,
-         do_sample=False
      )

- # Decode
- generated_ids = [
-     out[len(inp):] for inp, out in zip(inputs.input_ids, outputs)
- ]
- prediction = tokenizer.batch_decode(
-     generated_ids,
-     skip_special_tokens=True
- )[0]

  print(prediction)
  ```

- ## 🧾 Training Data
-
- Fine-tuned on **11,000 Arabic text images** combining:
- 1. [oddadmix/qari-0.2.2-news-dataset-large](https://huggingface.co/datasets/oddadmix/qari-0.2.2-news-dataset-large)
- 2. [oddadmix/qari-0.2.2-diacritics-dataset-large](https://huggingface.co/datasets/oddadmix/qari-0.2.2-diacritics-dataset-large)
-
- The dataset covers Modern Standard Arabic with and without diacritics.
-
- ---
-
- ## 📚 Citation
-
- If you use this model, please cite:
-
  ```bibtex
- @misc{dimi-arabic-ocr-2025,
-   author = {Ahmed Zaky},
-   title = {DIMI-Arabic-OCR: Fine-tuned Qwen2.5-VL for Arabic Text Recognition},
    year = {2025},
    publisher = {Hugging Face},
-   howpublished = {\url{https://huggingface.co/AhmedZaky1/DIMI-Arabic-OCR}}
  }
  ```

- ---
-
- ### 🔗 Related Projects
- - [DIMI Models Series](https://huggingface.co/AhmedZaky1) — Arabic Vision & Language Models
-
- ---
-
- <div align="center">
-
- **Built with ❤️ by Ahmed Zaky**
-
- *Advancing Arabic NLP through state-of-the-art embedding models*
-
- </div>
  ---
  language:
  - ar
+ license: apache-2.0
  tags:
  - ocr
  - arabic
  - qwen2.5-vl
+ - vision-language-model
  - unsloth
+ - lora
+ - fine-tuned
+ base_model: unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit
  datasets:
  - oddadmix/qari-0.2.2-news-dataset-large
  - oddadmix/qari-0.2.2-diacritics-dataset-large
  metrics:
  - wer
  - cer
+ library_name: transformers
+ pipeline_tag: image-to-text
  ---

+ # Qwen2.5-VL-7B Arabic OCR Fine-tuned
+
+ This model is a fine-tuned version of [unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit) for Arabic Optical Character Recognition (OCR) tasks.

  ## Model Description

+ - **Developed by:** AhmedZaky1 (DIMI Models)
+ - **Model type:** Vision-Language Model (VLM)
+ - **Language(s):** Arabic
  - **License:** Apache 2.0
+ - **Fine-tuned from:** Qwen2.5-VL-7B-Instruct
+ - **Training approach:** LoRA (Low-Rank Adaptation)
+ - **Quantization:** 4-bit with bitsandbytes

+ ## Training Details
+
+ ### Training Data
+
+ The model was fine-tuned on a combination of two high-quality Arabic OCR datasets:
+ - **oddadmix/qari-0.2.2-news-dataset-large**: 13,000 samples of Arabic news text
+ - **oddadmix/qari-0.2.2-diacritics-dataset-large**: 13,000 samples with diacritics
+ - **Total training samples:** ~26,000 images with Arabic text annotations
+
+ ### Training Configuration
+
+ ```
+ - Training epochs: 2
+ - Batch size: 12 (per device)
+ - Gradient accumulation steps: 4
+ - Effective batch size: 48
+ - Learning rate: 3e-4
+ - Optimizer: AdamW 8-bit
+ - LR scheduler: Linear
+ - Weight decay: 0.01
+ - LoRA rank (r): 16
+ - LoRA alpha: 16
+ - Max sequence length: 2048
+ - Warmup steps: 50
+ ```
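+
+ For reference, these hyperparameters map onto Unsloth's vision fine-tuning setup with TRL's `SFTTrainer` roughly as follows. This is an illustrative sketch rather than the exact training script: `converted_dataset` stands in for the conversation-format dataset (see the Training Process section below), and the remaining arguments are assumptions derived from the values listed above.
+
+ ```python
+ # Illustrative training sketch (not the exact script used): Unsloth + TRL SFTTrainer
+ from unsloth import FastVisionModel, is_bf16_supported
+ from unsloth.trainer import UnslothVisionDataCollator
+ from trl import SFTTrainer, SFTConfig
+
+ model, tokenizer = FastVisionModel.from_pretrained(
+     "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
+     load_in_4bit=True,                      # 4-bit quantization
+     use_gradient_checkpointing="unsloth",   # memory-efficient checkpointing
+ )
+ model = FastVisionModel.get_peft_model(
+     model,
+     finetune_vision_layers=True,     # LoRA on vision layers
+     finetune_language_layers=True,   # LoRA on language layers
+     r=16,
+     lora_alpha=16,
+ )
+
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     data_collator=UnslothVisionDataCollator(model, tokenizer),
+     train_dataset=converted_dataset,   # samples in {"messages": [...]} format
+     args=SFTConfig(
+         per_device_train_batch_size=12,
+         gradient_accumulation_steps=4,   # effective batch size 48
+         num_train_epochs=2,
+         learning_rate=3e-4,
+         optim="adamw_8bit",
+         lr_scheduler_type="linear",
+         weight_decay=0.01,
+         warmup_steps=50,
+         max_seq_length=2048,
+         bf16=is_bf16_supported(),
+         fp16=not is_bf16_supported(),
+         remove_unused_columns=False,                    # keep image columns
+         dataset_text_field="",
+         dataset_kwargs={"skip_prepare_dataset": True},
+         output_dir="outputs",
+     ),
+ )
+ trainer.train()
+ ```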
+
+ ### Hardware & Optimization
+
+ - Trained using 4-bit quantization with gradient checkpointing
+ - Optimized with Unsloth for memory efficiency
+ - Compatible with consumer GPUs (tested on a GPU with 16GB+ VRAM)
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install unsloth transformers pillow torch bitsandbytes
+ ```
+
+ ### Quick Start
+
  ```python
+ # IMPORTANT: Import unsloth FIRST, before any transformers imports!
+ import unsloth
  from unsloth import FastVisionModel
  from PIL import Image
  import torch

+ # Load the fine-tuned model
  model, tokenizer = FastVisionModel.from_pretrained(
+     "AhmedZaky1/qwen2.5-vl-7b-arabic-ocr",
      load_in_4bit=True,
+     use_gradient_checkpointing="unsloth",
  )
+
+ # Set model to inference mode
  FastVisionModel.for_inference(model)

+ # Load your image
+ image = Image.open("path_to_your_arabic_image.jpg")

+ # Arabic instruction (customizable): "Extract the Arabic text in this image accurately."
+ instruction = "استخرج النص العربي الموجود في هذه الصورة بدقة."

+ # Prepare the conversation messages
  messages = [
      {
          "role": "user",
          "content": [
+             {"type": "image"},
+             {"type": "text", "text": instruction}
+         ]
      }
  ]

  # Apply chat template
+ input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

+ # Tokenize inputs
  inputs = tokenizer(
+     image,
+     input_text,
+     add_special_tokens=False,
      return_tensors="pt",
  ).to("cuda")

+ # Generate the OCR output
+ with torch.no_grad():
      outputs = model.generate(
          **inputs,
+         max_new_tokens=512,
+         do_sample=False,
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id
      )

+ # Decode the prediction
+ generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
+ prediction = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

+ print("Extracted Arabic Text:")
  print(prediction)
  ```

+ ### Alternative Instructions
+
+ You can use different instructions based on your needs:
+
+ ```python
+ # For general OCR
+ instruction = "استخرج النص العربي الموجود في هذه الصورة بدقة."
+
+ # For preserving formatting
+ instruction = "استخرج النص العربي من الصورة مع الحفاظ على التنسيق والترقيم."
+
+ # English instruction
+ instruction = "Extract all Arabic text from this image accurately, preserving diacritics and formatting."
+ ```
+
+ ## Performance
+
+ This model is optimized for:
+ - High accuracy on printed Arabic text
+ - Preserving Arabic diacritics (تشكيل)
+ - Maintaining original text formatting
+ - Fast inference with 4-bit quantization
+
+ ### Evaluation Metrics
+
+ Performance metrics will be updated based on validation:
+ - **WER (Word Error Rate):** TBD
+ - **CER (Character Error Rate):** TBD
+
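+ In the meantime, WER and CER can be computed on a held-out set with the `jiwer` package. A minimal sketch, assuming parallel lists of reference transcriptions and model predictions (the variable names and sample strings are illustrative):
+
+ ```python
+ # Illustrative evaluation sketch: WER / CER with jiwer (pip install jiwer).
+ # `references` and `predictions` are assumed to be parallel lists of
+ # ground-truth and model-generated Arabic strings.
+ import jiwer
+
+ references = ["النص المرجعي الأول", "النص المرجعي الثاني"]
+ predictions = ["النص المستخرج الأول", "النص المستخرج الثاني"]
+
+ wer = jiwer.wer(references, predictions)   # word error rate
+ cer = jiwer.cer(references, predictions)   # character error rate
+ print(f"WER: {wer:.4f} | CER: {cer:.4f}")
+ ```
+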
+ ## Intended Use Cases
+
+ ✅ **Recommended for:**
+ - Extracting Arabic text from documents and images
+ - OCR on Arabic newspapers, books, and printed materials
+ - Digitizing Arabic text with diacritics
+ - Processing Arabic signage and labels
+ - Educational and research applications
+
+ ⚠️ **Limitations:**
+ - Primarily optimized for printed text
+ - Handwritten text recognition may vary in accuracy
+ - Best results with clear, well-lit, high-contrast images
+ - Requires a GPU for optimal inference speed
+
+ ## Model Architecture
+
+ This model uses the Qwen2.5-VL architecture with:
+ - Vision encoder for image processing
+ - Language model for text generation
+ - LoRA adapters for efficient fine-tuning
+ - 4-bit quantization for memory efficiency
+
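+ Because this repository ships the LoRA adapter (`adapter_model.safetensors`) rather than merged weights, the adapter can also be attached to the base model without Unsloth. A minimal sketch, assuming a recent `transformers` release with Qwen2.5-VL support and `peft` installed (argument choices are illustrative):
+
+ ```python
+ # Illustrative sketch: load the 4-bit base model with plain transformers and
+ # attach the LoRA adapter from this repository using PEFT.
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+ from peft import PeftModel
+
+ base_id = "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit"   # base model listed in the card metadata
+ adapter_id = "AhmedZaky1/qwen2.5-vl-7b-arabic-ocr"
+
+ base = Qwen2_5_VLForConditionalGeneration.from_pretrained(base_id, device_map="auto")
+ model = PeftModel.from_pretrained(base, adapter_id)   # applies the LoRA weights
+ processor = AutoProcessor.from_pretrained(base_id)
+ ```
+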
+ ## Training Process
+
+ 1. **Data Preparation:** Images preprocessed and converted to conversation format (see the sketch below)
+ 2. **Fine-tuning:** LoRA fine-tuning on both vision and language layers
+ 3. **Optimization:** Unsloth optimizations for faster training
+ 4. **Evaluation:** Character Error Rate (CER) and Word Error Rate (WER) metrics
+
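+ The data-preparation step above amounts to wrapping each image/text pair in the chat format the trainer expects. A minimal sketch, assuming dataset columns named `image` and `text` (the actual column names may differ):
+
+ ```python
+ # Illustrative sketch of step 1: convert one (image, text) sample into the
+ # conversation format used for vision fine-tuning. The column names "image"
+ # and "text" are assumptions; adjust them to the actual dataset schema.
+ instruction = "استخرج النص العربي الموجود في هذه الصورة بدقة."
+
+ def convert_to_conversation(sample):
+     return {
+         "messages": [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "image", "image": sample["image"]},
+                     {"type": "text", "text": instruction},
+                 ],
+             },
+             {
+                 "role": "assistant",
+                 "content": [{"type": "text", "text": sample["text"]}],
+             },
+         ]
+     }
+
+ # converted_dataset = [convert_to_conversation(s) for s in train_dataset]
+ ```
+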
+ ## Citation
+
+ If you use this model in your research or applications, please cite:

  ```bibtex
+ @misc{qwen2.5-vl-arabic-ocr-2025,
+   author = {AhmedZaky1},
+   title = {Qwen2.5-VL-7B Arabic OCR Fine-tuned},
    year = {2025},
    publisher = {Hugging Face},
+   journal = {Hugging Face Model Hub},
+   howpublished = {\url{https://huggingface.co/AhmedZaky1/qwen2.5-vl-7b-arabic-ocr}}
  }
  ```

+ ## Acknowledgments
+
+ - **Base Model:** [Qwen2.5-VL by Alibaba Cloud](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
+ - **Training Framework:** [Unsloth](https://github.com/unslothai/unsloth) for optimized training
+ - **Datasets:** oddadmix/qari Arabic OCR datasets
+ - **Quantization:** bitsandbytes for 4-bit quantization
+
+ ## Contact & Support
+
+ - **Model Repository:** https://huggingface.co/AhmedZaky1/qwen2.5-vl-7b-arabic-ocr
+ - **Issues:** Please report issues on the model repository
+ - **Developer:** AhmedZaky1
+
+ ## License
+
+ This model is released under the Apache 2.0 license. See the LICENSE file for details.
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8b8d32a8fbc8abf066070a11a1f3fae6d5e381fb1cc22793268df1b68c4e702e
  size 206188832

  version https://git-lfs.github.com/spec/v1
+ oid sha256:0846e8e7a199c4309cef8ef325aa76d185637171389859e5548a8ac59dc7abcd
  size 206188832