AhmedZaky1 committed
Commit 18c5acb · verified · 1 Parent(s): 2dea9c9

Initial upload: Fine-tuned Qwen2.5-VL Arabic OCR model

Files changed (2)
  1. README.md +149 -111
  2. adapter_model.safetensors +1 -1
README.md CHANGED
@@ -1,194 +1,232 @@
  ---
- base_model: AhmedZaky1/DIMI-Arabic-OCR
- library_name: peft
  language:
  - ar
- pipeline_tag: image-text-to-text
  tags:
- - vision
  - ocr
  - arabic
  - qwen2.5-vl
- - lora
  - unsloth
- - trl
- - transformers
- license: apache-2.0
  datasets:
  - oddadmix/qari-0.2.2-news-dataset-large
  - oddadmix/qari-0.2.2-diacritics-dataset-large
  metrics:
  - wer
  - cer
  ---

- # DIMI Arabic OCR v2
-
- <div align="center">
-
- <img src="https://cdn-uploads.huggingface.co/production/uploads/65fb3ac20cfe262da2bb0fcc/uOuEn0LNhSVEBbOLwfFUu.jpeg" width="300"/>
-
- *Accurate Arabic OCR model V2 for extracting printed Arabic text from images*
-
- </div>

  ## Model Description

- **DIMI Arabic OCR v2** is a specialized Arabic Optical Character Recognition model fine-tuned on **Qwen2.5-VL-7B-Instruct** using LoRA adapters. This is the **second iteration**, building upon v1 with improved diacritics handling and enhanced accuracy across diverse Arabic text scenarios.
-
- - **Developed by:** Ahmed Zaky
- - **Base Model:** AhmedZaky1/DIMI-Arabic-OCR (v1)
- - **Original Base:** Qwen/Qwen2.5-VL-7B-Instruct
- - **Model Type:** Vision-Language Model (VLM) for Arabic OCR
- - **Language:** Arabic (ar)
  - **License:** Apache 2.0
- - **Fine-tuning Method:** LoRA (Low-Rank Adaptation) with 4-bit quantization
-
- ### Key Improvements Over v1
-
- ✅ **30% reduction in WER** on diacritics-heavy text
- ✅ **Enhanced training dataset** with balanced diacritics representation
- ✅ **Improved generalization** across news articles and formal documents
- ✅ **Better preservation** of text formatting and structure
-
- ## 📊 Performance Metrics
-
- ### Test Set Results (500 samples from 2,600)
-
- | Metric | Score | Description |
- |--------|-------|-------------|
- | **WER** | 0.3049 | Word Error Rate (↓ lower is better) |
- | **CER** | 0.1119 | Character Error Rate (↓ lower is better) |
- | **Perfect Predictions** | 23% | Exact matches with ground truth |
-
- ### Validation Set Results (100 samples)
-
- | Metric | Score |
- |--------|-------|
- | **WER** | 0.2315 |
- | **CER** | 0.0776 |
-
- ### Comparison with v1
-
- | Model | Test WER | Test CER | Val WER | Val CER |
- |-------|----------|----------|---------|---------|
- | **v1** | 0.404 | 0.226 | - | - |
- | **v2** | **0.3049** ↓ | **0.1119** ↓ | **0.2315** | **0.0776** |
-
- **Improvements:**
- - **WER reduced by ~24.5%** (0.404 → 0.3049)
- - **CER reduced by ~50.5%** (0.226 → 0.1119)

- ## 🎯 Intended Use
-
- ### Direct Use
-
- This model is designed for extracting Arabic text from images, including:
- - 📰 News articles and printed documents
- - 📝 Formal Arabic text with diacritics (تشكيل)
- - 🔢 Mixed Arabic text and numbers
- - 📄 Scanned documents and screenshots
-
- ### Example Use Case
  ```python
  from unsloth import FastVisionModel
  from PIL import Image
  import torch

- # Load model
  model, tokenizer = FastVisionModel.from_pretrained(
-     "AhmedZaky1/DIMI-Arabic-OCR-v2",
      load_in_4bit=True,
-     device_map="auto"
  )
  FastVisionModel.for_inference(model)

- # Load image
- image = Image.open("arabic_document.jpg")

- # Prepare prompt
- instruction = "استخرج النص العربي والأرقام الموجودة في هذه الصورة بدقة عالية."

  messages = [
      {
          "role": "user",
          "content": [
-             {"type": "image", "image": image},
-             {"type": "text", "text": instruction},
-         ],
      }
  ]

  # Apply chat template
- text = tokenizer.apply_chat_template(
-     messages, tokenize=False, add_generation_prompt=True
- )

- # Tokenize
  inputs = tokenizer(
-     text=[text],
-     images=[image],
-     padding=True,
      return_tensors="pt",
-     truncation=False
  ).to("cuda")

- # Generate
- with torch.inference_mode():
      outputs = model.generate(
          **inputs,
-         max_new_tokens=2048,
-         do_sample=False
      )

- # Decode
- generated_ids = [
-     out[len(inp):] for inp, out in zip(inputs.input_ids, outputs)
- ]
- prediction = tokenizer.batch_decode(
-     generated_ids,
-     skip_special_tokens=True
- )[0]

  print(prediction)
  ```

- ## 🧾 Training Data
-
- Fine-tuned on **11,000 Arabic text images** combining:
- 1. [oddadmix/qari-0.2.2-news-dataset-large](https://huggingface.co/datasets/oddadmix/qari-0.2.2-news-dataset-large)
- 2. [oddadmix/qari-0.2.2-diacritics-dataset-large](https://huggingface.co/datasets/oddadmix/qari-0.2.2-diacritics-dataset-large)
-
- The dataset covers Modern Standard Arabic with and without diacritics.
-
- ---
-
- ## 📚 Citation
-
- If you use this model, please cite:
-
  ```bibtex
- @misc{dimi-arabic-ocr-2025,
-   author = {Ahmed Zaky},
-   title = {DIMI-Arabic-OCR: Fine-tuned Qwen2.5-VL for Arabic Text Recognition},
    year = {2025},
    publisher = {Hugging Face},
-   howpublished = {\url{https://huggingface.co/AhmedZaky1/DIMI-Arabic-OCR}}
  }
  ```

- ---
-
- ### 🔗 Related Projects
- - [DIMI Models Series](https://huggingface.co/AhmedZaky1) — Arabic Vision & Language Models
-
- ---
-
- <div align="center">
-
- **Built with ❤️ by Ahmed Zaky**
-
- *Advancing Arabic NLP through state-of-the-art embedding models*
-
- </div>
  ---
  language:
  - ar
+ license: apache-2.0
  tags:
  - ocr
  - arabic
  - qwen2.5-vl
+ - vision-language-model
  - unsloth
+ - lora
+ - fine-tuned
+ base_model: unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit
  datasets:
  - oddadmix/qari-0.2.2-news-dataset-large
  - oddadmix/qari-0.2.2-diacritics-dataset-large
  metrics:
  - wer
  - cer
+ library_name: transformers
+ pipeline_tag: image-to-text
  ---

+ # Qwen2.5-VL-7B Arabic OCR Fine-tuned
+
+ This model is a fine-tuned version of [unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit) for Arabic Optical Character Recognition (OCR) tasks.

  ## Model Description

+ - **Developed by:** AhmedZaky1 (DIMI Models)
+ - **Model type:** Vision-Language Model (VLM)
+ - **Language(s):** Arabic
  - **License:** Apache 2.0
+ - **Fine-tuned from:** Qwen2.5-VL-7B-Instruct
+ - **Training approach:** LoRA (Low-Rank Adaptation)
+ - **Quantization:** 4-bit with bitsandbytes

+ ## Training Details
+
+ ### Training Data
+
+ The model was fine-tuned on a combination of two high-quality Arabic OCR datasets:
+ - **oddadmix/qari-0.2.2-news-dataset-large**: 13,000 samples of Arabic news text
+ - **oddadmix/qari-0.2.2-diacritics-dataset-large**: 13,000 samples with diacritics
+ - **Total training samples:** ~26,000 images with Arabic text annotations
+
+ ### Training Configuration
+
+ ```
+ - Training epochs: 2
+ - Batch size: 12 (per device)
+ - Gradient accumulation steps: 4
+ - Effective batch size: 48
+ - Learning rate: 3e-4
+ - Optimizer: AdamW 8-bit
+ - LR scheduler: Linear
+ - Weight decay: 0.01
+ - LoRA rank (r): 16
+ - LoRA alpha: 16
+ - Max sequence length: 2048
+ - Warmup steps: 50
+ ```
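+
+ For reference, these hyperparameters map onto Unsloth's vision fine-tuning setup with TRL's `SFTTrainer` roughly as follows. This is an illustrative sketch rather than the exact training script: `converted_dataset` stands in for the conversation-format dataset (see the Training Process section below), and the remaining arguments are assumptions derived from the values listed above.
+
+ ```python
+ # Illustrative training sketch (not the exact script used): Unsloth + TRL SFTTrainer
+ from unsloth import FastVisionModel, is_bf16_supported
+ from unsloth.trainer import UnslothVisionDataCollator
+ from trl import SFTTrainer, SFTConfig
+
+ model, tokenizer = FastVisionModel.from_pretrained(
+     "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
+     load_in_4bit=True,                      # 4-bit quantization
+     use_gradient_checkpointing="unsloth",   # memory-efficient checkpointing
+ )
+ model = FastVisionModel.get_peft_model(
+     model,
+     finetune_vision_layers=True,     # LoRA on vision layers
+     finetune_language_layers=True,   # LoRA on language layers
+     r=16,
+     lora_alpha=16,
+ )
+
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     data_collator=UnslothVisionDataCollator(model, tokenizer),
+     train_dataset=converted_dataset,   # samples in {"messages": [...]} format
+     args=SFTConfig(
+         per_device_train_batch_size=12,
+         gradient_accumulation_steps=4,   # effective batch size 48
+         num_train_epochs=2,
+         learning_rate=3e-4,
+         optim="adamw_8bit",
+         lr_scheduler_type="linear",
+         weight_decay=0.01,
+         warmup_steps=50,
+         max_seq_length=2048,
+         bf16=is_bf16_supported(),
+         fp16=not is_bf16_supported(),
+         remove_unused_columns=False,                    # keep image columns
+         dataset_text_field="",
+         dataset_kwargs={"skip_prepare_dataset": True},
+         output_dir="outputs",
+     ),
+ )
+ trainer.train()
+ ```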
+
+ ### Hardware & Optimization
+
+ - Trained using 4-bit quantization with gradient checkpointing
+ - Optimized with Unsloth for memory efficiency
+ - Compatible with consumer GPUs (tested on a GPU with 16GB+ VRAM)
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install unsloth transformers pillow torch bitsandbytes
+ ```
+
+ ### Quick Start
+
  ```python
+ # IMPORTANT: Import unsloth FIRST, before any transformers imports!
+ import unsloth
  from unsloth import FastVisionModel
  from PIL import Image
  import torch

+ # Load the fine-tuned model
  model, tokenizer = FastVisionModel.from_pretrained(
+     "AhmedZaky1/qwen2.5-vl-7b-arabic-ocr",
      load_in_4bit=True,
+     use_gradient_checkpointing="unsloth",
  )
+
+ # Set model to inference mode
  FastVisionModel.for_inference(model)

+ # Load your image
+ image = Image.open("path_to_your_arabic_image.jpg")

+ # Arabic instruction (customizable): "Extract the Arabic text in this image accurately."
+ instruction = "استخرج النص العربي الموجود في هذه الصورة بدقة."

+ # Prepare the conversation messages
  messages = [
      {
          "role": "user",
          "content": [
+             {"type": "image"},
+             {"type": "text", "text": instruction}
+         ]
      }
  ]

  # Apply chat template
+ input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

+ # Tokenize inputs
  inputs = tokenizer(
+     image,
+     input_text,
+     add_special_tokens=False,
      return_tensors="pt",
  ).to("cuda")

+ # Generate the OCR output
+ with torch.no_grad():
      outputs = model.generate(
          **inputs,
+         max_new_tokens=512,
+         do_sample=False,
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id
      )

+ # Decode the prediction
+ generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
+ prediction = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

+ print("Extracted Arabic Text:")
  print(prediction)
  ```

+ ### Alternative Instructions
+
+ You can use different instructions based on your needs:
+
+ ```python
+ # For general OCR
+ instruction = "استخرج النص العربي الموجود في هذه الصورة بدقة."
+
+ # For preserving formatting
+ instruction = "استخرج النص العربي من الصورة مع الحفاظ على التنسيق والترقيم."
+
+ # English instruction
+ instruction = "Extract all Arabic text from this image accurately, preserving diacritics and formatting."
+ ```
+
+ ## Performance
+
+ This model is optimized for:
+ - High accuracy on printed Arabic text
+ - Preserving Arabic diacritics (تشكيل)
+ - Maintaining original text formatting
+ - Fast inference with 4-bit quantization
+
+ ### Evaluation Metrics
+
+ Performance metrics will be updated based on validation:
+ - **WER (Word Error Rate):** TBD
+ - **CER (Character Error Rate):** TBD
+
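+ In the meantime, WER and CER can be computed on a held-out set with the `jiwer` package. A minimal sketch, assuming parallel lists of reference transcriptions and model predictions (the variable names and sample strings are illustrative):
+
+ ```python
+ # Illustrative evaluation sketch: WER / CER with jiwer (pip install jiwer).
+ # `references` and `predictions` are assumed to be parallel lists of
+ # ground-truth and model-generated Arabic strings.
+ import jiwer
+
+ references = ["النص المرجعي الأول", "النص المرجعي الثاني"]
+ predictions = ["النص المستخرج الأول", "النص المستخرج الثاني"]
+
+ wer = jiwer.wer(references, predictions)   # word error rate
+ cer = jiwer.cer(references, predictions)   # character error rate
+ print(f"WER: {wer:.4f} | CER: {cer:.4f}")
+ ```
+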
+ ## Intended Use Cases
+
+ ✅ **Recommended for:**
+ - Extracting Arabic text from documents and images
+ - OCR on Arabic newspapers, books, and printed materials
+ - Digitizing Arabic text with diacritics
+ - Processing Arabic signage and labels
+ - Educational and research applications
+
+ ⚠️ **Limitations:**
+ - Primarily optimized for printed text
+ - Handwritten text recognition may vary in accuracy
+ - Best results with clear, well-lit, high-contrast images
+ - Requires a GPU for optimal inference speed
+
+ ## Model Architecture
+
+ This model uses the Qwen2.5-VL architecture with:
+ - Vision encoder for image processing
+ - Language model for text generation
+ - LoRA adapters for efficient fine-tuning
+ - 4-bit quantization for memory efficiency
+
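+ Because this repository ships the LoRA adapter (`adapter_model.safetensors`) rather than merged weights, the adapter can also be attached to the base model without Unsloth. A minimal sketch, assuming a recent `transformers` release with Qwen2.5-VL support and `peft` installed (argument choices are illustrative):
+
+ ```python
+ # Illustrative sketch: load the 4-bit base model with plain transformers and
+ # attach the LoRA adapter from this repository using PEFT.
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+ from peft import PeftModel
+
+ base_id = "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit"   # base model listed in the card metadata
+ adapter_id = "AhmedZaky1/qwen2.5-vl-7b-arabic-ocr"
+
+ base = Qwen2_5_VLForConditionalGeneration.from_pretrained(base_id, device_map="auto")
+ model = PeftModel.from_pretrained(base, adapter_id)   # applies the LoRA weights
+ processor = AutoProcessor.from_pretrained(base_id)
+ ```
+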
+ ## Training Process
+
+ 1. **Data Preparation:** Images preprocessed and converted to conversation format (see the sketch below)
+ 2. **Fine-tuning:** LoRA fine-tuning on both vision and language layers
+ 3. **Optimization:** Unsloth optimizations for faster training
+ 4. **Evaluation:** Character Error Rate (CER) and Word Error Rate (WER) metrics
+
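+ The data-preparation step above amounts to wrapping each image/text pair in the chat format the trainer expects. A minimal sketch, assuming dataset columns named `image` and `text` (the actual column names may differ):
+
+ ```python
+ # Illustrative sketch of step 1: convert one (image, text) sample into the
+ # conversation format used for vision fine-tuning. The column names "image"
+ # and "text" are assumptions; adjust them to the actual dataset schema.
+ instruction = "استخرج النص العربي الموجود في هذه الصورة بدقة."
+
+ def convert_to_conversation(sample):
+     return {
+         "messages": [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "image", "image": sample["image"]},
+                     {"type": "text", "text": instruction},
+                 ],
+             },
+             {
+                 "role": "assistant",
+                 "content": [{"type": "text", "text": sample["text"]}],
+             },
+         ]
+     }
+
+ # converted_dataset = [convert_to_conversation(s) for s in train_dataset]
+ ```
+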
+ ## Citation
+
+ If you use this model in your research or applications, please cite:

  ```bibtex
+ @misc{qwen2.5-vl-arabic-ocr-2025,
+   author = {AhmedZaky1},
+   title = {Qwen2.5-VL-7B Arabic OCR Fine-tuned},
    year = {2025},
    publisher = {Hugging Face},
+   journal = {Hugging Face Model Hub},
+   howpublished = {\url{https://huggingface.co/AhmedZaky1/qwen2.5-vl-7b-arabic-ocr}}
  }
  ```

+ ## Acknowledgments
+
+ - **Base Model:** [Qwen2.5-VL by Alibaba Cloud](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
+ - **Training Framework:** [Unsloth](https://github.com/unslothai/unsloth) for optimized training
+ - **Datasets:** oddadmix/qari Arabic OCR datasets
+ - **Quantization:** bitsandbytes for 4-bit quantization
+
+ ## Contact & Support
+
+ - **Model Repository:** https://huggingface.co/AhmedZaky1/qwen2.5-vl-7b-arabic-ocr
+ - **Issues:** Please report issues on the model repository
+ - **Developer:** AhmedZaky1
+
+ ## License
+
+ This model is released under the Apache 2.0 license. See the LICENSE file for details.
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8b8d32a8fbc8abf066070a11a1f3fae6d5e381fb1cc22793268df1b68c4e702e
  size 206188832

  version https://git-lfs.github.com/spec/v1
+ oid sha256:0846e8e7a199c4309cef8ef325aa76d185637171389859e5548a8ac59dc7abcd
  size 206188832