---
tags:
- vision-language-model
- ocr
- darija
- moroccan-arabic
- open-source
- fine-tuning
- qwen2-vl
- unsloth
- qlora
- text-recognition
- computer-vision
- arabic
- natural-language-processing
- atlasocr
library_name: transformers
datasets:
- atlasia/atlasOCR-data
inference: true
---
# AtlasOCR: The First Open-Source Darija OCR Model
<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65f5c3528fb2b1535728138f/NiNo2NY_RpsVbZq7w3A7p.png" width=350 height=300/>
</center>
## Model Description
**AtlasOCR** is the first open-source Optical Character Recognition (OCR) model specifically designed for **Darija (Moroccan Arabic)**. It is built by fine-tuning the **Qwen2.5-VL 3B** Vision Language Model (VLM) using a comprehensive dataset of synthetic and real-world Darija text. AtlasOCR excels at extracting text from images, supporting a wide range of applications from digital preservation to social media analysis and accessibility for Moroccan content.
- **Blogpost:** [AtlasOCR BlogPost](https://huggingface.co/blog/imomayiz/atlasocr)
- **Demo:** [AtlasOCR-demo](https://huggingface.co/spaces/atlasia/AtlasOCR-demo)
- **Dataset:** [`atlasia/atlasOCR-data`](https://huggingface.co/datasets/atlasia/atlasOCR-data)
- **AtlasOCRBench:** [`atlasia/AtlasOCRBench`](https://huggingface.co/datasets/atlasia/AtlasOCRBench)
### Key Features:
- **First Open-Source Darija OCR:** Addresses a critical gap for developers and organizations working with Moroccan content.
- **Vision Language Model (VLM) Based:** Leverages the power of VLMs to interpret both visual layout and linguistic context.
- **Efficient Fine-tuning:** Utilizes QLoRA and Unsloth for parameter-efficient training, making it accessible on limited hardware.
- **State-of-the-Art Performance:** Achieves high accuracy on Darija text and generalizes well to standard Arabic OCR tasks.
- **Comprehensive Data Curation:** Trained on a unique dataset combining synthetic data from [OCRSmith](https://github.com/atlasia-ma/OCRSmith) and curated real-world sources (scanned books, social media, educational documents, cookbooks).
## Intended Use
AtlasOCR is intended for:
- **Text Extraction:** Extracting Darija text from images, including social media posts, handwritten notes, scanned documents, and other visual content.
- **Digital Preservation:** Converting historical Moroccan documents and manuscripts into digital, searchable formats.
- **Social Media Analysis:** Understanding public discourse and sentiment in Darija-speaking communities.
- **Accessibility:** Making visual content accessible to screen readers for individuals with visual impairments.
- **Research:** Enabling large-scale text analysis of Moroccan content for linguistic and social studies.
- **As a Base Model:** Further fine-tuning for specialized Darija OCR tasks or other VLM applications.
## Limitations
- **Diacritics Handling:** The model is primarily trained and evaluated on undiacritized text. Its performance on accurately recognizing or reconstructing Arabic diacritics (harakat) may vary.
- **Complex Layouts:** While robust to many layouts, performance may degrade on highly complex, non-standard, or extremely cluttered document structures.
- **Language Specificity:** Optimized for Darija and standard Arabic script. Performance on other Arabic dialects or languages using different scripts may not be optimal.
## Model Details
### Model Architecture
AtlasOCR is based on the **Qwen2.5-VL 3B** architecture, which is a Vision Language Model (VLM). VLMs consist of three main components:
1. **Vision Encoder:** Converts images into vector embeddings capturing visual properties.
2. **Modality Projection Module:** Aligns visual features with the language model's representation space.
3. **Language Model:** Processes aligned embeddings and text input to generate natural language outputs.
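
The data flow between these three components can be illustrated with a toy sketch in plain Python. Everything here is a simplified stand-in chosen for illustration (the patch size, the weight values, and the function names are assumptions); it is not how Qwen2.5-VL is implemented, only a picture of how visual features are projected and then passed to the language model alongside text embeddings.

```python
from typing import List

Vector = List[float]


def vision_encoder(image: List[List[float]], patch: int = 2) -> List[Vector]:
    """Toy stand-in: cut the image into patch x patch tiles and flatten each one."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


def project(features: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Toy modality projection: one linear map from vision width to LM width."""
    return [[sum(f * w for f, w in zip(feat, column)) for column in weights]
            for feat in features]


def build_lm_input(image_embeds: List[Vector], text_embeds: List[Vector]) -> List[Vector]:
    """Simplification: the language model sees image embeddings, then text embeddings."""
    return image_embeds + text_embeds


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4x4 grayscale image
weights = [[0.25] * 4, [0.5] * 4, [1.0] * 4]                      # maps 4 dims to 3 dims
text_embeds = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]                  # two prompt tokens

visual = vision_encoder(image)       # 4 patches, 4 values each
aligned = project(visual, weights)   # 4 patches, 3 values each
sequence = build_lm_input(aligned, text_embeds)
print(len(visual), len(aligned[0]), len(sequence))
```

In a real VLM the encoder is a vision transformer and the projected image embeddings are spliced into the token sequence at placeholder positions, but the shapes line up in the same way.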
### Training Data
The model was fine-tuned on a unique and extensive dataset of Darija text, totaling **30,092 samples** and **10.7 million words**. The dataset composition is:
- **~86% Synthetic Data:** Generated using **OCRSmith**, an open-source toolkit for simulating real-world conditions (fonts, layouts, backgrounds, distortions).
- **~14% Real-World Data:** Curated from diverse sources:
  - Scanned Darija books (e.g., *العَرَبِيَّةُ الدَّارِجَةُ* by Mohammed El-Madlaoui El-Mounabhi, *علشان الصغيرة والصغير* by Farouk ElMarrakchi).
  - Social media images (poster-style PDFs with educational material).
  - Educational documents (e.g., driving license exams).
  - Cookbooks (scanned recipes in Darija).

Real-world data was pseudo-labeled using Gemini 2.0 Flash and then human-annotated using Argilla for quality control.
### Training Strategy
- **Base Model:** Qwen2.5-VL 3B
- **Parameter-Efficient Fine-tuning:**
  - **QLoRA (Quantized Low-Rank Adaptation):** Enabled fine-tuning of the 4-bit quantized model, significantly reducing memory requirements.
  - **Unsloth:** Accelerated training by up to 5x and reduced memory usage by 60% through optimized GPU kernels.
- **Key Hyperparameters (from ablation studies):**
  - LoRA Rank (r) and Alpha (α): 128
  - LoRA Dropout: 0.05
  - Precision: 4-bit quantization
  - Learning Rate: 2e-4 (with batch size 16 and gradient accumulation)
  - Vision Layer Freezing: No (vision layers were fine-tuned for better performance).
  - RSLoRA: Not enabled (showed degradation in performance for this task).
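
As a rough illustration, the hyperparameters above can be collected into a configuration dictionary, and the effect of the RSLoRA switch can be seen in the scaling factor applied to the low-rank update. This is a minimal sketch, not the authors' training script: the dictionary key names are assumptions loosely modelled on common PEFT/Unsloth keyword names, and it assumes that rank and alpha are both 128. The scaling formulas are the standard ones (`alpha / r` for LoRA, `alpha / sqrt(r)` for rank-stabilized LoRA).

```python
import math

# Values taken from the list above; key names are illustrative assumptions.
LORA_CONFIG = {
    "r": 128,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "load_in_4bit": True,
    "learning_rate": 2e-4,
    "batch_size": 16,
    "finetune_vision_layers": True,  # vision layers were not frozen
    "use_rslora": False,
}


def lora_scale(r: int, alpha: float, use_rslora: bool) -> float:
    """Return the factor that multiplies the low-rank update B @ A."""
    if use_rslora:
        return alpha / math.sqrt(r)  # rank-stabilized LoRA
    return alpha / r                 # standard LoRA


standard = lora_scale(LORA_CONFIG["r"], LORA_CONFIG["lora_alpha"], use_rslora=False)
rank_stabilized = lora_scale(LORA_CONFIG["r"], LORA_CONFIG["lora_alpha"], use_rslora=True)
print(f"standard LoRA scale: {standard:.2f}")        # 1.00
print(f"rsLoRA scale:        {rank_stabilized:.2f}")  # 11.31
```

With r = α = 128, enabling RSLoRA would multiply the adapter update by roughly 11 times more than standard LoRA at the same learning rate. That may be related to the degradation noted above, though the source does not state the cause.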
### Evaluation
AtlasOCR was evaluated using **Character Error Rate (CER)** and **Word Error Rate (WER)** on two benchmarks:
1. **AtlasOCRBench (in-house, available on Hugging Face):**
   - **Composition:** 251 samples, including 55 from scanned Darija books and synthetic data from OCRSmith.
   - **Curation:** Two-step pseudo-labeling with Gemini 2.0 Flash and human annotation with Argilla.
   - **Normalization:** Removal of Arabic diacritics and whitespace normalization before metric calculation.
   - **Primary Metric:** CER, as it better reflects accuracy in Darija due to its non-standardized spelling.
2. **KITAB-Bench (Public):**
   - A large-scale, multi-domain benchmark for Arabic OCR and document understanding (8,800+ samples).
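
For readers who want to reproduce this kind of scoring, the sketch below computes CER and WER after diacritic removal and whitespace normalization, using a plain Levenshtein distance. It is a minimal illustration, not the project's evaluation script: the exact set of characters the authors strip is not stated, so the `DIACRITICS` range (Arabic harakat U+064B to U+0652 plus the superscript alef U+0670) is an assumption, as are the function names.

```python
import re
from typing import Sequence

# Assumed diacritic set: harakat U+064B-U+0652 and superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def normalize(text: str) -> str:
    """Strip Arabic diacritics and collapse runs of whitespace."""
    return " ".join(DIACRITICS.sub("", text).split())


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


reference = "كِتَاب   زوين"   # diacritized, with extra spaces
hypothesis = "كتب زوين"       # one character missing in the first word
print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"WER = {wer(reference, hypothesis):.3f}")
```

The example shows why CER is the gentler metric for non-standardized spelling: a single missing character costs one ninth at the character level but invalidates a whole word at the word level.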
#### Evaluation Results
**AtlasOCRBench Performance**
<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65f5c3528fb2b1535728138f/H2zllMtjgsG_vO49HiBPJ.png" width="700" height="700">
</center>
**KITAB-Bench Performance**
<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65f5c3528fb2b1535728138f/GrbnqdSAT__bRzkKs_1lX.png" width="700" height="800">
</center>
> AtlasOCR demonstrates strong performance on both Darija-specific challenges and general Arabic OCR tasks, competing effectively with larger models.
## How to Use
### Installation
```bash
pip install unsloth
```
### Inference
```python
from PIL import Image
from unsloth import FastVisionModel
import torch


class AtlasOCR:
    """Thin wrapper around the fine-tuned model for single-image OCR."""

    def __init__(self, model_name: str = "atlasia/AtlasOCR-v0", max_tokens: int = 2000):
        # Load the model and its processor in 4-bit to reduce GPU memory use.
        self.model, self.processor = FastVisionModel.from_pretrained(
            model_name,
            device_map="auto",
            load_in_4bit=True,
            use_gradient_checkpointing="unsloth",
        )
        self.max_tokens = max_tokens
        self.prompt = "Extract the text in the image. Give me the final text, nothing else."

    def prepare_inputs(self, image: Image.Image):
        # Chat-style message: one image placeholder followed by the instruction.
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": self.prompt},
                ],
            }
        ]
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.processor(
            image,
            text,
            add_special_tokens=False,
            return_tensors="pt",
        )
        return inputs

    def predict(self, image: Image.Image) -> str:
        inputs = self.prepare_inputs(image)
        inputs = inputs.to("cuda")
        # Cast the attention mask to float32 before generation.
        inputs["attention_mask"] = inputs["attention_mask"].to(torch.float32)

        generated_ids = self.model.generate(
            **inputs, max_new_tokens=self.max_tokens, use_cache=True
        )
        # Drop the prompt tokens so only the newly generated text is decoded.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        return output_text[0]

    def __call__(self, image: Image.Image) -> str:
        return self.predict(image)


if __name__ == "__main__":
    atlasocr = AtlasOCR()
    img = Image.open("img.png")
    output = atlasocr(image=img)
    print(output)
```
## Ethical Considerations and Bias
While AtlasOCR aims to be a valuable tool, it is important to acknowledge potential biases inherited from its training data.
- **Language Coverage:** The model is specialized for Darija. Applying it to other languages or Arabic dialects without further fine-tuning might result in suboptimal performance or misinterpretations.
- **Content Bias:** The real-world data sources (books, social media, educational materials) may reflect specific cultural or societal perspectives present in Moroccan content. Users should be mindful of this when interpreting results, especially in sensitive contexts.
- **Privacy:** As with any OCR system, care should be taken when processing images containing personal or sensitive information. Users are responsible for ensuring compliance with privacy regulations.
## Authors and Acknowledgments
AtlasOCR was developed by **AtlasIA**, a Moroccan AI Community dedicated to building open-source AI models and datasets for Moroccan dialects.
* Special Thanks:
  - The **Hugging Face team** for providing the platform and resources for open-source AI.
  - The developers of **Qwen2.5-VL 3B, Unsloth, and QLoRA** for their foundational work.
  - The **Argilla team** for their collaborative annotation tool.
## Project Resources
- **GitHub Repository:** https://github.com/atlasia-ma/AtlasOCR
- **OCRSmith:** https://github.com/atlasia-ma/OCRSmith
- **Hugging Face Model Hub:** https://huggingface.co/atlasia/AtlasOCR
- **Hugging Face Demo Space:** https://huggingface.co/spaces/atlasia/AtlasOCR-demo
- **AtlasIA Website:** https://www.atlasia.ma/
- **Discord Community:** https://discord.com/invite/Y4szwqJ6jB
## Support AtlasIA
If you find AtlasOCR useful and wish to support our mission of building open-source AI for Moroccan dialects, please consider donating:
* Wise: https://wise.com/pay/business/atlasia1
* Buy Me a Coffee: https://buymeacoffee.com/atlasia
* GitHub Sponsors: https://github.com/sponsors/atlasia-ma?o=esb
## Citation
If you use AtlasOCR in your research, please cite:
```
@misc{atlasocr2025,
title={AtlasOCR: Open-Source OCR for Moroccan Darija with Vision–Language Models},
author={Imane Momayiz and Soufiane Ait Elaouad and Abdeljalil Elmajjodi and Haitame Bouanane},
year={2025},
howpublished={\url{https://huggingface.co/atlasia/AtlasOCR}},
organization={AtlasIA}
}
```
## Contributions
For more information about the AtlasOCR project, visit:
- [AtlasOCR BlogPost](https://huggingface.co/blog/imomayiz/atlasocr)
- [AtlasOCR Model](https://huggingface.co/atlasia/AtlasOCR)
- [AtlasOCR Demo](https://huggingface.co/spaces/atlasia/AtlasOCR-demo)
- [AtlasOCR Training Dataset](https://huggingface.co/datasets/atlasia/atlasOCR-data)
- [GitHub Repository](https://github.com/atlasia-ma/AtlasOCR)