
	---
	tags:
	- vision-language-model
	- ocr
	- darija
	- moroccan-arabic
	- open-source
	- fine-tuning
	- qwen2-vl
	- unsloth
	- qlora
	- text-recognition
	- computer-vision
	- arabic
	- natural-language-processing
	- atlasocr
	library_name: transformers
	datasets:
	- atlasia/atlasOCR-data
	inference: true
	---



	# AtlasOCR: The First Open-Source Darija OCR Model
	<center>

	<img src="https://cdn-uploads.huggingface.co/production/uploads/65f5c3528fb2b1535728138f/NiNo2NY_RpsVbZq7w3A7p.png" width=350 height=300/>
	</center>

	## Model Description

	AtlasOCR is the first open-source Optical Character Recognition (OCR) model specifically designed for Darija (Moroccan Arabic). It is built by fine-tuning the Qwen2.5-VL 3B Vision Language Model (VLM) using a comprehensive dataset of synthetic and real-world Darija text. AtlasOCR excels at extracting text from images, supporting a wide range of applications from digital preservation to social media analysis and accessibility for Moroccan content.
	- Blogpost: [AtlasOCR BlogPost](https://huggingface.co/blog/imomayiz/atlasocr)
	- Demo: [AtlasOCR-demo](https://huggingface.co/spaces/atlasia/AtlasOCR-demo)
	- Dataset: [`atlasia/atlasOCR-data`](https://huggingface.co/datasets/atlasia/atlasOCR-data)
	- AtlasOCRBench: [`atlasia/AtlasOCRBench`](https://huggingface.co/datasets/atlasia/AtlasOCRBench)

	### Key Features:
	- First Open-Source Darija OCR: Addresses a critical gap for developers and organizations working with Moroccan content.
	- Vision Language Model (VLM) Based: Leverages the power of VLMs to interpret both visual layout and linguistic context.
	- Efficient Fine-tuning: Utilizes QLoRA and Unsloth for parameter-efficient training, making it accessible on limited hardware.
	- State-of-the-Art Performance: Achieves low character and word error rates on Darija text (AtlasOCRBench) and generalizes well to standard Arabic OCR tasks (KITAB-Bench).
	- Comprehensive Data Curation: Trained on a unique dataset combining synthetic data from [OCRSmith](https://github.com/atlasia-ma/OCRSmith) and curated real-world sources (scanned books, social media, educational documents, cookbooks).

	## Intended Use

	AtlasOCR is intended for:
	- Text Extraction: Extracting Darija text from images, including social media posts, handwritten notes, scanned documents, and other visual content.
	- Digital Preservation: Converting historical Moroccan documents and manuscripts into digital, searchable formats.
	- Social Media Analysis: Understanding public discourse and sentiment in Darija-speaking communities.
	- Accessibility: Making visual content accessible to screen readers for individuals with visual impairments.
	- Research: Enabling large-scale text analysis of Moroccan content for linguistic and social studies.
	- As a Base Model: Further fine-tuning for specialized Darija OCR tasks or other VLM applications.

	## Limitations

	- Diacritics Handling: The model is primarily trained and evaluated on undiacritized text. Its performance on accurately recognizing or reconstructing Arabic diacritics (harakat) may vary.
	- Complex Layouts: While robust to many layouts, performance may degrade on highly complex, non-standard, or extremely cluttered document structures.
	- Language Specificity: Optimized for Darija and standard Arabic script. Performance on other Arabic dialects or languages using different scripts may not be optimal.

	## Model Details

	### Model Architecture
	AtlasOCR is based on the Qwen2.5-VL 3B architecture, which is a Vision Language Model (VLM). VLMs consist of three main components:
	1. Vision Encoder: Converts images into vector embeddings capturing visual properties.
	2. Modality Projection Module: Aligns visual features with the language model's representation space.
	3. Language Model: Processes aligned embeddings and text input to generate natural language outputs.
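
To make the data flow between the three components concrete, here is a toy sketch in plain Python. It is not the Qwen2.5-VL implementation: the function names, the random weights, and the tiny dimensions are all assumptions chosen only to show how image patches become visual tokens that are projected and placed in front of the text tokens.

```python
import random


def vision_encoder(image, patch=2, dim=4, seed=0):
    """Toy stand-in: split a 2-D grayscale image into patches and embed each one."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    # One random weight row per output dimension (patch * patch inputs each).
    weights = [[rng.uniform(-1, 1) for _ in range(patch * patch)] for _ in range(dim)]
    embeddings = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            pixels = [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            embeddings.append([sum(w * p for w, p in zip(row, pixels)) for row in weights])
    return embeddings  # one vector of size `dim` per patch


def project(visual, out_dim=6, seed=1):
    """Toy modality projection: a linear map into the language model's space."""
    rng = random.Random(seed)
    in_dim = len(visual[0])
    matrix = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return [[sum(m * v for m, v in zip(row, vec)) for row in matrix] for vec in visual]


def build_lm_input(projected, text_embeddings):
    """The language model consumes visual tokens followed by text tokens."""
    return projected + text_embeddings


image = [[0.0, 0.1, 0.2, 0.3] for _ in range(4)]  # 4x4 toy image -> four 2x2 patches
text_embeddings = [[0.0] * 6 for _ in range(3)]   # 3 toy prompt tokens of size 6
visual_tokens = project(vision_encoder(image))
sequence = build_lm_input(visual_tokens, text_embeddings)
print(len(visual_tokens), len(sequence), len(sequence[0]))  # 4 7 6
```

In a real VLM each stage is a trained neural network, but the shapes follow the same pattern: the projection exists so that visual tokens and text tokens share one embedding size.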

	### Training Data
	The model was fine-tuned on a unique and extensive dataset of Darija text, totaling 30,092 samples and 10.7 million words. The dataset composition is:
	- ~86% Synthetic Data: Generated using OCRSmith, an open-source toolkit for simulating real-world conditions (fonts, layouts, backgrounds, distortions).
	- ~14% Real-World Data: Curated from diverse sources:
	  - Scanned Darija books (e.g., العَرَبِيَّةُ الدَّارِجَةُ by Mohammed El-Madlaoui El-Mounabhi, علشان الصغيرة والصغير by Farouk ElMarrakchi).
	  - Social media images (poster-style PDFs with educational material).
	  - Educational documents (e.g., driving license exams).
	  - Cookbooks (scanned recipes in Darija).

	Real-world data was pseudo-labeled using Gemini 2.0 Flash and then human-annotated using Argilla for quality control.
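
As a rough sanity check on these figures, the short calculation below turns the stated shares into approximate sample counts and an average text length per sample. The percentages above are approximate, so the resulting counts are estimates, not official statistics.

```python
TOTAL_SAMPLES = 30_092
TOTAL_WORDS = 10_700_000
SYNTHETIC_SHARE = 0.86  # approximate share stated above

synthetic = round(TOTAL_SAMPLES * SYNTHETIC_SHARE)
real_world = TOTAL_SAMPLES - synthetic
words_per_sample = TOTAL_WORDS / TOTAL_SAMPLES

print(synthetic, real_world, round(words_per_sample))  # 25879 4213 356
```

An average of roughly 356 words per image suggests that most samples are dense, page-like images instead of short text snippets.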

	### Training Strategy
	- Base Model: Qwen2.5-VL 3B
	- Parameter-Efficient Fine-tuning:
	  - QLoRA (Quantized Low-Rank Adaptation): Enabled fine-tuning of the 4-bit quantized model, significantly reducing memory requirements.
	  - Unsloth: Accelerated training by up to 5x and reduced memory usage by 60% through optimized GPU kernels.
	- Key Hyperparameters (from ablation studies):
	  - LoRA Rank (r) and Alpha (α): both set to 128
	  - LoRA Dropout: 0.05
	  - Precision: 4-bit quantization
	  - Learning Rate: 2e-4 (with batch size 16 and gradient accumulation)
	  - Vision Layer Freezing: No (vision layers were fine-tuned for better performance).
	  - RSLoRA: Not enabled (it degraded performance on this task).
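
The sketch below illustrates what these LoRA settings imply numerically. It relies on the standard LoRA formulation, where the low-rank update is scaled by α / r (or by α / √r when RSLoRA is enabled) and adds r · (d_in + d_out) trainable parameters per adapted weight matrix. The `LoraSettings` class and the layer width of 2048 are hypothetical names and values used for illustration, not taken from the training code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    rank: int = 128
    alpha: int = 128
    dropout: float = 0.05
    use_rslora: bool = False

    def scaling(self) -> float:
        """Factor applied to the low-rank update B @ A."""
        if self.use_rslora:
            return self.alpha / math.sqrt(self.rank)
        return self.alpha / self.rank


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


settings = LoraSettings()
HIDDEN = 2048  # hypothetical layer width, for illustration only
added = lora_param_count(HIDDEN, HIDDEN, settings.rank)
full = HIDDEN * HIDDEN

print(settings.scaling())  # 1.0
print(added, added / full)  # 524288 0.125
```

With r = α the update is applied at scale 1. Enabling RSLoRA with the same values would raise the scale to about 11.3, a much larger effective update.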

	### Evaluation
	AtlasOCR was evaluated using Character Error Rate (CER) and Word Error Rate (WER) on two benchmarks:

	1. AtlasOCRBench (built in-house, publicly available on Hugging Face):
	   - Composition: 251 samples, including 55 from scanned Darija books and synthetic data from OCRSmith.
	   - Curation: Two-step process: pseudo-labeling with Gemini 2.0 Flash, then human annotation with Argilla.
	   - Normalization: Removal of Arabic diacritics and whitespace normalization before metric calculation.
	   - Primary Metric: CER, as it better reflects accuracy in Darija due to its non-standardized spelling.

	2. KITAB-Bench (Public):
	   - A large-scale, multi-domain benchmark for Arabic OCR and document understanding (8,800+ samples).
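
For readers who want to reproduce the metrics, here is a minimal sketch of CER and WER with the normalization described above. The exact diacritic set (U+064B to U+0652 plus U+0670) and the function names are assumptions; the benchmark's own evaluation script may differ in details.

```python
import re

# Assumption: Arabic harakat U+064B..U+0652 plus the superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize(text: str) -> str:
    """Strip diacritics, then collapse every whitespace run to a single space."""
    return " ".join(DIACRITICS.sub("", text).split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference), normalize(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


# Latin transliteration is used here only to keep the example easy to read.
reference, hypothesis = "wach nta hna", "wash nta hna"
print(round(cer(reference, hypothesis), 3))  # 0.083
print(round(wer(reference, hypothesis), 3))  # 0.333
```

The example shows why CER is the primary metric: a one-letter spelling variant costs one character out of twelve under CER, but one whole word out of three under WER.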

	#### Evaluation Results
	AtlasOCRBench Performance
	<center>

	<img src="https://cdn-uploads.huggingface.co/production/uploads/65f5c3528fb2b1535728138f/H2zllMtjgsG_vO49HiBPJ.png" width="700" height="700">
	</center>

	KITAB-Bench Performance
	<center>

	<img src="https://cdn-uploads.huggingface.co/production/uploads/65f5c3528fb2b1535728138f/GrbnqdSAT__bRzkKs_1lX.png" width="700" height="800">
	</center>

	> AtlasOCR demonstrates strong performance on both Darija-specific challenges and general Arabic OCR tasks, competing effectively with larger models.

	## How to Use

	### Installation
	```bash
	pip install unsloth
	```

	### Inference
	```python
	from PIL import Image
	from unsloth import FastVisionModel
	import torch


	class AtlasOCR:
	    def __init__(self, model_name: str = "atlasia/AtlasOCR-v0", max_tokens: int = 2000):
	        # Load the 4-bit quantized model and its processor through Unsloth.
	        self.model, self.processor = FastVisionModel.from_pretrained(
	            model_name,
	            device_map="auto",
	            load_in_4bit=True,
	            use_gradient_checkpointing="unsloth",
	        )
	        self.max_tokens = max_tokens
	        self.prompt = "Extract the text in the image. Give me the final text, nothing else."

	    def prepare_inputs(self, image: Image.Image):
	        # Single-turn chat: one image placeholder followed by the instruction.
	        messages = [
	            {
	                "role": "user",
	                "content": [
	                    {"type": "image"},
	                    {"type": "text", "text": self.prompt},
	                ],
	            }
	        ]

	        text = self.processor.apply_chat_template(
	            messages, tokenize=False, add_generation_prompt=True
	        )

	        inputs = self.processor(
	            image,
	            text,
	            add_special_tokens=False,
	            return_tensors="pt",
	        )
	        return inputs

	    def predict(self, image: Image.Image) -> str:
	        inputs = self.prepare_inputs(image)
	        inputs = inputs.to("cuda")

	        # The attention mask is cast to float32 before generation.
	        inputs["attention_mask"] = inputs["attention_mask"].to(torch.float32)

	        generated_ids = self.model.generate(
	            **inputs, max_new_tokens=self.max_tokens, use_cache=True
	        )
	        # Drop the prompt tokens so that only the newly generated text is decoded.
	        generated_ids_trimmed = [
	            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	        ]
	        output_text = self.processor.batch_decode(
	            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	        )
	        return output_text[0]

	    def __call__(self, image: Image.Image) -> str:
	        return self.predict(image)


	if __name__ == "__main__":
	    atlasocr = AtlasOCR()
	    img = Image.open("img.png")
	    output = atlasocr(image=img)
	    print(output)
	```
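
To process a whole folder instead of a single file, a small wrapper along the following lines may help. This is a sketch, not part of the AtlasOCR code: `collect_images`, `run_batch`, and the list of file extensions are hypothetical, and the OCR step is passed in as a plain callable so the helper can be tried without a GPU.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumption: extend as needed


def collect_images(folder: str) -> List[Path]:
    """Return the image files in `folder`, sorted for reproducible runs."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def run_batch(paths: Iterable[Path], ocr_fn: Callable[[Path], str]) -> Dict[str, str]:
    """Apply `ocr_fn` to every path; record failures instead of aborting the run."""
    results: Dict[str, str] = {}
    for path in paths:
        try:
            results[path.name] = ocr_fn(path)
        except Exception as exc:  # keep going if one image fails
            results[path.name] = f"ERROR: {exc}"
    return results
```

With the class above, `ocr_fn` could be `lambda p: atlasocr.predict(Image.open(p))`, which keeps model loading outside the loop so the weights are loaded only once.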

	## Ethical Considerations and Bias
	While AtlasOCR aims to be a valuable tool, it is important to acknowledge potential biases inherited from its training data:

	- Language Coverage: The model is specialized for Darija. Applying it to other languages or Arabic dialects without further fine-tuning might result in suboptimal performance or misinterpretations.
	- Content Bias: The real-world data sources (books, social media, educational materials) may reflect specific cultural or societal perspectives present in Moroccan content. Users should be mindful of this when interpreting results, especially in sensitive contexts.
	- Privacy: As with any OCR system, care should be taken when processing images containing personal or sensitive information. Users are responsible for ensuring compliance with privacy regulations.

	## Authors and Acknowledgments
	AtlasOCR was developed by AtlasIA, a Moroccan AI Community dedicated to building open-source AI models and datasets for Moroccan dialects.

	Special thanks to:
	- The Hugging Face team for providing the platform and resources for open-source AI.
	- The developers of Qwen2.5-VL 3B, Unsloth, and QLoRA for their foundational work.
	- The Argilla team for their collaborative annotation tool.

	## Project Resources
	- GitHub Repository: https://github.com/atlasia-ma/AtlasOCR
	- OCRSmith: https://github.com/atlasia-ma/OCRSmith
	- Hugging Face Model Hub: https://huggingface.co/atlasia/AtlasOCR
	- Hugging Face Demo Space: https://huggingface.co/spaces/atlasia/AtlasOCR-demo
	- AtlasIA Website: https://www.atlasia.ma/
	- Discord Community: https://discord.com/invite/Y4szwqJ6jB

	## Support AtlasIA
	If you find AtlasOCR useful and wish to support our mission of building open-source AI for Moroccan dialects, please consider donating:
	* Wise: https://wise.com/pay/business/atlasia1
	* Buy Me a Coffee: https://buymeacoffee.com/atlasia
	* GitHub Sponsors: https://github.com/sponsors/atlasia-ma?o=esb

	## Citation
	If you use AtlasOCR in your research, please cite:
	```bibtex
	@misc{atlasocr2025,
	  title={AtlasOCR: Open-Source OCR for Moroccan Darija with Vision–Language Models},
	  author={Imane Momayiz and Soufiane Ait Elaouad and Abdeljalil Elmajjodi and Haitame Bouanane},
	  year={2025},
	  howpublished={\url{https://huggingface.co/atlasia/AtlasOCR}},
	  organization={AtlasIA}
	}
	```

	## Contributions
	For more information about the AtlasOCR project, visit:
	- [AtlasOCR BlogPost](https://huggingface.co/blog/imomayiz/atlasocr)
	- [AtlasOCR Model](https://huggingface.co/atlasia/AtlasOCR)
	- [AtlasOCR Demo](https://huggingface.co/spaces/atlasia/AtlasOCR-demo)
	- [AtlasOCR Training Dataset](https://huggingface.co/datasets/atlasia/atlasOCR-data)
	- [GitHub Repository](https://github.com/atlasia-ma/AtlasOCR)