# Alef-OCR-Image2Html
An Arabic OCR model that transforms document images, including historical texts, scanned pages, and handwritten materials, into structured, semantic HTML.
## Key Features
- **Semantic HTML Output:** Generates structured HTML with semantic tags (`section`, `header`, `main`, `footer`, `table`, etc.)
- **Multi-format Support:** Handles various document types, including historical manuscripts, newspaper articles, scientific papers, invoices, and more
- **Arabic-Optimized:** Fine-tuned specifically for Arabic text recognition and structure extraction
- **Zero-cost Training:** Developed using Kaggle's free-tier computational resources
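Inference can be sketched as below, assuming the standard Qwen2.5-VL chat interface in `transformers`. The repo id is taken from this card's URL; the prompt wording, dtype, and generation settings are illustrative assumptions, not documented settings.

```python
def build_messages(image_path: str) -> list:
    """Chat-style message list in the format Qwen2.5-VL processors expect."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Convert this Arabic document image into semantic HTML."},
        ],
    }]

def ocr_to_html(image_path: str, model_id: str = "OussamaBenSlama/Alef-OCR-Image2Html") -> str:
    """Run one document image through the model and return the generated HTML string."""
    # Heavy imports kept local so build_messages() stays importable without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    prompt = processor.apply_chat_template(
        build_messages(image_path), tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[prompt], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=2048)
    new_tokens = generated[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```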
## Model Architecture
- **Base Model:** Qwen2.5-VL-Instruct
- **Fine-tuning Method:** QLoRA with 4-bit quantization
- **LoRA Configuration:** Rank 16 applied to all modules
- **Optimization:** Unsloth for memory efficiency and training speed
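A minimal sketch of these settings using `bitsandbytes` and `peft`. Only rank 16, 4-bit quantization, and all-module targeting are stated above; the NF4 quant type, double quantization, `lora_alpha`, and dropout values are common defaults assumed here.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization for QLoRA (NF4 + double quantization are assumptions,
# not documented by this card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Rank-16 LoRA applied to all linear modules, as the card states.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,        # assumption: alpha is not given in the card
    lora_dropout=0.0,     # assumption
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```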
## Training Data
The model was trained on a custom dataset of **28K image-HTML pairs** consisting of:
- **46% Web-scraped content** (~13K samples): Arabic Wikipedia articles with cleaned semantic HTML
- **54% Synthetic data** (~15K samples): Generated documents mimicking ~13 real-world formats with diverse layouts and styles

For more details, see the [arabic-image2html dataset](https://huggingface.co/datasets/OussamaBenSlama/arabic-image2html).
## Training Procedure
Training was performed in two stages:

**Stage 1:**
- Data: 40% of training dataset
- Learning rate: 5e-5
- LR scheduler: Linear

**Stage 2:**
- Data: 30% of training dataset (different split)
- Learning rate: 1e-5
- LR scheduler: Cosine
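The two decay shapes can be sketched as plain functions (warmup and total-step counts are omitted; this only illustrates how the stages differ):

```python
import math

def linear_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Stage 1 style: linear decay from base_lr down to 0."""
    return base_lr * (1.0 - step / total_steps)

def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Stage 2 style: half-cosine decay from base_lr down to 0."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

For example, `linear_lr(0, 1000, 5e-5)` returns the Stage 1 starting rate of 5e-5, and both schedules reach 0 at the final step; the cosine schedule decays more gently early on, which suits the lower-LR refinement stage.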
## Performance
Evaluated by the NAMAA community on an anonymous benchmark dataset:

| Model | WER | CER | BLEU |
|-------|-----|-----|------|
| Alef-OCR-Image2Html | 0.92 | **0.72** | **0.19** |
| Qari-OCR-v0.3 (baseline) | **0.84** | 0.73 | 0.17 |

**Key Results:**
- Better Character Error Rate (CER): 0.72 vs 0.73
- Better BLEU Score: 0.19 vs 0.17
- Higher Word Error Rate (WER): 0.92 vs 0.84, attributed to limited diacritics coverage in the training data
## Related Resources
- **Dataset:** [arabic-image2html](https://huggingface.co/datasets/OussamaBenSlama/arabic-image2html)
- **Training and Inference Notebooks:** [Available in the repository](https://github.com/OussamaBenSlama/Alef-OCR-Image2Html)
## Citation
```bibtex
@misc{alef_ocr_image2html_2025,
  title={Alef-OCR-Image2Html: Arabic OCR to Semantic HTML},
  author={Oussama Ben Slama},
  year={2025},
  howpublished={Hugging Face Models},
  url={https://huggingface.co/OussamaBenSlama/Alef-OCR-Image2Html}
}
```
## Acknowledgments
This work builds upon:
- The NAMAA community's state-of-the-art Qari-OCR model
## License
Apache 2.0