Alef-OCR-Image2Html
An Arabic OCR model that transforms document images, including historical texts, scanned pages, and handwritten materials, into structured, semantic HTML.
Key Features
- Semantic HTML Output: Generates structured HTML with semantic tags (section, header, main, footer, table, etc.)
- Multi-format Support: Handles various document types including historical manuscripts, newspaper articles, scientific papers, invoices, and more
- Arabic-Optimized: Fine-tuned specifically for Arabic text recognition and structure extraction
- Zero-cost Training: Developed entirely on Kaggle's free-tier compute resources
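Because the output is semantic HTML, downstream code can inspect its structure with an ordinary HTML parser. The following is a minimal sketch using only the Python standard library. `SAMPLE_HTML` is a hypothetical, hand-written stand-in for model output, not a real prediction, and `SemanticOutline` and `outline` are illustrative names.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stand-in for model output; real predictions will differ.
SAMPLE_HTML = """
<main>
  <header><h1>فاتورة</h1></header>
  <section>
    <table>
      <tr><th>الصنف</th><th>السعر</th></tr>
      <tr><td>كتاب</td><td>25</td></tr>
    </table>
  </section>
  <footer>شكرا لكم</footer>
</main>
"""

# Semantic tags named in the feature list above.
SEMANTIC_TAGS = {"section", "header", "main", "footer", "table"}


class SemanticOutline(HTMLParser):
    """Counts semantic tags and collects non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_TAGS:
            self.tag_counts[tag] += 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


def outline(html):
    """Return (semantic tag counts, list of text chunks) for an HTML string."""
    parser = SemanticOutline()
    parser.feed(html)
    parser.close()
    return dict(parser.tag_counts), parser.text_chunks


counts, chunks = outline(SAMPLE_HTML)
print(counts)
print(chunks)
```

The same approach can serve as a quick sanity check that a prediction contains the expected structural elements before it is rendered or stored.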
Model Architecture
- Base Model: Qwen2.5-VL-Instruct
- Fine-tuning Method: QLoRA with 4-bit quantization
- LoRA Configuration: Rank 16 applied to all modules
- Optimization: Unsloth for memory efficiency and training speed
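To give a sense of why rank-16 LoRA on a 4-bit base is cheap to train, the sketch below works through the arithmetic for a single linear layer. LoRA learns a low-rank update `B @ A` with `B` of shape `d_out x r` and `A` of shape `r x d_in`, so it adds `r * (d_in + d_out)` trainable parameters. The 2048 x 2048 layer shape is a hypothetical example, not a dimension taken from this model, and the memory figures ignore quantization constants and other overhead.

```python
def lora_trainable_params(d_in, d_out, rank=16):
    """Trainable parameters added by LoRA to one d_out x d_in linear layer."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen base weight matrix (bias ignored)."""
    return d_in * d_out


# Hypothetical layer shape, chosen only for illustration.
d_in, d_out = 2048, 2048

lora = lora_trainable_params(d_in, d_out, rank=16)
full = full_params(d_in, d_out)

# Rough storage for the frozen weight: 2 bytes/param in fp16,
# about 0.5 byte/param in 4-bit (quantization constants ignored).
fp16_bytes = full * 2
int4_bytes = full // 2

print(f"LoRA params: {lora}, base params: {full}, ratio: {lora / full:.4%}")
print(f"fp16: {fp16_bytes} bytes, 4-bit: {int4_bytes} bytes")
```

For this hypothetical layer, the adapter trains under 2% as many parameters as the base matrix holds, while the frozen weight takes roughly a quarter of its fp16 footprint.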
Training Data
The model was trained on a custom dataset of 28K image-HTML pairs consisting of:
- 46% Web-scraped content (~13K samples): Arabic Wikipedia articles with cleaned semantic HTML
- 54% Synthetic data (~15K samples): Generated documents mimicking ~13 real-world formats with diverse layouts and styles
For more details, see the arabic-image2html dataset.
Training Procedure
Training was performed in two stages:
Stage 1:
- Data: 40% of training dataset
- Learning rate: 5e-5
- LR scheduler: Linear
Stage 2:
- Data: 30% of training dataset (different split)
- Learning rate: 1e-5
- LR scheduler: Cosine
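The two-stage setup above can be sketched in plain Python. This illustrates the schedule shapes and the data split only; it is not the actual training code. It assumes no warmup, decay to zero, that "different split" means the two subsets are disjoint, and that the percentages refer to the roughly 28K-sample dataset. The step count and function names are hypothetical.

```python
import math
import random


def linear_lr(step, total_steps, peak_lr):
    """Linear decay from peak_lr at step 0 to 0 at total_steps (no warmup)."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)


def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup)."""
    progress = min(1.0, step / total_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def two_stage_split(n_samples, seed=0):
    """Shuffle indices and return disjoint 40% and 30% subsets (assumed disjoint)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n1 = n_samples * 40 // 100
    n2 = n_samples * 30 // 100
    return indices[:n1], indices[n1:n1 + n2]


stage1, stage2 = two_stage_split(28_000)
print(len(stage1), len(stage2))

TOTAL_STEPS = 1000  # hypothetical step count per stage
for step in (0, 500, 1000):
    print(step,
          linear_lr(step, TOTAL_STEPS, 5e-5),   # Stage 1 schedule
          cosine_lr(step, TOTAL_STEPS, 1e-5))   # Stage 2 schedule
```

Stage 2 starts from a learning rate five times lower than Stage 1, and the cosine curve stays near its peak longer before dropping off, which suits a gentler refinement pass on fresh data.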
Performance
Evaluated by the NAMAA community on an anonymous benchmark dataset:
| Model | WER | CER | BLEU |
| --- | --- | --- | --- |
| Alef-OCR-Image2Html | 0.92 | 0.72 | 0.19 |
| Qari-OCR-v0.3 (baseline) | 0.84 | 0.73 | 0.17 |
Key Results:
- Lower (better) Character Error Rate (CER): 0.72 vs 0.73
- Higher (better) BLEU score: 0.19 vs 0.17
- Higher (worse) Word Error Rate (WER): 0.92 vs 0.84, attributed to limited diacritics handling in the training data
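To illustrate why missing diacritics can hurt WER more than CER, the sketch below computes both metrics with the standard edit-distance definitions. The benchmark's exact normalization and scoring code are not described here, so this is an assumption-laden illustration and does not reproduce the evaluation. The example strings and helper names are hypothetical.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def strip_diacritics(text):
    """Drop Unicode combining marks (category Mn), which include Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


FATHA = "\u064e"  # Arabic fatha, a combining diacritic
# Reference carries diacritics on the first word; hypothesis has none.
ref = "ك" + FATHA + "ت" + FATHA + "ب" + FATHA + " الولد"
hyp = "كتب الولد"

print(cer(ref, hyp))                    # 3 missing marks out of 12 characters
print(wer(ref, hyp))                    # 1 of 2 words counted as wrong
print(wer(strip_diacritics(ref), hyp))  # identical once diacritics are removed
```

In this toy case the dropped diacritics cost a quarter of the characters but half of the words, since one missing mark makes the whole word count as an error. That asymmetry is consistent with a model scoring slightly better on CER while scoring worse on WER.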
Related Resources
Citation
@misc{alef_ocr_image2html_2025,
title={Alef-OCR-Image2Html: Arabic OCR to Semantic HTML},
author={Oussama Ben Slama},
year={2025},
howpublished={Hugging Face Models},
url={https://huggingface.co/OussamaBenSlama/Alef-OCR-Image2Html}
}
Acknowledgments
This work builds upon:
- The NAMAA community's state-of-the-art Qari-OCR model
License
Apache 2.0