Alef-OCR-Image2Html

An Arabic OCR model that transforms document images, including historical texts, scanned pages, and handwritten materials, into structured, semantic HTML.

Key Features

Semantic HTML Output: Generates structured HTML with semantic tags (section, header, main, footer, table, etc.)
Multi-format Support: Handles various document types including historical manuscripts, newspaper articles, scientific papers, invoices, and more
Arabic-Optimized: Fine-tuned specifically for Arabic text recognition and structure extraction
Zero-cost Training: Developed entirely on Kaggle's free-tier compute resources
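As a rough illustration of what "semantic HTML output" means in practice, the sketch below checks which semantic tags appear in a generated page. The tag set, the helper name `semantic_tags_used`, and the sample output string are all hypothetical and chosen only for illustration; they are not taken from the model or its dataset.

```python
from html.parser import HTMLParser

# Tags treated as "semantic" here; an illustrative set, not the model's official vocabulary.
SEMANTIC_TAGS = {"section", "header", "main", "footer", "article", "nav", "table"}


class TagCollector(HTMLParser):
    """Collects the names of all opening tags seen in an HTML string."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.tags.append(tag)


def semantic_tags_used(html: str) -> set:
    """Return the subset of SEMANTIC_TAGS that appears in the given HTML."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return SEMANTIC_TAGS & set(collector.tags)


# Hypothetical model output for a one-page document (not real model output).
sample_output = (
    "<main><header><h1>عنوان</h1></header>"
    "<section><p>نص الفقرة</p></section>"
    "<footer><p>1</p></footer></main>"
)

print(sorted(semantic_tags_used(sample_output)))
```

A check like this could serve as a cheap sanity filter on generated pages before any deeper evaluation.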

Model Architecture

Base Model: Qwen2.5-VL-Instruct
Fine-tuning Method: QLoRA with 4-bit quantization
LoRA Configuration: Rank 16 applied to all modules
Optimization: Unsloth for memory efficiency and training speed
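To give a sense of why a rank-16 LoRA adapter is cheap to train, the sketch below counts the trainable parameters LoRA adds to a single linear layer. The layer dimensions are hypothetical and are not the actual shapes of Qwen2.5-VL; only the rank of 16 comes from the configuration above.

```python
def lora_added_params(d_in: int, d_out: int, rank: int = 16) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA learns two low-rank matrices, A (rank x d_in) and B (d_out x rank),
    so the added parameter count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen base weight matrix (bias ignored)."""
    return d_in * d_out


# Hypothetical layer size, for illustration only.
d_in, d_out = 2048, 2048

added = lora_added_params(d_in, d_out)  # 16 * (2048 + 2048)
frozen = full_params(d_in, d_out)       # 2048 * 2048
ratio = added / frozen

print(f"added={added}, frozen={frozen}, trainable fraction={ratio:.4%}")
```

For this assumed layer the adapter trains about 1.6% as many parameters as the frozen weight matrix, while 4-bit quantization reduces the memory taken by the frozen weights themselves.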

Training Data

The model was trained on a custom dataset of 28K image-HTML pairs consisting of:

46% Web-scraped content (~13K samples): Arabic Wikipedia articles with cleaned semantic HTML
54% Synthetic data (~15K samples): Generated documents mimicking ~13 real-world formats with diverse layouts and styles

For more details, see the arabic-image2html dataset.

Training Procedure

Training was performed in two stages:

Stage 1:

Data: 40% of training dataset
Learning rate: 5e-5
LR scheduler: Linear

Stage 2:

Data: 30% of training dataset (different split)
Learning rate: 1e-5
LR scheduler: Cosine
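The two schedules can be sketched as plain functions of the training step. This is a minimal illustration that assumes no warmup, decay to zero, and a hypothetical step count of 1000 per stage; none of these details are stated above, and only the peak learning rates and scheduler types come from the stage descriptions.

```python
import math


def linear_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Linear decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress)


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; the real step counts are not given here

for step in (0, 250, 500, 750, 1000):
    stage1 = linear_lr(step, TOTAL_STEPS, 5e-5)  # Stage 1: linear, peak 5e-5
    stage2 = cosine_lr(step, TOTAL_STEPS, 1e-5)  # Stage 2: cosine, peak 1e-5
    print(f"step {step:4d}  stage1={stage1:.2e}  stage2={stage2:.2e}")
```

Under these assumptions the cosine schedule holds the learning rate closer to its peak early on and flattens out near the end, which suits a second, gentler pass at a lower peak rate.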

Performance

Evaluated by the NAMAA community on an anonymous benchmark dataset:

Model	WER (lower is better)	CER (lower is better)	BLEU (higher is better)
Alef-OCR-Image2Html	0.92	0.72	0.19
Qari-OCR-v0.3 (baseline)	0.84	0.73	0.17

Key Results:

Lower (better) Character Error Rate (CER): 0.72 vs 0.73
Higher (better) BLEU score: 0.19 vs 0.17
Higher (worse) Word Error Rate (WER): 0.92 vs 0.84, attributed to limited diacritics coverage in the training data
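The sketch below shows how WER and CER are commonly computed from edit distance, and why missing diacritics can hurt WER much more than CER. It uses the standard definitions; the benchmark's exact text normalization is not described here, so this is an illustration and not a reproduction of the evaluation. The example sentence is made up.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Reference with three fatha diacritics (U+064E) on the first word;
# the hypothesis is the same text with the diacritics dropped.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u0644\u062F"
hypothesis = "\u0643\u062A\u0628 \u0627\u0644\u0648\u0644\u062F"

print(f"CER = {cer(reference, hypothesis):.2f}")  # 3 deletions / 12 characters
print(f"WER = {wer(reference, hypothesis):.2f}")  # 1 wrong word / 2 words
```

Dropping three diacritics costs only a quarter of the characters but makes half of the words count as wrong, which is consistent with a model scoring slightly better on CER while scoring worse on WER.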

Related Resources

Dataset: arabic-image2html
Training and Inference Notebooks: Available in the repository

Citation

@misc{alef_ocr_image2html_2025,
  title={Alef-OCR-Image2Html: Arabic OCR to Semantic HTML},
  author={Oussama Ben Slama},
  year={2025},
  howpublished={Hugging Face Models},
  url={https://huggingface.co/OussamaBenSlama/Alef-OCR-Image2Html}
}

Acknowledgments

This work builds upon:

The NAMAA community's state-of-the-art Qari-OCR model

License

Apache 2.0
