Alef-OCR-Image2Html

An Arabic OCR model that transforms document images (including historical texts, scanned pages, and handwritten materials) into structured, semantic HTML.

Key Features

  • Semantic HTML Output: Generates structured HTML with semantic tags (section, header, main, footer, table, etc.)
  • Multi-format Support: Handles various document types including historical manuscripts, newspaper articles, scientific papers, invoices, and more
  • Arabic-Optimized: Fine-tuned specifically for Arabic text recognition and structure extraction
  • Zero-cost Training: Developed using only Kaggle's free-tier compute resources
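
One way to sanity-check the "Semantic HTML Output" feature is to count which semantic tags appear in a generated page. The sketch below uses only Python's standard `html.parser`. The `SEMANTIC_TAGS` set holds just the tags named in the list above (the "etc." means it is not exhaustive), and `sample_html` is a hand-written, hypothetical output, not something produced by the model.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags named in the feature list; the card says "etc.", so this set is an assumption.
SEMANTIC_TAGS = {"section", "header", "main", "footer", "table"}


class SemanticTagCounter(HTMLParser):
    """Counts opening tags that belong to SEMANTIC_TAGS."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag in SEMANTIC_TAGS:
            self.counts[tag] += 1


def count_semantic_tags(html: str) -> Counter:
    """Parse an HTML string and return a Counter of semantic opening tags."""
    parser = SemanticTagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical output, written by hand for illustration only.
sample_html = (
    '<main dir="rtl"><header><h1>عنوان</h1></header>'
    "<section><p>نص الفقرة</p>"
    "<table><tr><td>١</td></tr></table></section>"
    "<footer>تذييل</footer></main>"
)

print(count_semantic_tags(sample_html))
```

A check like this only confirms that semantic tags are present. It does not validate nesting or the accuracy of the recognized text.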

Model Architecture

  • Base Model: Qwen2.5-VL-Instruct
  • Fine-tuning Method: QLoRA with 4-bit quantization
  • LoRA Configuration: Rank 16 applied to all modules
  • Optimization: Unsloth for memory efficiency and training speed
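
To give a feel for why a rank-16 adapter is cheap to train, the sketch below counts the parameters LoRA adds to a single linear layer: a rank-r adapter on a weight of shape (d_out, d_in) adds two matrices with r × d_in and d_out × r entries. The layer size used here (4096 × 4096) is a hypothetical example and is not taken from the Qwen2.5-VL configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in)."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


RANK = 16                 # rank stated in the model card
D_IN, D_OUT = 4096, 4096  # hypothetical layer shape, for illustration only

added = lora_param_count(D_IN, D_OUT, RANK)
full = D_IN * D_OUT  # parameters in the frozen base weight

print(f"LoRA parameters added: {added}")
print(f"Frozen base parameters: {full}")
print(f"Ratio: {added / full}")
```

For this hypothetical layer the adapter is 1/128 of the base weight, which is the kind of saving that makes fine-tuning on free-tier hardware plausible.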

Training Data

The model was trained on a custom dataset of 28K image-HTML pairs consisting of:

  • 46% Web-scraped content (~13K samples): Arabic Wikipedia articles with cleaned semantic HTML
  • 54% Synthetic data (~15K samples): Generated documents mimicking ~13 real-world formats with diverse layouts and styles
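
As a quick consistency check on the split above, the arithmetic can be reproduced directly. This assumes "28K" means exactly 28,000 pairs, which the card does not state.

```python
TOTAL = 28_000  # assumption: "28K" taken as exactly 28,000 pairs

# Percentages as given in the card.
share_percent = {"web_scraped": 46, "synthetic": 54}

# Integer arithmetic avoids floating-point rounding surprises.
counts = {name: TOTAL * pct // 100 for name, pct in share_percent.items()}

for name, n in counts.items():
    print(f"{name}: {n}")
```

Under that assumption the split comes to 12,880 and 15,120 pairs, matching the rounded ~13K and ~15K figures.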

For more details, see the arabic-image2html dataset.

Training Procedure

Training was performed in two stages:

Stage 1:

  • Data: 40% of training dataset
  • Learning rate: 5e-5
  • LR scheduler: Linear

Stage 2:

  • Data: 30% of training dataset (different split)
  • Learning rate: 1e-5
  • LR scheduler: Cosine
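
The two schedules differ in how the learning rate decays from its starting value. The following is a minimal sketch of both curves in plain Python. It assumes no warmup and decay to zero, and uses a hypothetical step count (`STEPS = 1000`); the card specifies only the learning rates and scheduler types.

```python
import math


def linear_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Linear decay from peak_lr at step 0 to 0 at total_steps."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps."""
    progress = min(step / total_steps, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


STEPS = 1000          # hypothetical number of optimizer steps per stage
STAGE1_PEAK_LR = 5e-5  # Stage 1 learning rate from the card
STAGE2_PEAK_LR = 1e-5  # Stage 2 learning rate from the card

for step in (0, 250, 500, 750, 1000):
    stage1 = linear_lr(step, STEPS, STAGE1_PEAK_LR)
    stage2 = cosine_lr(step, STEPS, STAGE2_PEAK_LR)
    print(f"step {step:4d}  stage 1 (linear): {stage1:.2e}  stage 2 (cosine): {stage2:.2e}")
```

The cosine curve stays near its starting value longer and then falls off more steeply, while Stage 2 as a whole runs at a lower learning rate than Stage 1.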

Performance

Evaluated by the NAMAA community on an anonymous benchmark dataset:

Model                      WER    CER    BLEU
Alef-OCR-Image2Html        0.92   0.72   0.19
Qari-OCR-v0.3 (baseline)   0.84   0.73   0.17

Key Results:

  • Lower (better) Character Error Rate (CER): 0.72 vs 0.73
  • Higher (better) BLEU score: 0.19 vs 0.17
  • Higher (worse) Word Error Rate (WER): 0.92 vs 0.84, attributed to limited diacritics handling in the training data
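
To illustrate why missing diacritics can hurt WER more than CER, here is a minimal sketch of both metrics using the standard Levenshtein edit distance. The card does not say how the NAMAA benchmark tokenizes or normalizes text, so the whitespace tokenization and the absence of normalization here are assumptions, and the example sentence is made up.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


FATHA = "\u064e"  # Arabic short-vowel diacritic

# Made-up reference: a fully diacritized word followed by an undiacritized one.
reference = "\u0643\u064e\u062a\u064e\u0628\u064e \u0627\u0644\u0648\u0644\u062f"
# Hypothetical OCR output that recovers every letter but drops the diacritics.
hypothesis = reference.replace(FATHA, "")

print("CER:", cer(reference, hypothesis))
print("WER:", wer(reference, hypothesis))
```

In this toy case three dropped diacritics cost 3 of 12 characters (CER 0.25) but invalidate 1 of 2 words (WER 0.5), since a single missing mark makes the whole word count as an error.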

Citation

@misc{alef_ocr_image2html_2025,
  title={Alef-OCR-Image2Html: Arabic OCR to Semantic HTML},
  author={Oussama Ben Slama},
  year={2025},
  howpublished={Hugging Face Models},
  url={https://huggingface.co/OussamaBenSlama/Alef-OCR-Image2Html}
}

Acknowledgments

This work builds upon:

  • The NAMAA community's state-of-the-art Qari-OCR model

License

Apache 2.0
