---
language:
- ar
license: apache-2.0
tags:
- optical-character-recognition
- historical-manuscripts
- arabic
- vision-language
- document-ai
datasets:
- mdnaseif/hafith-combined-benchmark
- mdnaseif/hafith-synthetic-1m
metrics:
- character-error-rate
- word-error-rate
library_name: transformers
pipeline_tag: image-to-text
---

# HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition

State-of-the-art OCR model for historical Arabic manuscripts achieving **5.10% CER** through native-resolution encoding, Arabic-native tokenization, and synthetic pretraining.

## Model Summary

- **Architecture**: Vision-language encoder-decoder
- **Vision Encoder**: SigLIP 2 NaFlex (400M parameters, preserves aspect ratios up to 20:1)
- **Text Decoder**: RoBERTa-Large (242M parameters, trained from scratch)
- **Tokenizer**: Aranizer-PBE-64k (64K Arabic vocabulary)
- **Total Parameters**: 642M
- **Training**: 10 days on a single RTX 4090
- **Inference Speed**: 12.5 samples/second (~45K lines/hour)

## Performance

| Dataset | CER | WER | Relative Improvement |
|---------|-----|-----|----------------------|
| MUHARAF | 8.35% | 24.76% | -71% vs. TrOCR |
| KHATT | 11.21% | 37.36% | -37% vs. TrOCR |
| RASAM | 4.95% | 18.94% | -86% vs. TrOCR |
| **Combined** | **5.10%** | **18.05%** | **-57% vs. TrOCR** |

**State of the art**: 36% relative improvement over the previous best (HATFormer, 8% CER)
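
CER and WER are edit-distance metrics: the Levenshtein distance between the prediction and the reference, normalized by reference length, at the character and word level respectively. A minimal sketch of the standard computation (not the paper's evaluation script):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word Error Rate: same computation over whitespace-split tokens."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())
```

Note that both metrics can exceed 100% when the hypothesis is much longer than the reference.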

## Key Innovations

1. **Native-resolution encoding**: Preserves aspect ratios (5:1 to 20:1) using SigLIP 2 NaFlex with variable patch counts
2. **Arabic-native tokenization**: Aranizer achieves 4:1 compression over character-level approaches
3. **Synthetic pretraining**: 1M manuscript-style samples across 350 Arabic fonts for from-scratch decoder training
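
The 4:1 compression claim can be checked for any tokenizer by comparing character count to token count: fewer tokens per line means shorter decoding sequences. A hedged sketch (the whitespace tokenizer below is a stand-in for illustration, not Aranizer):

```python
def compression_ratio(text, tokenize):
    """Characters per token: higher means fewer decoder steps per text line."""
    tokens = tokenize(text)
    return len(text) / len(tokens)

# Stand-in tokenizer for illustration only; to measure the real model's
# compression, pass AutoTokenizer.from_pretrained("mdnaseif/hafith").tokenize.
whitespace_tokenize = str.split

ratio = compression_ratio("a sample line of manuscript text", whitespace_tokenize)
```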

## Usage

### Installation

```bash
pip install transformers pillow torch
```

### Basic Inference

```python
import torch
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Load model and tokenizer
model = AutoModel.from_pretrained("mdnaseif/hafith")
tokenizer = AutoTokenizer.from_pretrained("mdnaseif/hafith")
model.eval()

# Load a pre-segmented manuscript line image
image = Image.open("manuscript_line.jpg").convert("RGB")

# Run OCR
with torch.no_grad():
    outputs = model.generate(image, max_length=64)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Recognized text: {text}")
```

### Batch Processing

```python
from datasets import load_dataset

# Load your manuscript dataset (expects `image` and `id` columns)
dataset = load_dataset("your_dataset", split="test")

# Process in batches
batch_size = 32
for i in range(0, len(dataset), batch_size):
    batch = dataset[i:i + batch_size]
    images = [img.convert("RGB") for img in batch["image"]]

    with torch.no_grad():
        outputs = model.generate(images, max_length=64)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    for img_id, text in zip(batch["id"], texts):
        print(f"{img_id}: {text}")
```

## Model Architecture

```
Input Image (H×W×3)
        ↓
SigLIP 2 NaFlex Encoder
  - 400M parameters
  - Up to 512 patches (aspect-ratio preserving)
  - Output: 512×1152 embeddings
        ↓
Projection Layer (1152 → 1024)
        ↓
RoBERTa-Large Decoder
  - 24 layers, 16 attention heads
  - Trained from scratch with Aranizer
  - Cross-attention to visual features
        ↓
Aranizer Tokenizer (64K vocab)
        ↓
Arabic Text Output
```
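
The NaFlex-style patch budget can be illustrated as follows: choose a patch grid that keeps the line's aspect ratio while staying within the 512-patch budget, rather than squashing every image to a fixed square. This is a sketch of the idea, assuming 16×16-pixel patches; it is not the model's actual resizing code:

```python
import math

PATCH = 16    # assumed patch size in pixels (illustrative)
BUDGET = 512  # maximum patch count, from the diagram above

def patch_grid(height, width, patch=PATCH, budget=BUDGET):
    """Pick (rows, cols) with rows*cols <= budget, preserving the H:W ratio."""
    rows = max(1, round(height / patch))
    cols = max(1, round(width / patch))
    if rows * cols > budget:
        # Scale both dimensions down uniformly so the aspect ratio survives
        scale = math.sqrt(budget / (rows * cols))
        rows = max(1, math.floor(rows * scale))
        cols = max(1, math.floor(cols * scale))
    return rows, cols

# A 64x1280 text line (20:1 aspect ratio) fits without downscaling
rows, cols = patch_grid(64, 1280)  # 4 rows x 80 cols = 320 patches
```

A fixed 384×384 resize (as in the TrOCR baselines below) would compress that same 20:1 line horizontally by a factor of ~20, which is what the variable grid avoids.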

## Training Details

### Pretraining

- **Data**: 1M synthetic samples (900K train, 50K val, 50K test)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 32
- **Duration**: 8 days on RTX 4090
- **Precision**: Mixed FP16

### Fine-tuning

- **Data**: Combined benchmark (37K train, 2.9K val, 3.4K test)
- **Learning Rate**: 1e-5
- **Duration**: 2 days on RTX 4090

## Limitations

- Operates on pre-segmented text lines (full pages require a separate line-segmentation step)
- Trained on modern Arabic vocabulary (may miss some archaic terms)
- Performance degrades on severely damaged manuscripts (>9% CER)
- Maximum line length is limited by the 512-patch budget

## Comparison with Baselines

| Model | Encoder | Tokenizer | CER | WER |
|-------|---------|-----------|-----|-----|
| CRNN+CTC | CNN | Character-level | 14.82% | - |
| TrOCR-Base | BEiT-B (384×384) | RoBERTa | 13.41% | - |
| TrOCR-Large | BEiT-L (384×384) | RoBERTa | 11.73% | 31.82% |
| HATFormer | BEiT-L (384×384) | RoBERTa | 8.60% | - |
| **HAFITH (ours)** | **SigLIP 2 NaFlex** | **Aranizer** | **5.10%** | **18.05%** |

## Citation

```bibtex
@article{naseif2026hafith,
  title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

## Links

- **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX)
- **Benchmark Dataset**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- **Synthetic Dataset**: [mdnaseif/hafith-synthetic-1m](https://huggingface.co/datasets/mdnaseif/hafith-synthetic-1m)
- **Code**: [GitHub](https://github.com/mdnaseif/hafith)

## Contact

- **Lead Author**: Mohammed Naseif (m.nasif@upm.edu.sa)
- **Institution**: University of Prince Mugrin, Medina, Saudi Arabia
- **Issues**: [GitHub Issues](https://github.com/mdnaseif/hafith/issues)

## License

Apache 2.0