---
language:
- ar
license: apache-2.0
tags:
- optical-character-recognition
- historical-manuscripts
- arabic
- vision-language
- document-ai
datasets:
- mdnaseif/hafith-combined-benchmark
- mdnaseif/hafith-synthetic-1m
metrics:
- character-error-rate
- word-error-rate
library_name: transformers
pipeline_tag: image-to-text
---

# HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition

A state-of-the-art OCR model for historical Arabic manuscripts, achieving **5.10% CER** (character error rate) through native-resolution encoding, Arabic-native tokenization, and synthetic pretraining.

## Model Summary

- **Architecture**: Vision-Language (Encoder-Decoder)
- **Vision Encoder**: SigLIP 2 NaFlex (400M params, preserves aspect ratios up to 20:1)
- **Text Decoder**: RoBERTa-Large (242M params, trained from scratch)
- **Tokenizer**: Aranizer-PBE-64k (64K Arabic vocabulary)
- **Total Parameters**: 642M
- **Training**: 10 days on a single RTX 4090
- **Inference Speed**: 12.5 samples/second (~45K lines/hour)

## Performance

| Dataset | CER | WER | CER Change vs. TrOCR |
|---------|-----|-----|----------------------|
| MUHARAF | 8.35% | 24.76% | -71% |
| KHATT | 11.21% | 37.36% | -37% |
| RASAM | 4.95% | 18.94% | -86% |
| **Combined** | **5.10%** | **18.05%** | **-57%** |

**State-of-the-Art**: a 36% relative improvement over the previous best published result (HATFormer, 8% CER)

## Key Innovations

1. **Native-Resolution Encoding**: preserves aspect ratios (5:1 to 20:1) using SigLIP 2 NaFlex with variable patch counts (see the illustrative sketch under Training Details)
2. **Arabic-Native Tokenization**: Aranizer yields a 4:1 sequence-length compression over character-level tokenization
3. **Synthetic Pretraining**: 1M manuscript-style samples rendered in 350 Arabic fonts enable training the decoder from scratch

## Usage

### Installation

```bash
pip install transformers pillow torch datasets
```

### Basic Inference

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("mdnaseif/hafith")
tokenizer = AutoTokenizer.from_pretrained("mdnaseif/hafith")

# Load a pre-segmented manuscript line image
image = Image.open("manuscript_line.jpg").convert("RGB")

# Run OCR
with torch.no_grad():
    outputs = model.generate(image, max_length=64)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Recognized text: {text}")
```

### Batch Processing

```python
import torch
from datasets import load_dataset

# Load your manuscript dataset (model and tokenizer loaded as above);
# pick the split that matches your use case
dataset = load_dataset("your_dataset", split="test")

# Process in batches
batch_size = 32
for i in range(0, len(dataset), batch_size):
    batch = dataset[i : i + batch_size]
    images = [img.convert("RGB") for img in batch["image"]]
    with torch.no_grad():
        outputs = model.generate(images, max_length=64)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for img_id, text in zip(batch["id"], texts):
        print(f"{img_id}: {text}")
```

## Model Architecture

```
Input Image (H×W×3)
        ↓
SigLIP 2 NaFlex Encoder
  - 400M parameters
  - Up to 512 patches (aspect-ratio preserving)
  - Output: 512×1152 embeddings
        ↓
Projection Layer (1152 → 1024)
        ↓
RoBERTa-Large Decoder
  - 24 layers, 16 attention heads
  - Trained from scratch with Aranizer
  - Cross-attention to visual features
        ↓
Aranizer Tokenizer (64K vocab)
        ↓
Arabic Text Output
```

## Training Details

### Pretraining

- **Data**: 1M synthetic samples (900K train, 50K val, 50K test)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 32
- **Duration**: 8 days on an RTX 4090
- **Precision**: mixed FP16

### Fine-tuning

- **Data**: combined benchmark (37K train, 2.9K val, 3.4K test)
- **Learning Rate**: 1e-5
- **Duration**: 2 days on an RTX 4090
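### Evaluation

The reported CER and WER are standard edit-distance metrics computed over characters and whitespace-separated words. A minimal scoring sketch using the `jiwer` library (one common choice, used here as an assumption; the paper's exact evaluation script, including any normalization of diacritics, is not part of this card):

```python
# pip install jiwer
import jiwer

# Ground-truth transcriptions and model outputs for a test split
references = ["reference transcription 1", "reference transcription 2"]
hypotheses = ["predicted transcription 1", "predicted transcription 2"]

cer = jiwer.cer(references, hypotheses)  # character error rate
wer = jiwer.wer(references, hypotheses)  # word error rate
print(f"CER: {cer:.2%}  WER: {wer:.2%}")
```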
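### Aspect-Ratio Handling (Illustrative Sketch)

To make the 512-patch budget concrete: a NaFlex-style preprocessor rescales each line image so its patch grid fits the budget while keeping the original aspect ratio, instead of squashing the line into a fixed square the way 384×384 baselines do. The sketch below only illustrates the idea; the 16×16 patch size and the rounding rules are assumptions, and in practice this step is performed by the model's own image processor.

```python
import math
from PIL import Image

PATCH = 16         # assumed SigLIP 2 NaFlex patch size
MAX_PATCHES = 512  # patch budget from the architecture diagram above

def naflex_style_resize(image: Image.Image) -> Image.Image:
    """Rescale so (w / PATCH) * (h / PATCH) <= MAX_PATCHES while
    approximately preserving the aspect ratio (illustrative only)."""
    w, h = image.size
    # Uniform scale that brings the pixel area under the patch budget
    scale = min(1.0, math.sqrt(MAX_PATCHES * PATCH * PATCH / (w * h)))
    # Snap both sides down to multiples of the patch size
    new_w = max(PATCH, int(w * scale) // PATCH * PATCH)
    new_h = max(PATCH, int(h * scale) // PATCH * PATCH)
    return image.resize((new_w, new_h), Image.BICUBIC)

# Example: a 20:1 line of 2000x100 px scales to 1616x80 px,
# i.e. 101 x 5 = 505 patches, within the 512-patch budget.
```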
## Limitations

- Operates on pre-segmented text lines (requires a line-segmentation step for full pages)
- Trained on modern Arabic vocabulary (may miss some archaic terms)
- Performance degrades on severely damaged manuscripts (>9% CER)
- Maximum line length is limited by the 512-patch budget

## Comparison with Baselines

| Model | Encoder | Tokenizer | CER | WER |
|-------|---------|-----------|-----|-----|
| CRNN+CTC | CNN | Character-level | 14.82% | - |
| TrOCR-Base | BEiT-B (384×384) | RoBERTa | 13.41% | - |
| TrOCR-Large | BEiT-L (384×384) | RoBERTa | 11.73% | 31.82% |
| HATFormer | BEiT-L (384×384) | RoBERTa | 8.60% | - |
| **HAFITH (Ours)** | **SigLIP 2 NaFlex** | **Aranizer** | **5.10%** | **18.05%** |

## Citation

```bibtex
@article{naseif2026hafith,
  title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

## Links

- 📄 **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX)
- 📊 **Benchmark Dataset**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- 🔢 **Synthetic Dataset**: [mdnaseif/hafith-synthetic-1m](https://huggingface.co/datasets/mdnaseif/hafith-synthetic-1m)
- 💻 **Code**: [GitHub](https://github.com/mdnaseif/hafith)

## Contact

- **Lead Author**: Mohammed Naseif (m.nasif@upm.edu.sa)
- **Institution**: University of Prince Mugrin, Medina, Saudi Arabia
- **Issues**: [GitHub Issues](https://github.com/mdnaseif/hafith/issues)

## License

Apache 2.0