---
language:
- ar
license: apache-2.0
tags:
- optical-character-recognition
- historical-manuscripts
- arabic
- vision-language
- document-ai
datasets:
- mdnaseif/hafith-combined-benchmark
- mdnaseif/hafith-synthetic-1m
metrics:
- character-error-rate
- word-error-rate
library_name: transformers
pipeline_tag: image-to-text
---

# HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition

State-of-the-art OCR model for historical Arabic manuscripts achieving **5.10% CER** through native-resolution encoding, Arabic-native tokenization, and synthetic pretraining.

## Model Summary

- **Architecture**: Vision-Language (Encoder-Decoder)
- **Vision Encoder**: SigLIP V2 NaFlex (400M params, preserves aspect ratios up to 20:1)
- **Text Decoder**: RoBERTa-Large (242M params, trained from scratch)
- **Tokenizer**: Aranizer-PBE-64k (64K Arabic vocabulary)
- **Total Parameters**: 642M
- **Training**: 10 days on a single RTX 4090
- **Inference Speed**: 12.5 samples/second (~45K lines/hour)

## Performance

| Dataset | CER | WER | Relative Improvement |
|---------|-----|-----|---------------------|
| MUHARAF | 8.35% | 24.76% | -71% vs TrOCR |
| KHATT | 11.21% | 37.36% | -37% vs TrOCR |
| RASAM | 4.95% | 18.94% | -86% vs TrOCR |
| **Combined** | **5.10%** | **18.05%** | **-57% vs TrOCR** |

**State-of-the-Art**: 41% relative improvement over the previous best (HATFormer, 8.60% CER)

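The CER and WER figures above are edit-distance-based metrics. As a reference for how such scores are typically computed (a plain Levenshtein-distance sketch, not the paper's actual evaluation script):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character edit distance over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: same metric over whitespace-delimited tokens."""
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("kitab", "kitap"))  # 0.2 — one substitution over five characters
```

A CER of 5.10% therefore means roughly one character-level edit per twenty reference characters.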
## Key Innovations

1. **Native-Resolution Encoding**: Preserves aspect ratios (5:1 to 20:1) using SigLIP V2 NaFlex with variable patch counts
2. **Arabic-Native Tokenization**: Aranizer achieves 4:1 compression over character-level approaches
3. **Synthetic Pretraining**: 1M manuscript-style samples across 350 Arabic fonts for from-scratch decoder training

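To illustrate innovation 1, here is a hypothetical sketch of NaFlex-style patch budgeting: the line image's patch grid is scaled to fit a fixed budget (512 here) while keeping its proportions, instead of being squashed to a fixed square as in BEiT-based baselines. The patch size of 16 and the helper function are illustrative assumptions, not the model's actual preprocessing code.

```python
import math

def patch_grid(height, width, patch=16, budget=512):
    """Pick a patch grid that keeps the image's aspect ratio within a patch budget."""
    rows = height / patch
    cols = width / patch
    # Shrink both axes by the same factor only if the grid exceeds the budget
    scale = min(1.0, math.sqrt(budget / (rows * cols)))
    # At least one patch per axis; floor to stay within budget
    return max(1, int(rows * scale)), max(1, int(cols * scale))

# A 64×1280 manuscript line (20:1) keeps a tall-thin 4×80 grid under the budget
rows, cols = patch_grid(64, 1280)
print(rows, cols, rows * cols)  # 4 80 320
```

A 384×384 square resize would distort such a line by a factor of 20 along one axis; the variable grid avoids that distortion entirely.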
## Usage

### Installation

```bash
pip install transformers pillow torch
```

### Basic Inference

```python
import torch
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Load model and tokenizer
model = AutoModel.from_pretrained("mdnaseif/hafith")
tokenizer = AutoTokenizer.from_pretrained("mdnaseif/hafith")

# Load a manuscript line image
image = Image.open("manuscript_line.jpg")

# Run OCR
with torch.no_grad():
    outputs = model.generate(image, max_length=64)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Recognized text: {text}")
```

### Batch Processing

```python
import torch
from datasets import load_dataset

# Load your manuscript dataset (line images with an "id" column)
dataset = load_dataset("your_dataset", split="test")

# Process in batches
batch_size = 32
for i in range(0, len(dataset), batch_size):
    batch = dataset[i:i + batch_size]
    images = [img.convert("RGB") for img in batch["image"]]

    with torch.no_grad():
        outputs = model.generate(images, max_length=64)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    for img_id, text in zip(batch["id"], texts):
        print(f"{img_id}: {text}")
```

## Model Architecture

```
Input Image (H×W×3)
        ↓
SigLIP V2 NaFlex Encoder
  - 400M parameters
  - Up to 512 patches (aspect-ratio preserving)
  - Output: 512×1152 embeddings
        ↓
Projection Layer (1152 → 1024)
        ↓
RoBERTa-Large Decoder
  - 24 layers, 16 attention heads
  - Trained from scratch with Aranizer
  - Cross-attention to visual features
        ↓
Aranizer Tokenizer (64K vocab)
        ↓
Arabic Text Output
```

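A minimal sketch of the projection step in the diagram above, bridging the encoder width (1152) to the decoder width (1024). A single linear layer is an assumption for illustration; the released model may use a more elaborate adapter.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Maps SigLIP patch embeddings (dim 1152) into the decoder's space (dim 1024)."""
    def __init__(self, enc_dim=1152, dec_dim=1024):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, 1152) → (batch, num_patches, 1024)
        return self.proj(patch_embeddings)

proj = VisualProjection()
visual = torch.randn(1, 512, 1152)   # up to 512 aspect-ratio-preserving patches
print(proj(visual).shape)  # torch.Size([1, 512, 1024])
```

The projected sequence is what the decoder's cross-attention layers attend to at every generation step.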
## Training Details

### Pretraining
- **Data**: 1M synthetic samples (900K train, 50K val, 50K test)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 32
- **Duration**: 8 days on RTX 4090
- **Precision**: Mixed FP16

### Fine-tuning
- **Data**: Combined benchmark (37K train, 2.9K val, 3.4K test)
- **Learning Rate**: 1e-5
- **Duration**: 2 days on RTX 4090

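The hyperparameters above describe a conventional AdamW setup. A toy sketch of one optimization step under those settings (a tiny dummy model stands in for HAFITH's 642M parameters, and the mixed-FP16 autocast/scaler machinery used in the real runs is omitted for brevity):

```python
import torch
import torch.nn as nn

# Toy stand-in for the encoder-decoder (shapes are illustrative only)
model = nn.Linear(32, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # 1e-5 for fine-tuning
loss_fn = nn.CrossEntropyLoss()

batch_size = 32
inputs = torch.randn(batch_size, 32)           # stands in for projected visual features
targets = torch.randint(0, 10, (batch_size,))  # stands in for Aranizer token ids

# One optimization step; real runs wrap the forward pass in torch.cuda.amp for FP16
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
```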
## Limitations

- Operates on pre-segmented text lines (requires line segmentation for full pages)
- Trained on modern Arabic vocabulary (may miss some archaic terms)
- Performance degrades on severely damaged manuscripts (>9% CER)
- Maximum line length limited by the 512-patch budget

## Comparison with Baselines

| Model | Encoder | Tokenizer | CER | WER |
|-------|---------|-----------|-----|-----|
| CRNN+CTC | CNN | Character-level | 14.82% | - |
| TrOCR-Base | BEiT-B (384×384) | RoBERTa | 13.41% | - |
| TrOCR-Large | BEiT-L (384×384) | RoBERTa | 11.73% | 31.82% |
| HATFormer | BEiT-L (384×384) | RoBERTa | 8.60% | - |
| **HAFITH (Ours)** | **SigLIP2 NaFlex** | **Aranizer** | **5.10%** | **18.05%** |

## Citation

```bibtex
@article{naseif2026hafith,
  title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

## Links

- 📄 **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX)
- 📊 **Benchmark Dataset**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- 🔢 **Synthetic Dataset**: [mdnaseif/hafith-synthetic-1m](https://huggingface.co/datasets/mdnaseif/hafith-synthetic-1m)
- 💻 **Code**: [GitHub](https://github.com/mdnaseif/hafith)

## Contact

- **Lead Author**: Mohammed Naseif (m.nasif@upm.edu.sa)
- **Institution**: University of Prince Mugrin, Medina, Saudi Arabia
- **Issues**: [GitHub Issues](https://github.com/mdnaseif/hafith/issues)

## License

Apache 2.0