---
language:
- ar
license: apache-2.0
tags:
- optical-character-recognition
- historical-manuscripts
- arabic
- vision-language
- document-ai
datasets:
- mdnaseif/hafith-combined-benchmark
- mdnaseif/hafith-synthetic-1m
metrics:
- character-error-rate
- word-error-rate
library_name: transformers
pipeline_tag: image-to-text
---
# HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition
State-of-the-art OCR model for historical Arabic manuscripts achieving **5.10% CER** through native-resolution encoding, Arabic-native tokenization, and synthetic pretraining.
## Model Summary
- **Architecture**: Vision-Language (Encoder-Decoder)
- **Vision Encoder**: SigLIP V2 NaFlex (400M params, preserves aspect ratios up to 20:1)
- **Text Decoder**: RoBERTa-Large (242M params, trained from scratch)
- **Tokenizer**: Aranizer-PBE-64k (64K Arabic vocabulary)
- **Total Parameters**: 642M
- **Training**: 10 days on single RTX 4090
- **Inference Speed**: 12.5 samples/second (~45K lines/hour)
## Performance
| Dataset | CER | WER | Relative Improvement |
|---------|-----|-----|---------------------|
| MUHARAF | 8.35% | 24.76% | -71% vs TrOCR |
| KHATT | 11.21% | 37.36% | -37% vs TrOCR |
| RASAM | 4.95% | 18.94% | -86% vs TrOCR |
| **Combined** | **5.10%** | **18.05%** | **-57% vs TrOCR** |
**State-of-the-Art**: 36% relative improvement over previous best (HATFormer, 8% CER)
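For reference, the CER and WER figures above follow the standard Levenshtein edit-distance definitions (distance over characters and over whitespace-separated words, normalized by reference length). A minimal sketch of those metrics — not the exact evaluation script used here:

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # single-row DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edit distance over characters / reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: edit distance over tokens / reference token count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```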
## Key Innovations
1. **Native-Resolution Encoding**: Preserves aspect ratios (5:1 to 20:1) using SigLIP V2 NaFlex with variable patch counts
2. **Arabic-Native Tokenization**: Aranizer achieves 4:1 compression over character-level approaches
3. **Synthetic Pretraining**: 1M manuscript-style samples across 350 Arabic fonts for from-scratch decoder training
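To illustrate the first innovation: under a NaFlex-style patch budget, a wide text line keeps its aspect ratio by scaling the patch *grid* rather than squashing the image to a fixed square. A sketch, assuming 16×16-pixel patches and the 512-patch budget; the model's exact resizing logic may differ:

```python
import math

PATCH = 16    # assumed patch size in pixels
BUDGET = 512  # maximum number of patches per image

def patch_grid(height, width, patch=PATCH, budget=BUDGET):
    """Return a (rows, cols) patch grid for an image, preserving its
    aspect ratio while keeping rows * cols within the patch budget."""
    rows = max(1, round(height / patch))
    cols = max(1, round(width / patch))
    # If over budget, shrink both dimensions by the same factor so the
    # aspect ratio is preserved.
    while rows * cols > budget:
        scale = math.sqrt(budget / (rows * cols))
        rows = max(1, int(rows * scale))
        cols = max(1, int(cols * scale))
    return rows, cols
```

For example, a 64×1280-pixel line (20:1) maps to a 4×80 grid of 320 patches — still 20:1, with no distortion of the script.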
## Usage
### Installation
```bash
pip install transformers pillow torch
```
### Basic Inference
```python
from transformers import AutoModel, AutoTokenizer
import torch
from PIL import Image
# Load model and tokenizer
model = AutoModel.from_pretrained("mdnaseif/hafith")
tokenizer = AutoTokenizer.from_pretrained("mdnaseif/hafith")
# Load manuscript image
image = Image.open("manuscript_line.jpg")
# Run OCR
with torch.no_grad():
    outputs = model.generate(image, max_length=64)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Recognized text: {text}")
```
### Batch Processing
```python
from datasets import load_dataset
# Load your manuscript dataset
dataset = load_dataset("your_dataset")
# Process in batches
batch_size = 32
for i in range(0, len(dataset), batch_size):
    batch = dataset[i:i+batch_size]
    images = [img.convert('RGB') for img in batch['image']]
    outputs = model.generate(images, max_length=64)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for img_id, text in zip(batch['id'], texts):
        print(f"{img_id}: {text}")
```
## Model Architecture
```
Input Image (H×W×3)
        ↓
SigLIP V2 NaFlex Encoder
  - 400M parameters
  - Up to 512 patches (aspect-ratio preserving)
  - Output: 512×1152 embeddings
        ↓
Projection Layer (1152 → 1024)
        ↓
RoBERTa-Large Decoder
  - 24 layers, 16 attention heads
  - Trained from scratch with Aranizer
  - Cross-attention to visual features
        ↓
Aranizer Tokenizer (64K vocab)
        ↓
Arabic Text Output
```
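At the shape level, the hand-off between encoder and decoder in the diagram above can be sketched with a plain matrix multiply standing in for the learned projection layer. The random arrays are placeholders; only the dimensions (512 patch embeddings of size 1152, projected to the decoder's 1024-d hidden size) come from the diagram:

```python
import numpy as np

rng = np.random.default_rng(0)

visual = rng.standard_normal((512, 1152))      # encoder output: patches x embed dim (placeholder)
W = rng.standard_normal((1152, 1024)) * 0.02   # projection weights (placeholder for learned layer)
b = np.zeros(1024)                             # projection bias

# Projected features serve as keys/values for the decoder's cross-attention.
projected = visual @ W + b                     # shape: (512, 1024)
```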
## Training Details
### Pretraining
- **Data**: 1M synthetic samples (900K train, 50K val, 50K test)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 32
- **Duration**: 8 days on RTX 4090
- **Precision**: Mixed FP16
### Fine-tuning
- **Data**: Combined benchmark (37K train, 2.9K val, 3.4K test)
- **Learning Rate**: 1e-5
- **Duration**: 2 days on RTX 4090
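A hedged sketch of a single optimization step with the hyperparameters listed above (AdamW, lr 1e-5, batch size 32, mixed FP16). A toy linear model stands in for HAFITH here, and on CPU the autocast/GradScaler pieces are disabled no-ops:

```python
import torch
from torch import nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(8, 4).to(device)  # toy stand-in for the 642M-parameter model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 8, device=device)  # batch of 32, as in pretraining
y = torch.randn(32, 4, device=device)

# Forward pass in FP16 on GPU (full precision on CPU).
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # loss scaling guards against FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```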
## Limitations
- Operates on pre-segmented text lines (requires line segmentation for full pages)
- Trained on modern Arabic vocabulary (may miss some archaic terms)
- Performance degrades on severely damaged manuscripts (>9% CER)
- Maximum line length limited by 512-patch budget
## Comparison with Baselines
| Model | Encoder | Tokenizer | CER | WER |
|-------|---------|-----------|-----|-----|
| CRNN+CTC | CNN | Character-level | 14.82% | - |
| TrOCR-Base | BEiT-B (384×384) | RoBERTa | 13.41% | - |
| TrOCR-Large | BEiT-L (384×384) | RoBERTa | 11.73% | 31.82% |
| HATFormer | BEiT-L (384×384) | RoBERTa | 8.60% | - |
| **HAFITH (Ours)** | **SigLIP2 NaFlex** | **Aranizer** | **5.10%** | **18.05%** |
## Citation
```bibtex
@article{naseif2026hafith,
title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2026}
}
```
## Links
- **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX)
- **Benchmark Dataset**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- **Synthetic Dataset**: [mdnaseif/hafith-synthetic-1m](https://huggingface.co/datasets/mdnaseif/hafith-synthetic-1m)
- **Code**: [GitHub](https://github.com/mdnaseif/hafith)
## Contact
- **Lead Author**: Mohammed Naseif (m.nasif@upm.edu.sa)
- **Institution**: University of Prince Mugrin, Medina, Saudi Arabia
- **Issues**: [GitHub Issues](https://github.com/mdnaseif/hafith/issues)
## License
Apache 2.0