---
language:
- ar
license: apache-2.0
tags:
- optical-character-recognition
- historical-manuscripts
- arabic
- vision-language
- document-ai
datasets:
- mdnaseif/hafith-combined-benchmark
- mdnaseif/hafith-synthetic-1m
metrics:
- character-error-rate
- word-error-rate
library_name: transformers
pipeline_tag: image-to-text
---
# HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition
State-of-the-art OCR model for historical Arabic manuscripts, achieving **5.10% CER** on the combined benchmark through native-resolution encoding, Arabic-native tokenization, and synthetic pretraining.
## Model Summary
- **Architecture**: Vision-Language (Encoder-Decoder)
- **Vision Encoder**: SigLIP V2 NaFlex (400M params, preserves aspect ratios up to 20:1)
- **Text Decoder**: RoBERTa-Large (242M params, trained from scratch)
- **Tokenizer**: Aranizer-PBE-64k (64K Arabic vocabulary)
- **Total Parameters**: 642M
- **Training**: 10 days on single RTX 4090
- **Inference Speed**: 12.5 samples/second (~45K lines/hour)
## Performance
| Dataset | CER | WER | CER change vs TrOCR |
|---------|-----|-----|---------------------|
| MUHARAF | 8.35% | 24.76% | -71% |
| KHATT | 11.21% | 37.36% | -37% |
| RASAM | 4.95% | 18.94% | -86% |
| **Combined** | **5.10%** | **18.05%** | **-57%** |
**State-of-the-Art**: 36% relative improvement over the previous best (HATFormer, 8% CER)
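CER and WER are standard edit-distance metrics over characters and whitespace-separated words, respectively. As a quick way to score your own predictions, here is a minimal sketch using the `jiwer` package (`pip install jiwer`); the paper's exact text normalization (e.g., diacritic handling) is not specified here, so scores may differ slightly:
```python
import jiwer

reference = "بسم الله الرحمن الرحيم"
hypothesis = "بسم الل الرحمن الرحيم"  # one character dropped

# jiwer computes WER over whitespace-separated words, CER over characters.
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```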
## Key Innovations
1. **Native-Resolution Encoding**: Preserves aspect ratios (5:1 to 20:1) using SigLIP V2 NaFlex with variable patch counts (see the sketch after this list)
2. **Arabic-Native Tokenization**: Aranizer achieves 4:1 compression over character-level approaches
3. **Synthetic Pretraining**: 1M manuscript-style samples across 350 Arabic fonts for from-scratch decoder training
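The patch-budget logic behind innovation 1 can be sketched as follows. This is a hypothetical illustration of NaFlex-style patching, not the released implementation; the `patch` and `budget` values mirror the 512-patch figure in the architecture diagram below:
```python
import math

def naflex_patch_grid(height, width, patch=16, budget=512):
    """Hypothetical sketch: tile the image at native resolution and only
    downscale (aspect ratio preserved) when the patch count would
    exceed the token budget."""
    rows, cols = math.ceil(height / patch), math.ceil(width / patch)
    if rows * cols > budget:
        scale = math.sqrt(budget / (rows * cols))  # uniform downscale
        rows, cols = max(1, int(rows * scale)), max(1, int(cols * scale))
    return rows, cols

# A 20:1 manuscript line at 96 x 1920 pixels:
print(naflex_patch_grid(96, 1920))  # 6x120 = 720 > 512, so (5, 101)
```
Resizing the same line to a fixed square would stretch characters by roughly a factor of 20 horizontally; keeping the grid proportional to the input is what this scheme avoids.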
## Usage
### Installation
```bash
pip install transformers pillow torch
```
### Basic Inference
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("mdnaseif/hafith")
tokenizer = AutoTokenizer.from_pretrained("mdnaseif/hafith")
model.eval()

# Load a pre-segmented manuscript line image
image = Image.open("manuscript_line.jpg").convert("RGB")

# Run OCR
with torch.no_grad():
    outputs = model.generate(image, max_length=64)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Recognized text: {text}")
```
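To see the tokenizer compression that innovation 2 refers to, you can count tokens directly with the `tokenizer` loaded above. The example line is illustrative; actual compression varies with the text:
```python
# Illustrative only: compare Aranizer subword count to character count.
line = "الحمد لله رب العالمين"
subword_ids = tokenizer(line)["input_ids"]
print(f"{len(subword_ids)} subword tokens vs {len(line)} characters")
# Shorter token sequences mean fewer decoding steps per line.
```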
### Batch Processing
```python
from datasets import load_dataset

# Load your manuscript dataset (placeholder name; pick the split you need)
dataset = load_dataset("your_dataset", split="train")

# Process in batches
batch_size = 32
for i in range(0, len(dataset), batch_size):
    batch = dataset[i : i + batch_size]
    images = [img.convert("RGB") for img in batch["image"]]
    outputs = model.generate(images, max_length=64)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for img_id, text in zip(batch["id"], texts):
        print(f"{img_id}: {text}")
```
## Model Architecture
```
Input Image (H×W×3)
        ↓
SigLIP V2 NaFlex Encoder
  - 400M parameters
  - Up to 512 patches (aspect-ratio preserving)
  - Output: 512×1152 embeddings
        ↓
Projection Layer (1152 → 1024)
        ↓
RoBERTa-Large Decoder
  - 24 layers, 16 attention heads
  - Trained from scratch with Aranizer
  - Cross-attention to visual features
        ↓
Aranizer Detokenization (64K vocab)
        ↓
Arabic Text Output
```
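For concreteness, the encoder-to-decoder bridge in the diagram can be expressed as a module like the following. This is an illustrative sketch (the module name and wiring are assumptions, not the released code); it shows only the 1152 → 1024 projection that feeds the decoder's cross-attention:
```python
import torch
import torch.nn as nn

class VisionToTextBridge(nn.Module):
    """Hypothetical sketch of the projection described above: 1152-dim
    SigLIP2 patch embeddings -> 1024-dim decoder hidden size."""
    def __init__(self, vision_dim=1152, decoder_dim=1024):
        super().__init__()
        self.proj = nn.Linear(vision_dim, decoder_dim)

    def forward(self, patch_embeddings):  # (batch, num_patches, 1152)
        # The projected features are consumed as encoder hidden states
        # by the decoder's cross-attention layers.
        return self.proj(patch_embeddings)

bridge = VisionToTextBridge()
dummy = torch.randn(2, 512, 1152)  # two images, 512 patches each
print(bridge(dummy).shape)         # torch.Size([2, 512, 1024])
```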
## Training Details
### Pretraining
- **Data**: 1M synthetic samples (900K train, 50K val, 50K test)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 32
- **Duration**: 8 days on RTX 4090
- **Precision**: Mixed FP16
### Fine-tuning
- **Data**: Combined benchmark (37K train, 2.9K val, 3.4K test)
- **Learning Rate**: 1e-5
- **Duration**: 2 days on RTX 4090
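For reference, the hyperparameters above map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a minimal sketch assuming the standard `Trainer` stack (AdamW is its default optimizer); output directories are placeholders and the authors' exact training script is not reproduced here:
```python
from transformers import Seq2SeqTrainingArguments

# Pretraining on the 1M-sample synthetic set (hyperparameters from above).
pretrain_args = Seq2SeqTrainingArguments(
    output_dir="hafith-pretrain",   # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    fp16=True,                      # mixed-precision training
    predict_with_generate=True,
)

# Fine-tuning on the combined benchmark: same setup, lower learning rate.
finetune_args = Seq2SeqTrainingArguments(
    output_dir="hafith-finetune",   # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    fp16=True,
    predict_with_generate=True,
)
```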
## Limitations
- Operates on pre-segmented text lines (requires line segmentation for full pages)
- Trained on modern Arabic vocabulary (may miss some archaic terms)
- Performance degrades on severely damaged manuscripts (>9% CER)
- Maximum line length limited by 512-patch budget
## Comparison with Baselines
| Model | Encoder | Tokenizer | CER | WER |
|-------|---------|-----------|-----|-----|
| CRNN+CTC | CNN | Character-level | 14.82% | - |
| TrOCR-Base | BEiT-B (384×384) | RoBERTa | 13.41% | - |
| TrOCR-Large | BEiT-L (384×384) | RoBERTa | 11.73% | 31.82% |
| HATFormer | BEiT-L (384×384) | RoBERTa | 8.60% | - |
| **HAFITH (Ours)** | **SigLIP2 NaFlex** | **Aranizer** | **5.10%** | **18.05%** |
## Citation
```bibtex
@article{naseif2026hafith,
title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2026}
}
```
## Links
- 📄 **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX)
- 📊 **Benchmark Dataset**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- 🔒 **Synthetic Dataset**: [mdnaseif/hafith-synthetic-1m](https://huggingface.co/datasets/mdnaseif/hafith-synthetic-1m)
- 💻 **Code**: [GitHub](https://github.com/mdnaseif/hafith)
- πŸ’» **Code**: [GitHub](https://github.com/mdnaseif/hafith)
## Contact
- **Lead Author**: Mohammed Naseif (m.nasif@upm.edu.sa)
- **Institution**: University of Prince Mugrin, Medina, Saudi Arabia
- **Issues**: [GitHub Issues](https://github.com/mdnaseif/hafith/issues)
## License
Apache 2.0