---
language:
- ar
license: apache-2.0
tags:
- optical-character-recognition
- historical-manuscripts
- arabic
- vision-language
- document-ai
datasets:
- mdnaseif/hafith-combined-benchmark
- mdnaseif/hafith-synthetic-1m
metrics:
- character-error-rate
- word-error-rate
library_name: transformers
pipeline_tag: image-to-text
---

# HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition

State-of-the-art OCR model for historical Arabic manuscripts achieving **5.10% CER** through native-resolution encoding, Arabic-native tokenization, and synthetic pretraining.

## Model Summary

- **Architecture**: Vision-Language (Encoder-Decoder)
- **Vision Encoder**: SigLIP V2 NaFlex (400M params, preserves aspect ratios up to 20:1)
- **Text Decoder**: RoBERTa-Large (242M params, trained from scratch)
- **Tokenizer**: Aranizer-PBE-64k (64K Arabic vocabulary)
- **Total Parameters**: 642M
- **Training**: 10 days on single RTX 4090
- **Inference Speed**: 12.5 samples/second (~45K lines/hour)

## Performance

| Dataset | CER | WER | CER Reduction vs TrOCR |
|---------|-----|-----|------------------------|
| MUHARAF | 8.35% | 24.76% | 71% |
| KHATT | 11.21% | 37.36% | 37% |
| RASAM | 4.95% | 18.94% | 86% |
| **Combined** | **5.10%** | **18.05%** | **57%** |

**State-of-the-Art**: 41% relative CER reduction over the previous best (HATFormer, 8.60% CER)
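CER is the character-level edit distance between prediction and reference, divided by the reference length. A minimal pure-Python sketch for sanity-checking outputs (the paper's evaluation pipeline may apply additional normalization, e.g. of diacritics):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

print(round(cer("كتاب", "كتب"), 2))  # one deletion over 4 chars -> 0.25
```

WER is computed the same way after splitting both strings on whitespace.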

## Key Innovations

1. **Native-Resolution Encoding**: Preserves aspect ratios (5:1 to 20:1) using SigLIP V2 NaFlex with variable patch counts
2. **Arabic-Native Tokenization**: Aranizer achieves 4:1 compression over character-level approaches
3. **Synthetic Pretraining**: 1M manuscript-style samples across 350 Arabic fonts for from-scratch decoder training
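To illustrate innovation 1, here is a sketch of how a variable patch grid can preserve aspect ratio under a fixed patch budget. The 512-patch budget comes from this card; the 16-pixel patch size and the uniform-downscaling rule are assumptions for illustration, not the exact NaFlex resizing logic:

```python
import math

PATCH = 16     # assumed patch size
BUDGET = 512   # maximum patch count reported for HAFITH

def naflex_grid(h: int, w: int, patch: int = PATCH, budget: int = BUDGET):
    """Pick a patch grid that keeps the image's aspect ratio within the budget."""
    rows, cols = math.ceil(h / patch), math.ceil(w / patch)
    if rows * cols <= budget:
        return rows, cols                       # fits at native resolution
    scale = math.sqrt(budget / (rows * cols))   # shrink both sides equally
    return max(1, int(rows * scale)), max(1, int(cols * scale))

# A 64x1280 text line (20:1 aspect ratio) fits without downscaling:
print(naflex_grid(64, 1280))    # -> (4, 80): 320 patches
# A 128x4096 line exceeds the budget and is scaled down uniformly:
print(naflex_grid(128, 4096))   # -> (4, 128): exactly 512 patches
```

The key contrast with fixed 384×384 encoders is that a 20:1 line keeps its 20:1 patch grid instead of being squashed to a square.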

## Usage

### Installation

```bash
pip install transformers pillow torch
```

### Basic Inference

```python
from transformers import AutoModel, AutoTokenizer
from PIL import Image
import torch

# Load model and tokenizer
model = AutoModel.from_pretrained("mdnaseif/hafith")
tokenizer = AutoTokenizer.from_pretrained("mdnaseif/hafith")

# Load manuscript image
image = Image.open("manuscript_line.jpg")

# Run OCR
with torch.no_grad():
    outputs = model.generate(image, max_length=64)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Recognized text: {text}")
```

### Batch Processing

```python
from datasets import load_dataset

# Load your manuscript dataset
dataset = load_dataset("your_dataset", split="train")

# Process in batches
batch_size = 32
for i in range(0, len(dataset), batch_size):
    batch = dataset[i:i+batch_size]
    images = [img.convert('RGB') for img in batch['image']]
    
    with torch.no_grad():
        outputs = model.generate(images, max_length=64)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    
    for img_id, text in zip(batch['id'], texts):
        print(f"{img_id}: {text}")
```

## Model Architecture

```
Input Image (H×W×3)
    ↓
SigLIP V2 NaFlex Encoder
  - 400M parameters
  - Up to 512 patches (aspect-ratio preserving)
  - Output: 512×1152 embeddings
    ↓
Projection Layer (1152 → 1024)
    ↓
RoBERTa-Large Decoder
  - 24 layers, 16 attention heads
  - Trained from scratch with Aranizer
  - Cross-attention to visual features
    ↓
Aranizer Tokenizer (64K vocab)
    ↓
Arabic Text Output
```
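The projection step in the diagram is a single linear map from the encoder's embedding width to the decoder's hidden size. A shapes-only sketch with NumPy (random weights stand in for the trained layer; the dimensions are the ones stated above):

```python
import numpy as np

# 512 visual tokens of width 1152 from the encoder are projected to the
# decoder's hidden size of 1024 before cross-attention.
num_patches, enc_dim, dec_dim = 512, 1152, 1024

rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((num_patches, enc_dim))
W = rng.standard_normal((enc_dim, dec_dim)) * 0.02  # stand-in projection weights
b = np.zeros(dec_dim)

projected = visual_tokens @ W + b  # sequence the decoder cross-attends to
print(projected.shape)  # (512, 1024)
```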

## Training Details

### Pretraining
- **Data**: 1M synthetic samples (900K train, 50K val, 50K test)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 32
- **Duration**: 8 days on RTX 4090
- **Precision**: Mixed FP16

### Fine-tuning
- **Data**: Combined benchmark (37K train, 2.9K val, 3.4K test)
- **Learning Rate**: 1e-5
- **Duration**: 2 days on RTX 4090

## Limitations

- Operates on pre-segmented text lines (requires line segmentation for full pages)
- Trained on modern Arabic vocabulary (may miss some archaic terms)
- Performance degrades on severely damaged manuscripts (>9% CER)
- Maximum line length limited by 512-patch budget

## Comparison with Baselines

| Model | Encoder | Tokenizer | CER | WER |
|-------|---------|-----------|-----|-----|
| CRNN+CTC | CNN | Character-level | 14.82% | - |
| TrOCR-Base | BEiT-B (384×384) | RoBERTa | 13.41% | - |
| TrOCR-Large | BEiT-L (384×384) | RoBERTa | 11.73% | 31.82% |
| HATFormer | BEiT-L (384×384) | RoBERTa | 8.60% | - |
| **HAFITH (Ours)** | **SigLIP V2 NaFlex** | **Aranizer** | **5.10%** | **18.05%** |

## Citation

```bibtex
@article{naseif2026hafith,
  title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

## Links

- 📄 **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX)
- 📊 **Benchmark Dataset**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- 🔒 **Synthetic Dataset**: [mdnaseif/hafith-synthetic-1m](https://huggingface.co/datasets/mdnaseif/hafith-synthetic-1m)
- 💻 **Code**: [GitHub](https://github.com/mdnaseif/hafith)

## Contact

- **Lead Author**: Mohammed Naseif (m.nasif@upm.edu.sa)
- **Institution**: University of Prince Mugrin, Medina, Saudi Arabia
- **Issues**: [GitHub Issues](https://github.com/mdnaseif/hafith/issues)

## License

Apache 2.0