---
language:
- ar
license: apache-2.0
tags:
- optical-character-recognition
- historical-manuscripts
- arabic
- vision-language
- document-ai
datasets:
- mdnaseif/hafith-combined-benchmark
- mdnaseif/hafith-synthetic-1m
metrics:
- character-error-rate
- word-error-rate
library_name: transformers
pipeline_tag: image-to-text
---

# HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition

State-of-the-art OCR model for historical Arabic manuscripts achieving **5.10% CER** through native-resolution encoding, Arabic-native tokenization, and synthetic pretraining.

## Model Summary

- **Architecture**: Vision-Language (Encoder-Decoder)
- **Vision Encoder**: SigLIP V2 NaFlex (400M params, preserves aspect ratios up to 20:1)
- **Text Decoder**: RoBERTa-Large (242M params, trained from scratch)
- **Tokenizer**: Aranizer-PBE-64k (64K Arabic vocabulary)
- **Total Parameters**: 642M
- **Training**: 10 days on a single RTX 4090
- **Inference Speed**: 12.5 samples/second (~45K lines/hour)

## Performance

| Dataset | CER | WER | Relative Improvement |
|---------|-----|-----|---------------------|
| MUHARAF | 8.35% | 24.76% | -71% vs TrOCR |
| KHATT | 11.21% | 37.36% | -37% vs TrOCR |
| RASAM | 4.95% | 18.94% | -86% vs TrOCR |
| **Combined** | **5.10%** | **18.05%** | **-57% vs TrOCR** |

**State-of-the-Art**: 41% relative improvement over the previous best (HATFormer, 8.60% CER)

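The CER and WER figures above are edit-distance-based metrics. As a reference for how such scores are typically computed (a plain Levenshtein-distance sketch, not the paper's actual evaluation script):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character edit distance over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: same metric over whitespace-delimited tokens."""
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("kitab", "kitap"))  # 0.2 — one substitution over five characters
```

A CER of 5.10% therefore means roughly one character-level edit per twenty reference characters.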
## Key Innovations

1. **Native-Resolution Encoding**: Preserves aspect ratios (5:1 to 20:1) using SigLIP V2 NaFlex with variable patch counts
2. **Arabic-Native Tokenization**: Aranizer achieves 4:1 compression over character-level approaches
3. **Synthetic Pretraining**: 1M manuscript-style samples across 350 Arabic fonts for from-scratch decoder training

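To illustrate innovation 1, here is a hypothetical sketch of NaFlex-style patch budgeting: the line image's patch grid is scaled to fit a fixed budget (512 here) while keeping its proportions, instead of being squashed to a fixed square as in BEiT-based baselines. The patch size of 16 and the helper function are illustrative assumptions, not the model's actual preprocessing code.

```python
import math

def patch_grid(height, width, patch=16, budget=512):
    """Pick a patch grid that keeps the image's aspect ratio within a patch budget."""
    rows = height / patch
    cols = width / patch
    # Shrink both axes by the same factor only if the grid exceeds the budget
    scale = min(1.0, math.sqrt(budget / (rows * cols)))
    # At least one patch per axis; floor to stay within budget
    return max(1, int(rows * scale)), max(1, int(cols * scale))

# A 64×1280 manuscript line (20:1) keeps a tall-thin 4×80 grid under the budget
rows, cols = patch_grid(64, 1280)
print(rows, cols, rows * cols)  # 4 80 320
```

A 384×384 square resize would distort such a line by a factor of 20 along one axis; the variable grid avoids that distortion entirely.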
## Usage

### Installation

```bash
pip install transformers pillow torch
```

### Basic Inference

```python
import torch
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Load model and tokenizer
model = AutoModel.from_pretrained("mdnaseif/hafith")
tokenizer = AutoTokenizer.from_pretrained("mdnaseif/hafith")

# Load a manuscript line image
image = Image.open("manuscript_line.jpg")

# Run OCR
with torch.no_grad():
    outputs = model.generate(image, max_length=64)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Recognized text: {text}")
```

### Batch Processing

```python
import torch
from datasets import load_dataset

# Load your manuscript dataset (line images with an "id" column)
dataset = load_dataset("your_dataset", split="test")

# Process in batches
batch_size = 32
for i in range(0, len(dataset), batch_size):
    batch = dataset[i:i + batch_size]
    images = [img.convert("RGB") for img in batch["image"]]

    with torch.no_grad():
        outputs = model.generate(images, max_length=64)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    for img_id, text in zip(batch["id"], texts):
        print(f"{img_id}: {text}")
```

## Model Architecture

```
Input Image (H×W×3)
        ↓
SigLIP V2 NaFlex Encoder
  - 400M parameters
  - Up to 512 patches (aspect-ratio preserving)
  - Output: 512×1152 embeddings
        ↓
Projection Layer (1152 → 1024)
        ↓
RoBERTa-Large Decoder
  - 24 layers, 16 attention heads
  - Trained from scratch with Aranizer
  - Cross-attention to visual features
        ↓
Aranizer Tokenizer (64K vocab)
        ↓
Arabic Text Output
```

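A minimal sketch of the projection step in the diagram above, bridging the encoder width (1152) to the decoder width (1024). A single linear layer is an assumption for illustration; the released model may use a more elaborate adapter.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Maps SigLIP patch embeddings (dim 1152) into the decoder's space (dim 1024)."""
    def __init__(self, enc_dim=1152, dec_dim=1024):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, 1152) → (batch, num_patches, 1024)
        return self.proj(patch_embeddings)

proj = VisualProjection()
visual = torch.randn(1, 512, 1152)   # up to 512 aspect-ratio-preserving patches
print(proj(visual).shape)  # torch.Size([1, 512, 1024])
```

The projected sequence is what the decoder's cross-attention layers attend to at every generation step.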
## Training Details

### Pretraining
- **Data**: 1M synthetic samples (900K train, 50K val, 50K test)
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5
- **Batch Size**: 32
- **Duration**: 8 days on RTX 4090
- **Precision**: Mixed FP16

### Fine-tuning
- **Data**: Combined benchmark (37K train, 2.9K val, 3.4K test)
- **Learning Rate**: 1e-5
- **Duration**: 2 days on RTX 4090

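The hyperparameters above describe a conventional AdamW setup. A toy sketch of one optimization step under those settings (a tiny dummy model stands in for HAFITH's 642M parameters, and the mixed-FP16 autocast/scaler machinery used in the real runs is omitted for brevity):

```python
import torch
import torch.nn as nn

# Toy stand-in for the encoder-decoder (shapes are illustrative only)
model = nn.Linear(32, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # 1e-5 for fine-tuning
loss_fn = nn.CrossEntropyLoss()

batch_size = 32
inputs = torch.randn(batch_size, 32)           # stands in for projected visual features
targets = torch.randint(0, 10, (batch_size,))  # stands in for Aranizer token ids

# One optimization step; real runs wrap the forward pass in torch.cuda.amp for FP16
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
```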
## Limitations

- Operates on pre-segmented text lines (requires line segmentation for full pages)
- Trained on modern Arabic vocabulary (may miss some archaic terms)
- Performance degrades on severely damaged manuscripts (>9% CER)
- Maximum line length limited by the 512-patch budget

## Comparison with Baselines

| Model | Encoder | Tokenizer | CER | WER |
|-------|---------|-----------|-----|-----|
| CRNN+CTC | CNN | Character-level | 14.82% | - |
| TrOCR-Base | BEiT-B (384×384) | RoBERTa | 13.41% | - |
| TrOCR-Large | BEiT-L (384×384) | RoBERTa | 11.73% | 31.82% |
| HATFormer | BEiT-L (384×384) | RoBERTa | 8.60% | - |
| **HAFITH (Ours)** | **SigLIP2 NaFlex** | **Aranizer** | **5.10%** | **18.05%** |

## Citation

```bibtex
@article{naseif2026hafith,
  title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

## Links

- 📄 **Paper**: [arXiv](https://arxiv.org/abs/XXXX.XXXXX)
- 📊 **Benchmark Dataset**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- 🔢 **Synthetic Dataset**: [mdnaseif/hafith-synthetic-1m](https://huggingface.co/datasets/mdnaseif/hafith-synthetic-1m)
- 💻 **Code**: [GitHub](https://github.com/mdnaseif/hafith)

## Contact

- **Lead Author**: Mohammed Naseif (m.nasif@upm.edu.sa)
- **Institution**: University of Prince Mugrin, Medina, Saudi Arabia
- **Issues**: [GitHub Issues](https://github.com/mdnaseif/hafith/issues)

## License

Apache 2.0