- mergekit
- merge
---

# Baseer-Nakba HTR: A State-of-the-Art VLM for Arabic Handwritten Text Recognition

## Overview

This repository contains the model weights and inference pipeline for our submission to the NAKBA NLP 2026 Arabic Handwritten Text Recognition (HTR) competition.

Our approach adapts the 3B-parameter [Baseer](https://arxiv.org/abs/2509.18174) Vision-Language Model (VLM) to effectively parse and recognize highly cursive, historical Arabic manuscripts. Through a progressive training pipeline, domain-matched data augmentation, and advanced checkpoint merging, this unified model mitigates the challenges of varying writer styles, age-related document degradation, and morphological complexity.

To try our Baseer model for document extraction, please visit [Baseer](https://baseerocr.com/). **Baseer** is the SOTA model on Arabic Document Extraction.

---

## 🏆 Competition Results

Our final model (**Misraj AI**) secured **1st place** on the official Nakba hidden test set [leaderboard](https://www.codabench.org/competitions/12591/).

| Rank | Team | CER | WER |
| :--- | :--- | :--- | :--- |
| 🥇 1st | **Misraj AI** | **0.0790** | **0.2440** |
| 🥈 2nd | Oblevit | 0.0925 | 0.3268 |
| 🥉 3rd | 3reeq | 0.0938 | 0.2996 |
| 4th | Latent Narratives | 0.1050 | 0.3106 |
| 5th | Al-Warraq | 0.1142 | 0.3780 |
| 6th | Not Gemma | 0.1217 | 0.3063 |
| 7th | NAMAA-Qari | 0.1950 | 0.5194 |
| 8th | Fahras | 0.2269 | 0.5223 |
| – | Baseline | 0.3683 | 0.6905 |
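
The CER and WER figures above are normalized edit distances over characters and over whitespace-separated words, respectively. A minimal sketch of how such metrics are computed (illustrative only; the official Codabench scorer may apply additional text normalization):

```python
def levenshtein(ref, hyp):
    """Minimum insertions, deletions, and substitutions to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character edit distance / reference length."""
    return levenshtein(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word Error Rate: word edit distance / reference word count."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())
```

For Arabic, normalization choices (diacritics, alef and ya variants) strongly affect both metrics, so compare numbers only under the same scoring script.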

---

## Training Methodology

Our model was trained using a multi-stage Supervised Fine-Tuning (SFT) curriculum:

1. **Data Augmentation**: The Muharaf enhancement dataset was converted to grayscale to match the visual complexity and tonal distribution of the Nakba competition data.
2. **Decoder-Only SFT**: We first trained the text decoder autoregressively on the structurally similar Muharaf dataset to condition the language modeling head.
3. **Full Encoder-Decoder Tuning**: We subsequently unfroze the vision encoder and trained the full architecture on the Nakba dataset using differential learning rates, a key step that yielded a >5% improvement in WER over decoder-only tuning.
4. **Checkpoint Merging**: To stabilize predictions and maximize generalization, we merged our top-performing checkpoints (Epoch 1 and Epoch 5) using SLERP interpolation.
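
The grayscale conversion in step 1 amounts to the standard luma transform. A pure-Python sketch (illustrative; a real pipeline would more likely call `PIL.Image.convert("L")`, which applies the same ITU-R BT.601 weights):

```python
def to_grayscale(rgb_pixels):
    """Map RGB pixels to single-channel 8-bit intensity (ITU-R BT.601 luma weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]
```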

---

## Training Hyperparameters

All supervised experiments were conducted with standardized hyperparameters across configurations.

| Parameter | Value |
| :--- | :--- |
| **Hardware** | 2× NVIDIA H100 GPUs |
| **Base Model** | 3B-parameter Baseer |
| **Epochs** | 5 |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **Learning Rate Schedule** | Cosine |
| **Batch Size** | 128 |
| **Max Sequence Length** | 1200 tokens |
| **Input Image Resolution** | 644 × 644 pixels |
| **Decoder-Only Learning Rate** | 1e-4 |
| **Encoder Learning Rate** | 9e-6 |
| **Decoder Learning Rate (Full Tuning)** | 1e-4 |
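
The cosine schedule decays each learning rate smoothly toward zero over training. A minimal sketch (warmup omitted; `total_steps` is hypothetical and depends on dataset size at a batch size of 128):

```python
import math

def cosine_lr(step, total_steps, peak_lr):
    """Cosine annealing: peak_lr at step 0, decaying to 0 at total_steps."""
    return 0.5 * peak_lr * (1 + math.cos(math.pi * step / total_steps))
```

Under this schedule the decoder rate (1e-4) and the encoder rate (9e-6) decay in lockstep, so their ratio is preserved throughout full encoder-decoder tuning.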

---

## Image Examples

The model works reliably on images from the Nakba dataset and visually similar historical manuscripts.

![Nakba manuscript sample 1](1111.jpg)
![Nakba manuscript sample 2](2222.jpg)
![Nakba manuscript sample 3](33333.jpg)

---

## Merge Method

This model was merged using the [SLERP](https://en.wikipedia.org/wiki/Slerp) merge method.

### Models Merged

- `Baseer_Nakba_ep_1`
- `Baseer_Nakba_ep_5`

### Configuration

The following YAML configuration was used to produce this model:

```yaml
merge_method: slerp
base_model: Baseer_Nakba_ep_1
models:
  - model: Baseer_Nakba_ep_1
  - model: Baseer_Nakba_ep_5
parameters:
  t:
    - value: 0.50
dtype: bfloat16
```
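
SLERP interpolates along the great circle between two weight tensors rather than along the straight line, which preserves parameter norm better than naive averaging. A minimal sketch of the per-tensor operation on flat Python lists (mergekit applies the equivalent across the full checkpoint state dicts):

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight vectors at fraction t."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))        # guard acos against rounding
    omega = math.acos(dot)                # angle between the two vectors
    if omega < eps:                       # nearly parallel: fall back to LERP
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

With `t = 0.50` as in the configuration above, the merged model sits midway on the arc between the Epoch 1 and Epoch 5 checkpoints.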

---

## Citation

If you use this model or find our work helpful, please consider citing our paper:

```bibtex
@inproceedings{misrajai2026nakba,
  title     = {Adapting Vision-Language Models for Historical Arabic Handwritten Text Recognition},
  author    = {Misraj AI},
  booktitle = {Nakba OCR Competition, NLP 2026},
  year      = {2026}
}
```

---

## Links

- 🤗 Model weights: [Misraj/Baseer__Nakba](https://huggingface.co/Misraj/Baseer__Nakba)
- 💻 Inference pipeline: [misraj-ai/Nakba-pipeline](https://github.com/misraj-ai/Nakba-pipeline)
- 🌐 Live demo: [baseerocr.com](https://baseerocr.com/)
- 📊 Competition: [Nakba Codabench](https://www.codabench.org/competitions/12591/)