muhammad0-0hreden committed on
Commit de11c1d · verified · 1 Parent(s): 7817d84

Update README.md

Files changed (1):
  1. README.md +77 -36
- mergekit
- merge
---

# Baseer-Nakba HTR: A State-of-the-Art VLM for Arabic Handwritten Text Recognition

## Overview

This repository contains the model weights and inference pipeline for our submission to the NAKBA NLP 2026 Arabic Handwritten Text Recognition (HTR) competition.

Our approach adapts the 3B-parameter [Baseer](https://arxiv.org/abs/2509.18174) Vision-Language Model (VLM) to effectively parse and recognize highly cursive, historical Arabic manuscripts. Through a progressive training pipeline, domain-matched data augmentation, and advanced checkpoint merging, this unified model mitigates the challenges of varying writer styles, age-related document degradation, and morphological complexity.

To try our Baseer model for document extraction, please visit [baseerocr.com](https://baseerocr.com/). **Baseer** is a state-of-the-art model for Arabic document extraction.

---

## 🏆 Competition Results

Our final model (**Misraj AI**) secured **1st place** on the official Nakba hidden test set [leaderboard](https://www.codabench.org/competitions/12591/).

| Rank | Team | CER | WER |
| :--- | :--- | :--- | :--- |
| 🥇 1st | **Misraj AI** | **0.0790** | **0.2440** |
| 🥈 2nd | Oblevit | 0.0925 | 0.3268 |
| 🥉 3rd | 3reeq | 0.0938 | 0.2996 |
| 4th | Latent Narratives | 0.1050 | 0.3106 |
| 5th | Al-Warraq | 0.1142 | 0.3780 |
| 6th | Not Gemma | 0.1217 | 0.3063 |
| 7th | NAMAA-Qari | 0.1950 | 0.5194 |
| 8th | Fahras | 0.2269 | 0.5223 |
| — | Baseline | 0.3683 | 0.6905 |
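For reference, the CER and WER scores above are Levenshtein edit distances normalized by reference length, computed at the character and word level respectively. A minimal, dependency-free sketch of how such metrics are typically computed (the competition's exact scoring script may differ, e.g. in text normalization):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character-level edit distance / reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(len(r), 1)
```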

---

## Training Methodology

Our model was trained using a multi-stage Supervised Fine-Tuning (SFT) curriculum.

1. **Data Augmentation**: The Muharaf enhancement dataset was converted to grayscale to match the visual complexity and tonal distribution of the Nakba competition data.
2. **Decoder-Only SFT**: We first trained the text decoder autoregressively on the structurally similar Muharaf dataset to condition the language modeling head.
3. **Full Encoder-Decoder Tuning**: We subsequently unfroze the vision encoder and trained the full architecture on the Nakba dataset using differential learning rates, a key step that yielded a >5% improvement in WER over decoder-only tuning.
4. **Checkpoint Merging**: To stabilize predictions and maximize generalization, we merged our top-performing checkpoints (Epoch 1 and Epoch 5) using SLERP interpolation.
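The grayscale conversion in step 1 amounts to a standard luma transform applied per pixel. A minimal, dependency-free sketch, assuming ITU-R BT.601 weights (the card does not specify the exact conversion used):

```python
def to_grayscale(pixels):
    """Map RGB triples to single-channel values using BT.601 luma weights.

    This is what e.g. PIL's convert("L") does under the hood; here it is
    written out explicitly for a flat list of (r, g, b) tuples.
    """
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in pixels]
```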

---

## Training Hyperparameters

All supervised experiments used standardized hyperparameters across configurations.

| Parameter | Value |
| :--- | :--- |
| **Hardware** | 2× NVIDIA H100 GPUs |
| **Base Model** | 3B-parameter Baseer |
| **Epochs** | 5 |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **Learning Rate Schedule** | Cosine |
| **Batch Size** | 128 |
| **Max Sequence Length** | 1200 tokens |
| **Input Image Resolution** | 644 × 644 pixels |
| **Decoder-Only Learning Rate** | 1e-4 |
| **Encoder Learning Rate** | 9e-6 |
| **Decoder Learning Rate (Full Tuning)** | 1e-4 |
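The cosine schedule in the table anneals the learning rate from its base value toward a floor over training. A minimal sketch of the standard formula, assuming no warmup phase (the card does not specify one):

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate: base_lr at step 0, min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With `base_lr=1e-4` (the decoder learning rate above), the rate falls to half its base value at the midpoint of training.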

---

## Image Examples

The model works reliably on images from the Nakba dataset and visually similar historical manuscripts.

![image (1)](https://cdn-uploads.huggingface.co/production/uploads/65276c7911a8a521c91bc10f/MtU8b_IZ1_kbiwg3BISDg.jpeg)
![image (2)](https://cdn-uploads.huggingface.co/production/uploads/65276c7911a8a521c91bc10f/bmzC1F1rJz52ljDo0LbOY.jpeg)
![image (3)](https://cdn-uploads.huggingface.co/production/uploads/65276c7911a8a521c91bc10f/LNvoN4NkaVJ8zgUqzG8bm.jpeg)

---

## Merge Method

This model was merged using the [SLERP](https://en.wikipedia.org/wiki/Slerp) merge method.

### Models Merged

- `Baseer_Nakba_ep_1`
- `Baseer_Nakba_ep_5`

### Configuration

The following YAML configuration was used to produce this model:

```yaml
merge_method: slerp
base_model: Baseer_Nakba_ep_1
models:
  - model: Baseer_Nakba_ep_1
  - model: Baseer_Nakba_ep_5
parameters:
  t:
    - value: 0.50
dtype: bfloat16
```
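With two checkpoints and `t: 0.50`, SLERP interpolates each pair of flattened weight tensors along the arc between them rather than along the straight chord that plain averaging follows. A minimal sketch of the underlying formula (not the mergekit implementation, which handles per-layer schedules and edge cases):

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight vectors (plain lists)."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))   # clamp for numerical safety
    omega = math.acos(dot)           # angle between the two vectors
    if omega < eps:                  # nearly parallel: fall back to lerp
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

At `t = 0.5` the result preserves the norm of unit-length inputs, which is the usual motivation for SLERP over naive weight averaging.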

---

## Citation

If you use this model or find our work helpful, please consider citing our paper:

```bibtex
@inproceedings{misrajai2026nakba,
  title     = {Adapting Vision-Language Models for Historical Arabic Handwritten Text Recognition},
  author    = {Misraj AI},
  booktitle = {Nakba OCR Competition, NLP 2026},
  year      = {2026}
}
```

---

## Links

- 🤗 Model weights: [Misraj/Baseer__Nakba](https://huggingface.co/Misraj/Baseer__Nakba)
- 💻 Inference pipeline: [misraj-ai/Nakba-pipeline](https://github.com/misraj-ai/Nakba-pipeline)
- 🌐 Live demo: [baseerocr.com](https://baseerocr.com/)
- 📄 Competition: [Nakba Codabench](https://www.codabench.org/competitions/12591/)