| --- |
| base_model: [Qwen2.5VL] |
| library_name: transformers |
| tags: |
| - mergekit |
| - merge |
| --- |
| # Baseer-Nakba HTR: A State-of-the-Art VLM for Arabic Handwritten Text Recognition |
|
|
| ## Overview |
|
|
| This repository contains the model weights and inference pipeline for our submission to the NAKBA NLP 2026 Arabic Handwritten Text Recognition (HTR) competition. |
|
|
| Our approach adapts the 3B-parameter [Baseer](https://arxiv.org/abs/2509.18174) Vision-Language Model (VLM) to effectively parse and recognize highly cursive, historical Arabic manuscripts. Through a progressive training pipeline, domain-matched data augmentation, and advanced checkpoint merging, this unified model mitigates the challenges of varying writer styles, age-related document degradation, and morphological complexity. |
|
|
| To try Baseer for document extraction, visit [baseerocr.com](https://baseerocr.com/). **Baseer** is the state-of-the-art model for Arabic document extraction. |
|
|
| --- |
|
|
| ## 🏆 Competition Results |
|
|
| Our final model (**Misraj AI**) secured **1st place** on the official Nakba hidden test set [leaderboard](https://www.codabench.org/competitions/12591/). |
|
|
| | Rank | Team | CER | WER | |
| | :--- | :--- | :--- | :--- | |
| | 🥇 1st | **Misraj AI** | **0.0790** | **0.2440** | |
| | 🥈 2nd | Oblevit | 0.0925 | 0.3268 | |
| | 🥉 3rd | 3reeq | 0.0938 | 0.2996 | |
| | 4th | Latent Narratives | 0.1050 | 0.3106 | |
| | 5th | Al-Warraq | 0.1142 | 0.3780 | |
| | 6th | Not Gemma | 0.1217 | 0.3063 | |
| | 7th | NAMAA-Qari | 0.1950 | 0.5194 | |
| | 8th | Fahras | 0.2269 | 0.5223 | |
| | – | Baseline | 0.3683 | 0.6905 | |
|
|
| --- |
|
|
| ## Training Methodology |
|
|
| Our model was trained using a multi-stage Supervised Fine-Tuning (SFT) curriculum. |
|
|
| 1. **Data Augmentation**: The Muharaf enhancement dataset was converted to grayscale to match the visual complexity and tonal distribution of the Nakba competition data. |
| 2. **Decoder-Only SFT**: We first trained the text decoder autoregressively on the structurally similar Muharaf dataset to condition the language modeling head. |
| 3. **Full Encoder-Decoder Tuning**: We subsequently unfroze the vision encoder and trained the full architecture on the Nakba dataset using differential learning rates, a key step that yielded a >5% improvement in WER over decoder-only tuning. |
| 4. **Checkpoint Merging**: To stabilize predictions and maximize generalization, we merged our top-performing checkpoints (Epoch 1 and Epoch 5) using SLERP interpolation. |
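
The differential learning rates in step 3 can be expressed as AdamW parameter groups. A minimal PyTorch sketch (the `nn.Linear` modules are toy stand-ins for the actual vision encoder and text decoder, which are far larger):

```python
import torch
from torch import nn

# Toy placeholders for Baseer's vision encoder and text decoder.
vision_encoder = nn.Linear(8, 8)
text_decoder = nn.Linear(8, 8)

# Differential learning rates: a small LR for the pretrained vision
# encoder (9e-6) and a larger one for the text decoder (1e-4).
optimizer = torch.optim.AdamW(
    [
        {"params": vision_encoder.parameters(), "lr": 9e-6},
        {"params": text_decoder.parameters(), "lr": 1e-4},
    ],
    weight_decay=0.01,
)
```

The encoder's small learning rate preserves its pretrained visual features while the decoder adapts more aggressively to the Nakba text distribution.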
|
|
| --- |
|
|
| ## Training Hyperparameters |
|
|
| All supervised fine-tuning experiments used the standardized hyperparameters below. |
|
|
| | Parameter | Value | |
| | :--- | :--- | |
| | **Hardware** | 2Γ NVIDIA H100 GPUs | |
| | **Base Model** | 3B-parameter Baseer | |
| | **Epochs** | 5 | |
| | **Optimizer** | AdamW | |
| | **Weight Decay** | 0.01 | |
| | **Learning Rate Schedule** | Cosine | |
| | **Batch Size** | 128 | |
| | **Max Sequence Length** | 1200 tokens | |
| | **Input Image Resolution** | 644 Γ 644 pixels | |
| | **Decoder-Only Learning Rate** | 1e-4 | |
| | **Encoder Learning Rate** | 9e-6 | |
| | **Decoder Learning Rate (Full Tuning)** | 1e-4 | |
|
|
| --- |
|
|
| ## Image Examples |
|
|
| The model works reliably on images from the Nakba dataset and visually similar historical manuscripts. |
|
|
|  |
|  |
|  |
|
|
| --- |
|
|
| ## Merge Method |
|
|
| This model was merged using the [SLERP](https://en.wikipedia.org/wiki/Slerp) merge method. |
|
|
| ### Models Merged |
|
|
| - `Baseer_Nakba_ep_1` |
| - `Baseer_Nakba_ep_5` |
|
|
| ### Configuration |
|
|
| ```yaml |
| merge_method: slerp |
| base_model: Baseer_Nakba_ep_1 |
| models: |
| - model: Baseer_Nakba_ep_1 |
| - model: Baseer_Nakba_ep_5 |
| parameters: |
| t: |
| - value: 0.50 |
| dtype: bfloat16 |
| ``` |
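
Conceptually, SLERP with `t: 0.50` interpolates each pair of weight tensors along the arc between them rather than along the straight chord, which better preserves weight norms. A minimal pure-Python sketch of the per-tensor operation (tensors flattened to plain lists for simplicity; mergekit's implementation handles the real tensor layout):

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between flat weight vectors v0 and v1."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    if abs(math.sin(theta)) < eps:  # nearly parallel vectors: fall back to lerp
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Halfway between two orthogonal unit vectors stays on the unit sphere.
merged = slerp(0.5, [1.0, 0.0], [0.0, 1.0])  # ≈ [0.7071, 0.7071]
```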
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this model or find our work helpful, please consider citing our paper: |
|
|
| ```bibtex |
| @inproceedings{misrajai2026nakba, |
| title = {Adapting Vision-Language Models for Historical Arabic Handwritten Text Recognition}, |
| author = {Misraj AI}, |
| booktitle = {Nakba OCR Competition, NLP 2026}, |
| year = {2026} |
| } |
| ``` |
|
|
| --- |
|
|
| ## Links |
|
|
| - 🤗 Model weights: [Misraj/Baseer__Nakba](https://huggingface.co/Misraj/Baseer__Nakba) |
| - 💻 Inference pipeline: [misraj-ai/Nakba-pipeline](https://github.com/misraj-ai/Nakba-pipeline) |
| - 🌐 Live demo: [baseerocr.com](https://baseerocr.com/) |
| - 🏆 Competition: [Nakba Codabench](https://www.codabench.org/competitions/12591/) |