| --- |
| base_model: [Qwen2.5VL] |
| library_name: transformers |
| tags: |
| - mergekit |
| - merge |
| --- |
| # Baseer-Nakba HTR: A State-of-the-Art VLM for Arabic Handwritten Text Recognition |
|
|
| ## Overview |
|
|
| This repository contains the model weights and inference pipeline for our submission to the NAKBA NLP 2026 Arabic Handwritten Text Recognition (HTR) competition. |
|
|
| Our approach adapts the 3B-parameter [Baseer](https://arxiv.org/abs/2509.18174) Vision-Language Model (VLM) to effectively parse and recognize highly cursive, historical Arabic manuscripts. Through a progressive training pipeline, domain-matched data augmentation, and advanced checkpoint merging, this unified model mitigates the challenges of varying writer styles, age-related document degradation, and morphological complexity. |
|
|
| To try Baseer for document extraction, visit [baseerocr.com](https://baseerocr.com/). **Baseer** is the state-of-the-art model for Arabic document extraction. |
|
|
| --- |
|
|
| ## 🏆 Competition Results |
|
|
| Our final model (**Misraj AI**) secured **1st place** on the official Nakba hidden test set [leaderboard](https://www.codabench.org/competitions/12591/). |
|
|
| | Rank | Team | CER | WER | |
| | :--- | :--- | :--- | :--- | |
| | 🥇 1st | **Misraj AI** | **0.0790** | **0.2440** | |
| | 🥈 2nd | Oblevit | 0.0925 | 0.3268 | |
| | 🥉 3rd | 3reeq | 0.0938 | 0.2996 | |
| | 4th | Latent Narratives | 0.1050 | 0.3106 | |
| | 5th | Al-Warraq | 0.1142 | 0.3780 | |
| | 6th | Not Gemma | 0.1217 | 0.3063 | |
| | 7th | NAMAA-Qari | 0.1950 | 0.5194 | |
| | 8th | Fahras | 0.2269 | 0.5223 | |
| | – | Baseline | 0.3683 | 0.6905 | |
|
|
| --- |
|
|
| ## Training Methodology |
|
|
| Our model was trained using a multi-stage Supervised Fine-Tuning (SFT) curriculum. |
|
|
| 1. **Data Augmentation**: The Muharaf enhancement dataset was converted to grayscale to match the visual complexity and tonal distribution of the Nakba competition data. |
| 2. **Decoder-Only SFT**: We first trained the text decoder autoregressively on the structurally similar Muharaf dataset to condition the language modeling head. |
| 3. **Full Encoder-Decoder Tuning**: We subsequently unfroze the vision encoder and trained the full architecture on the Nakba dataset using differential learning rates, a key step that yielded a >5% improvement in WER over decoder-only tuning. |
| 4. **Checkpoint Merging**: To stabilize predictions and maximize generalization, we merged our top-performing checkpoints (Epoch 1 and Epoch 5) using SLERP interpolation. |
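
The differential learning rates in step 3 can be expressed as AdamW parameter groups. A minimal PyTorch sketch (the `nn.Linear` modules are toy stand-ins for the actual vision encoder and text decoder, which are far larger):

```python
import torch
from torch import nn

# Toy placeholders for Baseer's vision encoder and text decoder.
vision_encoder = nn.Linear(8, 8)
text_decoder = nn.Linear(8, 8)

# Differential learning rates: a small LR for the pretrained vision
# encoder (9e-6) and a larger one for the text decoder (1e-4).
optimizer = torch.optim.AdamW(
    [
        {"params": vision_encoder.parameters(), "lr": 9e-6},
        {"params": text_decoder.parameters(), "lr": 1e-4},
    ],
    weight_decay=0.01,
)
```

The encoder's small learning rate preserves its pretrained visual features while the decoder adapts more aggressively to the Nakba text distribution.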
|
|
| --- |
|
|
| ## Training Hyperparameters |
|
|
| All supervised fine-tuning experiments used the standardized hyperparameters below. |
|
|
| | Parameter | Value | |
| | :--- | :--- | |
| | **Hardware** | 2Γ NVIDIA H100 GPUs | |
| | **Base Model** | 3B-parameter Baseer | |
| | **Epochs** | 5 | |
| | **Optimizer** | AdamW | |
| | **Weight Decay** | 0.01 | |
| | **Learning Rate Schedule** | Cosine | |
| | **Batch Size** | 128 | |
| | **Max Sequence Length** | 1200 tokens | |
| | **Input Image Resolution** | 644 Γ 644 pixels | |
| | **Decoder-Only Learning Rate** | 1e-4 | |
| | **Encoder Learning Rate** | 9e-6 | |
| | **Decoder Learning Rate (Full Tuning)** | 1e-4 | |
|
|
| --- |
|
|
| ## Image Examples |
|
|
| The model works reliably on images from the Nakba dataset and visually similar historical manuscripts. |
|
|
|  |
|  |
|  |
|
|
| --- |
|
|
| ## Merge Method |
|
|
| This model was merged using the [SLERP](https://en.wikipedia.org/wiki/Slerp) merge method. |
|
|
| ### Models Merged |
|
|
| - `Baseer_Nakba_ep_1` |
| - `Baseer_Nakba_ep_5` |
|
|
| ### Configuration |
|
|
| ```yaml |
| merge_method: slerp |
| base_model: Baseer_Nakba_ep_1 |
| models: |
| - model: Baseer_Nakba_ep_1 |
| - model: Baseer_Nakba_ep_5 |
| parameters: |
| t: |
| - value: 0.50 |
| dtype: bfloat16 |
| ``` |
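
Conceptually, SLERP with `t: 0.50` interpolates each pair of weight tensors along the arc between them rather than along the straight chord, which better preserves weight norms. A minimal pure-Python sketch of the per-tensor operation (tensors flattened to plain lists for simplicity; mergekit's implementation handles the real tensor layout):

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between flat weight vectors v0 and v1."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    if abs(math.sin(theta)) < eps:  # nearly parallel vectors: fall back to lerp
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Halfway between two orthogonal unit vectors stays on the unit sphere.
merged = slerp(0.5, [1.0, 0.0], [0.0, 1.0])  # ≈ [0.7071, 0.7071]
```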
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this model or find our work helpful, please consider citing our paper: |
|
|
| ```bibtex |
| @inproceedings{misrajai2026nakba, |
| title = {Adapting Vision-Language Models for Historical Arabic Handwritten Text Recognition}, |
| author = {Misraj AI}, |
| booktitle = {Nakba OCR Competition, NLP 2026}, |
| year = {2026} |
| } |
| ``` |
|
|
| --- |
|
|
| ## Links |
|
|
| - 🤗 Model weights: [Misraj/Baseer__Nakba](https://huggingface.co/Misraj/Baseer__Nakba) |
| - 💻 Inference pipeline: [misraj-ai/Nakba-pipeline](https://github.com/misraj-ai/Nakba-pipeline) |
| - 🌐 Live demo: [baseerocr.com](https://baseerocr.com/) |
| - 🏆 Competition: [Nakba Codabench](https://www.codabench.org/competitions/12591/) |