Ketaba-OCR at NakbaNLP 2026 Shared Task: Efficient Adaptation of Vision-Language Models for Handwritten Text Recognition


This repository contains the official models and results for Ketaba-OCR, the 1st-place submission to the NakbaNLP 2026 Shared Task on Arabic Manuscript Understanding (Subtask 2: Systems Track).

By: Hassan Barmandah, Fatimah Emad Eldin, Khloud Al Jallad, Omar Nacer



Model Description

This project introduces a parameter-efficient approach for Arabic handwritten text recognition (HTR) on historical manuscripts. The system is built upon Sherif's pretrained Arabic-English HTR model, which leverages prior training on diverse handwritten datasets including Kitab and IAM. Rather than training from scratch, we fine-tune the HTR backbone using Low-Rank Adaptation (LoRA) with 4-bit quantization (QLoRA), along with DoRA and RSLoRA for improved training stability.

A key element of this system is its ensemble strategy using a novel Linear+Boost weighted voting scheme. This approach proved to be highly effective, achieving 1st place on the official leaderboard with a Character Error Rate (CER) of 0.0819 and Word Error Rate (WER) of 0.2588 on the blind test set.

The model transcribes cropped line images from Arabic manuscripts into machine-readable text, specifically optimized for the Omar Al-Saleh Memoir Collection (1951-1965) written in Ruq'ah and Naskh script variants.

Key Contributions

  • Ranking & Performance: Secured 1st place on the official leaderboard with CER 0.082 and WER 0.259
  • HTR vs. Generalist VLMs: Demonstrated that specialized fine-tuned HTR models drastically outperform zero-shot generalist VLMs
  • Parameter Efficiency: QLoRA efficiently bridged the domain gap, reducing CER from 0.58 to 0.08 with minimal computational overhead (~8GB VRAM)
  • Ensemble Innovation: Linear+Boost weighting strategy improved CER by 7.4% over standard inverse-CER weighting

πŸš€ How to Use

You can use the fine-tuned model directly with the transformers and peft libraries. The following example demonstrates inference on a manuscript line image.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image
from qwen_vl_utils import process_vision_info

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "sherif1313/Arabic-English-handwritten-OCR-v3",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "HassanB4/Ketab-OCR-LoRA")

# Apply weight tying fix (critical for correct output)
model.lm_head.weight = model.model.language_model.embed_tokens.weight

# Load processor
processor = AutoProcessor.from_pretrained(
    "sherif1313/Arabic-English-handwritten-OCR-v3",
    trust_remote_code=True
)

# Example inference
image = Image.open("manuscript_line.png").convert("RGB")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "اقرأ النص الموجود في الصورة:"}  # "Read the text in the image:"
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
transcription = processor.decode(output_ids[0][len(inputs['input_ids'][0]):], skip_special_tokens=True)
print(transcription)
```

βš™οΈ Training Procedure

The system employs QLoRA fine-tuning of a specialized pretrained HTR model, rather than training a general-purpose VLM from scratch.

Training Data

The model was fine-tuned on the official NakbaNLP 2026 dataset from the Omar Al-Saleh Memoir Collection. We trained on the full available dataset (train + dev test combined):

| Split | Samples | Description |
|---|---|---|
| Training (Used) | 18,057 | Train (15,962) + Dev Test (2,095) combined |
| Blind Test | 2,671 | Held-out for official CodaBench evaluation |

Hyperparameters

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Base Model | sherif1313/Arabic-English-OCR-v3 | Architecture | Qwen2.5-VL-3B |
| Model Size | ~4.07B parameters | Trainable Params | 75.6M (1.97%) |
| Quantization | 4-bit NF4 (QLoRA) | Compute Dtype | bfloat16 |
| Double Quant | True | Pretraining Data | Kitab, IAM, Custom |
| LoRA Rank (r) | 32 | LoRA Alpha (α) | 64 |
| Target Modules | q, k, v, o, gate, up, down | LoRA Dropout | 0.05 |
| DoRA | True | RSLoRA | True |
| Learning Rate | 2×10⁻⁵ | Optimizer | AdamW (fused) |
| LR Scheduler | Cosine | Warmup Steps | 200 |
| Batch Size | 1 (per GPU) | Gradient Accumulation | 4 |
| Effective Batch | 4 | Number of Epochs | 1 |
| Max Gradient Norm | 1.0 | Weight Decay | 0.01 |
| Max Sequence Length | 2048 | Max Image Size | 1024 |
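
The RSLoRA setting changes the adapter scaling from the standard LoRA factor α/r to α/√r, which keeps the effective update magnitude from shrinking at higher ranks. A minimal sketch of the difference using the r = 32, α = 64 values from the table (the scaling formulas come from the LoRA and rank-stabilized LoRA papers, not from this repository's code):

```python
import math

def lora_scale(alpha: float, r: int, rslora: bool = False) -> float:
    """Scaling factor applied to the low-rank update BA before it is
    added to the frozen base weight: W' = W + scale * (B @ A)."""
    return alpha / math.sqrt(r) if rslora else alpha / r

# Settings from the hyperparameter table: r = 32, alpha = 64
standard = lora_scale(64, 32)                  # alpha / r
rank_stable = lora_scale(64, 32, rslora=True)  # alpha / sqrt(r)

print(f"standard LoRA scale: {standard:.2f}")
print(f"rsLoRA scale:        {rank_stable:.2f}")
```

At rank 32 the rank-stabilized factor is more than five times larger, so gradients flowing through the adapter stay usefully scaled despite the relatively high rank.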

Ensemble Strategy

Our final submission employs a Linear+Boost weighted ensemble (Config 18) combining predictions from six model variants:

```python
# Config 18: Linear+Boost<0.15
weights = normalize((1 - CER) + (CER < 0.15) * 0.5)
```

This applies a linear decay based on CER, plus a bonus weight of 0.5 for models with CER below 0.15 (rewarding the top 2 performers).
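
As a runnable sketch, the Config 18 weighting can be written in plain Python as follows (the six CER values below are illustrative placeholders, not the actual per-model dev scores):

```python
def linear_boost_weights(cers, threshold=0.15, bonus=0.5):
    """Config 18: linear decay (1 - CER) plus a flat bonus for models
    whose CER falls below the threshold, normalized to sum to 1."""
    raw = [(1.0 - c) + (bonus if c < threshold else 0.0) for c in cers]
    total = sum(raw)
    return [w / total for w in raw]

# Illustrative dev-set CERs for six model variants (placeholders)
cers = [0.09, 0.10, 0.16, 0.18, 0.21, 0.25]
weights = linear_boost_weights(cers)
print([round(w, 3) for w in weights])
```

With these placeholder scores, the two sub-0.15 models jointly receive close to half of the total vote, which is the intended "boost" behavior.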

The ensemble algorithm uses:

  1. Weighted Majority Voting: Predictions exceeding 50% weighted consensus are selected directly
  2. Arabic Normalization: For disagreements, normalize alef variants and teh marbuta before voting
  3. N-gram Consistency: Score predictions by 3-gram overlap with other models
  4. Edit Distance Consensus: Final tie-breaking uses minimum average edit distance
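
Step 1 above can be sketched as a weighted vote over the per-line predictions; the function below is a minimal illustration (the function name, threshold handling, and example inputs are ours, not the exact competition code):

```python
from collections import defaultdict

def weighted_majority(predictions, weights, threshold=0.5):
    """Return a prediction whose total weight exceeds the consensus
    threshold; otherwise return None so the later stages (Arabic
    normalization, n-gram scoring, edit-distance consensus) break the tie."""
    scores = defaultdict(float)
    for pred, w in zip(predictions, weights):
        scores[pred] += w
    best, score = max(scores.items(), key=lambda kv: kv[1])
    return best if score > threshold else None

# Six model variants voting on one line (illustrative weights)
preds = ["نص", "نص", "نصّ", "نص", "نصا", "نص"]
weights = [0.24, 0.23, 0.14, 0.14, 0.13, 0.12]
print(weighted_majority(preds, weights))
```

Returning `None` rather than forcing a choice is what lets the pipeline fall through to steps 2-4 only when the weighted consensus is genuinely ambiguous.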

Frameworks

  • PyTorch 2.5.0
  • Hugging Face Transformers β‰₯4.45.0
  • PEFT β‰₯0.14.0
  • bitsandbytes β‰₯0.43.0
  • Flash Attention 2.8.3

πŸ“Š Evaluation Results

The models were evaluated on the blind test set provided by the NakbaNLP 2026 organizers. The primary metric is Character Error Rate (CER), computed as normalized Levenshtein distance.
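
For reference, CER as normalized Levenshtein distance can be computed with a standard dynamic-programming edit distance (a generic implementation for illustration, not the organizers' exact scoring script):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        curr = [i]
        for j, hc in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (rc != hc),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

print(cer("كتاب", "كتاب"))  # identical strings
print(cer("كتاب", "كناب"))  # one substitution over four characters
```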

Final Test Set Scores

| System | Test CER | Test WER | Blind CER | Blind WER |
|---|---|---|---|---|
| Organizer Baseline | 0.584 | 0.881 | 0.591 | 0.885 |
| Zero-Shot HTR (Qwen2.5-VL) | 0.169 | 0.499 | 0.203 | 0.503 |
| Fine-Tuned HTR (Single Model) | 0.081 | 0.115 | 0.088 | 0.270 |
| Ketaba-OCR + Ensemble (Ours) | — | — | 0.0819 | 0.2588 |

Comparison with Other Models

| Model | Blind CER | Blind WER |
|---|---|---|
| Ketaba-OCR (Ours) | 0.0819 | 0.2588 |
| Fine-Tuned QARI-3 | 0.2635 | 0.5521 |
| Arabic OCR 4-bit (Sherif) | 0.3234 | 0.6203 |
| Qwen2.5-VL-7B (Zero-Shot) | 0.6808 | 0.9198 |
| Qwen2.5-VL-3B (Zero-Shot) | 0.6213 | 0.8628 |

⚠️ Limitations

  • Domain Specificity: Optimized for 1950s Ruq'ah/Naskh manuscripts; requires adaptation for other periods/styles
  • Agglutination Gap: WER (0.26) is disproportionately higher than CER (0.08) due to Arabic's agglutinative structure
  • Degraded Images: Performance degrades on severely faded or damaged manuscript regions
  • Generalization: Not tested on other historical Arabic manuscript collections

πŸ™ Acknowledgements

We thank the NakbaNLP 2026 organizers (Fadi Zaraket, Bilal Shalash, Hadi Hamoud, Ahmad Chamseddine, Firas Ben Abid, Mustafa Jarrar, Chadi Abou Chakra, Bernard Ghanem) for access to the Omar Al-Saleh Memoir Collection. We acknowledge Sherif for the pretrained Arabic-English OCR model, and the Hugging Face community for PEFT and bitsandbytes libraries.

πŸ“œ Citation

If you use this work, please cite the paper:

@inproceedings{barmandah2026ketaba,
    title={{Ketaba-OCR at NakbaNLP 2026 Shared Task: Efficient Adaptation of Vision-Language Models for Handwritten Text Recognition}},
    author={Barmandah, Hassan and Eldin, Fatimah Emad and Al Jallad, Khloud and Nacer, Omar},
    year={2026},
    booktitle={Proceedings of the 2nd International Workshop on Nakba Narratives as Language Resources (NakbaNLP 2026)},
    publisher={RASD}
}

πŸ“„ License

This project is licensed under the Apache 2.0 License.
