olmOCR Arabic LoRA Adapter

A LoRA (Low-Rank Adaptation) fine-tuned adapter for Arabic OCR, built on top of allenai/olmOCR-2-7B-1025.

Model Description

This adapter enhances olmOCR's ability to recognize Arabic text in documents, including:

  • Handwritten Arabic text
  • Printed Arabic documents
  • Mixed Arabic/English documents

Training Details

| Parameter | Value |
|---|---|
| Base Model | allenai/olmOCR-2-7B-1025 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Training Samples | 450,044 |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Batch Size | 64 (effective) |
| Hardware | 8x NVIDIA A100 80GB |
| Training Time | ~36 hours |
| Trainable Parameters | 47.6M (0.57% of total) |
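As a sanity check, the trainable-parameter figures above are self-consistent: 47.6M at 0.57% implies roughly 8.3B total parameters, which fits a 7B-class language model plus its vision tower.

```python
# Back-of-envelope check of the trainable-parameter row above.
trainable = 47.6e6   # 47.6M trainable LoRA parameters
fraction = 0.0057    # 0.57% of total parameters
total = trainable / fraction
print(f"implied total: {total / 1e9:.2f}B parameters")  # ~8.35B
```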

Target Modules

  • q_proj, k_proj, v_proj, o_proj (attention)
  • gate_proj, up_proj, down_proj (FFN)
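In PEFT terms, the hyperparameters and target modules above correspond to a `LoraConfig` along these lines. This is a sketch reconstructed from the table, not the exact training script; `bias` in particular is an assumption.

```python
from peft import LoraConfig

# Reconstructed from the hyperparameter table and target-module list;
# bias="none" is an assumption, not a documented setting.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # FFN
    ],
    bias="none",
)
```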

Usage

Installation

pip install "transformers>=4.47" "peft>=0.18.0" torch

Load the Model

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
import torch

# Load base model
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-2-7B-1025",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "hastyle/olmOCR-arabic-lora")

# Optional: Merge for faster inference
model = model.merge_and_unload()

# Load processor
processor = AutoProcessor.from_pretrained("allenai/olmOCR-2-7B-1025", trust_remote_code=True)

Run Inference

from PIL import Image

# Load your Arabic document image
image = Image.open("arabic_document.png")

# Create prompt (olmOCR format)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Extract the text from this document."},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

# Decode output
result = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(result)
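Depending on the downstream comparison, it can help to lightly normalize the decoded Arabic text (strip tatweel and diacritics, fold alef variants). This is a common optional post-processing step, not part of the olmOCR pipeline itself:

```python
import re

TATWEEL = "\u0640"                           # ـ elongation character
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # harakat (short vowels, shadda, sukun)

def normalize_arabic(text: str) -> str:
    """Light normalization sometimes applied before comparing OCR output.

    Optional post-processing (not part of olmOCR): drops tatweel and
    diacritics and folds alef variants into bare alef.
    """
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # آ / أ / إ -> ا
    return text
```

For example, `normalize_arabic("أَحْمَد")` returns `"احمد"`.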

Training Data

The model was fine-tuned on a combined dataset of Arabic OCR samples including:

  • Arabic handwritten documents
  • Printed Arabic text
  • Mixed-script documents

Total training samples: 450,044

Evaluation

Results (Single-Word Arabic OCR Test Set)

| Model | Samples | Corpus WER | Corpus CER | Throughput |
|---|---|---|---|---|
| Baseline (olmOCR-2-7B) | 100 | 252.00% | 184.53% | 0.56 img/s |
| This Adapter | 100 | 0.00% | 0.00% | 0.58 img/s |

Key Findings

  • Dramatic improvement: reduces corpus WER from 252% to 0% on the single-word Arabic test set
  • No speed penalty: Inference throughput remains comparable to baseline
  • Stable training: All checkpoints from steps 19500-21000 achieve identical 0% WER

The baseline model exhibits severe hallucination on Arabic text, often generating English or nonsense output. This LoRA adapter corrects this behavior entirely on the test set.
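For reference, corpus WER/CER are edit-distance rates, which is why the baseline can score above 100%: insertions count as errors, so a hallucinated hypothesis longer than the reference pushes the rate past 1.0. A minimal sketch of the metric (not the evaluation harness actually used):

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def corpus_wer(refs, hyps):
    """Total word-level edits divided by total reference words."""
    edits = sum(levenshtein(r.split(), h.split()) for r, h in zip(refs, hyps))
    return edits / sum(len(r.split()) for r in refs)

def corpus_cer(refs, hyps):
    """Total character-level edits divided by total reference characters."""
    edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)
```

A single-word reference against a hallucinated three-word hypothesis yields a WER of 300%, which is how the baseline's 252% arises on a single-word test set.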

Limitations

  • Optimized primarily for Arabic script
  • Performance may vary on extremely degraded or low-quality scans
  • Works best with documents at 150+ DPI
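A quick way to apply the DPI guideline is to upscale low-resolution scans before inference. The sketch below uses Pillow; the 150 DPI target and LANCZOS resampling are illustrative choices, and images without embedded DPI metadata are passed through unchanged.

```python
from PIL import Image

def ensure_min_dpi(img: Image.Image, target_dpi: int = 150) -> Image.Image:
    """Upscale an image whose embedded DPI is below target_dpi.

    target_dpi and LANCZOS are illustrative, not adapter requirements;
    images with no DPI metadata are returned as-is.
    """
    dpi = img.info.get("dpi", (None,))[0]
    if dpi and dpi < target_dpi:
        scale = target_dpi / dpi
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.LANCZOS,
        )
    return img
```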

Citation

If you use this model, please cite:

@misc{olmocr-arabic-lora,
  title={olmOCR Arabic LoRA Adapter},
  author={Allen Institute for AI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/hastyle/olmOCR-arabic-lora}
}

License

Apache 2.0

Framework Versions

  • PEFT: 0.18.0
  • Transformers: 4.47+
  • PyTorch: 2.0+