arxiv:2605.12623

DocAtlas: Multilingual Document Understanding Across 80+ Languages

Published on May 12

Submitted by Ahmed Heakl on May 20

Mohamed Bin Zayed University of Artificial Intelligence

AI-generated summary

The DocAtlas framework creates high-fidelity OCR datasets across 82 languages using differential rendering and synthetic generation, and demonstrates improved multilingual model adaptation through Direct Preference Optimization.

Abstract

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline.
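The abstract does not spell out how differential rendering works. One plausible reading, offered here only as an illustrative sketch and not as the authors' pipeline, is that a document is rendered twice, once as-is and once with a single component visually altered, and the pixels that change give that component's bounding box without any learned detector. The function name `diff_bbox` and the toy pixel grids below are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale pixels in row-major order


def diff_bbox(base: Image, variant: Image) -> Optional[Tuple[int, int, int, int]]:
    """Return the inclusive (left, top, right, bottom) box of differing pixels.

    Returns None when the two renders are identical.
    """
    if len(base) != len(variant) or any(
        len(a) != len(b) for a, b in zip(base, variant)
    ):
        raise ValueError("renders must have identical dimensions")

    xs: List[int] = []
    ys: List[int] = []
    for y, (row_a, row_b) in enumerate(zip(base, variant)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                xs.append(x)
                ys.append(y)

    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Toy stand-in for two renders of the same page (assumption: a real
# pipeline would rasterize the DOCX file before and after the change).
base = [[0] * 6 for _ in range(4)]
variant = [row[:] for row in base]
for y in (1, 2):
    for x in (2, 3, 4):
        variant[y][x] = 255  # the altered component occupies this region

box = diff_bbox(base, variant)
```

In this toy case `box` covers columns 2 to 4 and rows 1 to 2. Pairing such a box with the component's known text and type would give one annotation record; how DocAtlas actually encodes this in its DocTag format is not described on this page.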

Community

ahmedheakl

Paper submitter about 15 hours ago

DocAtlas is a framework for constructing high-fidelity multilingual OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks, using differential rendering to produce model-free structural annotations from native documents. Evaluating 16 models reveals persistent gaps in low-resource scripts; DPO with rendering-derived ground truth achieves stable cross-lingual transfer (+1.9% in-domain, +1.8% out-of-domain) without base-language degradation, where supervised fine-tuning collapses by up to 21%.
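For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. Mapping it onto this paper is an assumption: "chosen" would be the rendering-derived ground-truth transcription and "rejected" some less accurate output, but the page does not say how rejected samples are obtained or which `beta` is used. The function name `dpo_loss` and the default `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the chosen
    response over the rejected one, relative to a frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When the policy matches the reference, the margin is zero and the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Raising the chosen log-probability and lowering the rejected one reduces the loss.
improved = dpo_loss(-4.0, -9.0, -5.0, -5.0)
```

Because the loss depends on the policy only through its log-probability ratio to the reference model, it penalizes drifting away from that reference. This is one common intuition for why preference optimization can be gentler on existing capabilities than plain supervised fine-tuning, though the page itself reports only the empirical result.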

Get this paper in your agent:

hf papers read 2605.12623

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
