arXiv:2601.14490

GutenOCR: A Grounded Vision-Language Front-End for Documents

Published on Jan 20 · Submitted by Hunter Heidenreich on Jan 22

Abstract

GutenOCR fine-tunes vision-language models on diverse document types so that a single model exposes reading, detection, and grounding for document understanding through a unified prompt-based interface.

AI-generated summary

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
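Because the checkpoints are fine-tuned from Qwen2.5-VL, a natural way to drive the prompt-based interface is through the standard Hugging Face transformers Qwen2.5-VL classes. The sketch below is illustrative only: the checkpoint name, the prompt wording, and the shape of the returned answer are assumptions, not details taken from the paper.

```python
# Minimal sketch: issuing a conditional grounding query ("where is x?") to a
# GutenOCR-style checkpoint via transformers. Checkpoint id and prompt text
# are hypothetical; adapt them to the actual released model card.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Roots-Automation/GutenOCR-7B"  # hypothetical checkpoint name
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("page.png")  # any document page image

# One chat turn containing the page image and a grounding question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Where is 'Total amount due'?"},
    ],
}]

# Render the chat template, then tokenize text and image together.
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1] :]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

In the same spirit, full-page reading or line-level detection would be requested by swapping the text prompt, since the abstract describes all three capabilities as living behind one prompt interface in a single checkpoint.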

Community

Paper author and submitter:

We're excited to share our first open model release, a grounded VLM for OCR applications!

We also open-sourced our training code (for running things on a multi-GPU setup) with an Apache-2.0 license here: https://github.com/Roots-Automation/GutenOCR

