arXiv:2601.14490

GutenOCR: A Grounded Vision-Language Front-End for Documents

Published on Jan 20 · Submitted by Hunter Heidenreich on Jan 22

Abstract

GutenOCR fine-tunes vision-language models on diverse document types so that a single model exposes reading, detection, and grounding for document understanding through a unified prompt-based interface.

AI-generated summary

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
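Because the checkpoints are fine-tuned from Qwen2.5-VL, a natural way to drive the prompt-based interface is through the standard Hugging Face transformers Qwen2.5-VL classes. The sketch below is illustrative only: the checkpoint name, the prompt wording, and the shape of the returned answer are assumptions, not details taken from the paper.

```python
# Minimal sketch: issuing a conditional grounding query ("where is x?") to a
# GutenOCR-style checkpoint via transformers. Checkpoint id and prompt text
# are hypothetical; adapt them to the actual released model card.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Roots-Automation/GutenOCR-7B"  # hypothetical checkpoint name
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("page.png")  # any document page image

# One chat turn containing the page image and a grounding question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Where is 'Total amount due'?"},
    ],
}]

# Render the chat template, then tokenize text and image together.
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1] :]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

In the same spirit, full-page reading or line-level detection would be requested by swapping the text prompt, since the abstract describes all three capabilities as living behind one prompt interface in a single checkpoint.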

Community

Paper author and submitter:

We're excited to share our first open model release, a grounded VLM for OCR applications!

We also open-sourced our training code (for running things on a multi-GPU setup) with an Apache-2.0 license here: https://github.com/Roots-Automation/GutenOCR

