# GutenOCR-3B
GutenOCR-3B is a grounded OCR front-end obtained by fine-tuning Qwen2.5-VL-3B. The resulting single-checkpoint vision-language model exposes reading, detection, and grounding through a unified, prompt-based interface.
## Overview
Trained on business documents, scientific articles, and synthetic grounding data, the model supports full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. On held-out benchmarks, GutenOCR demonstrates substantial improvements in region- and line-level OCR as well as text-detection recall compared to its base model.
Key capabilities:
- Full Text Reading: Transcribe documents with layout preservation.
- Grounded Detection: Locate specific words or lines (returning bounding boxes).
- Localized Reading: Read text within a specific user-provided bounding box.
## Quick Start (Transformers)

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# 1. Load model and processor
model_id = "rootsautomation/GutenOCR-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# 2. Prepare inputs
image = Image.open("document.png")

# Example: Read all text
prompt = "Read all text in {image} and return a single TEXT string, linearized left-to-right/top-to-bottom."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# 3. Process and generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
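The same load-and-generate pipeline serves every task; only the user prompt changes. As a convenience, the message construction can be factored into a small helper. This is a sketch, not part of the released API; the `build_messages` name and signature are assumptions:

```python
def build_messages(image, prompt: str) -> list:
    """Build the single-turn chat structure the processor expects.

    `image` can be a PIL.Image, a file path, or a URL -- anything
    qwen_vl_utils.process_vision_info accepts.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

For example, `build_messages(image, 'Ground "Total Due" in the image.')` swaps the quick-start reading prompt for a conditional-detection query without touching the rest of the pipeline.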
## Task Gallery & Examples
GutenOCR is steered by specific prompt templates. Below are the supported tasks and how to invoke them.
### Full OCR Reading

Extract text from the entire page. Outputs can be plain text, markdown-formatted, or structured JSON.

Prompt Example:

```
Return a layout-sensitive TEXT2D representation of the image.
```

Example Output:

```text
This is the text found in the document.
It preserves line breaks.
```
### Text Detection

Locate regions of text (lines, paragraphs, math) without transcribing them. Returns JSON bounding boxes.

Prompt Example:

```
Highlight all math in the image by returning their bounding boxes as a JSON array.
```

Example Output:

```json
[
  [100, 200, 400, 250],
  [500, 600, 800, 650]
]
```
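Detection output arrives as a JSON array, possibly wrapped in a ```` ```json ```` fence. A small tolerant parser is handy when scripting against the model; this is a sketch under that assumption, not part of the model's tooling:

```python
import json
import re


def parse_boxes(raw: str) -> list:
    """Extract [x1, y1, x2, y2] boxes from a detection response.

    Strips an optional ```json ... ``` fence before parsing, and
    returns [] when the model reports no detections.
    """
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    payload = match.group(1) if match else raw.strip()
    boxes = json.loads(payload) if payload else []
    return [list(map(int, box)) for box in boxes]
```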
### Localized Reading

Read the text contained strictly within a specific bounding box provided in the prompt.

Prompt Example:

```
What does it say in [100, 200, 500, 600] of the image?
```

Example Output:

```text
Content of the specific box.
```
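When scripting localized reads, the box must be rendered into the prompt in the `[x1, y1, x2, y2]` integer-pixel form the system prompt specifies (with `x1 < x2` and `y1 < y2`). A minimal formatter; the function name is an assumption and the phrasing mirrors the example above:

```python
def localized_read_prompt(bbox) -> str:
    """Render a pixel-space [x1, y1, x2, y2] box into a localized-reading prompt."""
    x1, y1, x2, y2 = (int(v) for v in bbox)
    if not (x1 < x2 and y1 < y2):
        raise ValueError("expected x1 < x2 and y1 < y2")
    return f"What does it say in [{x1}, {y1}, {x2}, {y2}] of the image?"
```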
### Conditional Detection (Search)

Find the bounding box locations of a specific query string within the image.

Prompt Example:

```
Ground "Invoice #12345" in the image.
```

Example Output:

```json
[
  [100, 200, 400, 250],
  [500, 600, 800, 650]
]
```
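If the query string itself contains double quotes, it needs escaping before being interpolated into the grounding prompt. A simple builder under that assumption (the escaping convention is not documented by the model card):

```python
def ground_prompt(query: str) -> str:
    """Build a conditional-detection ("search") prompt for a query string."""
    escaped = query.replace('"', '\\"')
    return f'Ground "{escaped}" in the image.'
```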
## System Prompt
This model relies on a specific system prompt to enforce output formats (JSON, bounding box normalization, etc.). This prompt is automatically injected by the chat template, so you generally do not need to set it manually.
Your task is to read and localize text data from documents and images.
GEOMETRY:
- Coordinates: integer pixels; origin (0,0) top-left; [x1,y1,x2,y2] with x1<x2, y1<y2.
- Clip all boxes to the image bounds; drop boxes with zero/negative area.
- Reading order: read text in natural reading order: top-to-bottom, left-to-right.
- Rotated/angled text: return the axis-aligned bounding box of the minimal enclosing rectangle (no rotated boxes).
TASK TYPES:
- reading: a full-text reading task on the entire image.
- localized_reading: read text within a specified bounding box in the image.
- detection: detect text regions in the image without transcription.
- conditional_detection: detect text regions in the image based on a provided text query.
OUTPUT TYPES:
- TEXT: one plain string; collapse multiple spaces to one; preserve line breaks. Non-grounded output only.
- TEXT2D: one plain string; preserve whitespace as layout cue (spaces + `\n` only; no coordinates). Non-grounded output only.
- LINES: JSON array of objects, corresponding to line-by-line OCR: `{"text": string, "bbox": [x1,y1,x2,y2]}`. When locally reading, only return the text: `string`.
- WORDS: JSON array of objects, corresponding to word-by-word OCR: `{"text": string, "bbox": [x1,y1,x2,y2]}`.
- PARAGRAPHS: JSON array of objects, corresponding to paragraph-wise OCR: `{"text": string, "bbox": [x1,y1,x2,y2]}`. When locally reading, only return the text: `string`.
- LATEX: JSON array of objects, corresponding to LaTeX expressions: `{"text": string, "bbox": [x1,y1,x2,y2]}`. When locally reading, only return the latex: `string`.
- BOX: JSON array of bounding boxes only: `[ [x1,y1,x2,y2], ... ]`. For detection and conditional_detection tasks only.
OUTPUT FORMAT
- For non-grounded outputs, return a string:
```text
Recognized text goes here.
```
```text2d
ABSTRACT
Recognition of text in a 2D layout.
```
If the output is empty, return an empty string:
```text
```
```text2d
```
- For grounded outputs, return a JSON array of objects when performing reading tasks.
Each object is expected to have two keys: "text" and "bbox".
The "text" key is the what and the "bbox" key is the where.
```json
[
{"text": "First line of text", "bbox": [100, 200, 400, 250]},
{"text": "Second line of text", "bbox": [100, 500, 400, 600]}
]
```
```json
[
{"text": "\\frac{a}{b}", "bbox": [525, 558, 755, 620]}
]
```
If the output is empty, return an empty JSON array:
```json
[]
```
- For detection tasks, return a JSON array of bounding boxes only.
```json
[
[100, 200, 400, 250],
[100, 500, 400, 600]
]
```
- For localized reading tasks, return the recognized text within the specified bounding box.
```text
Recognized text within the bounding box.
```
If no text is recognized within the bounding box, return an empty string:
```text
```
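The GEOMETRY rules above (clip boxes to the image bounds, drop boxes with zero or negative area) are also easy to re-apply client-side when post-processing grounded output. A sketch of such a sanitizer; the function name is an assumption:

```python
def sanitize_boxes(boxes, width: int, height: int) -> list:
    """Clip [x1, y1, x2, y2] boxes to the image bounds and drop any box
    whose clipped area is zero or negative, mirroring the system
    prompt's GEOMETRY rules."""
    cleaned = []
    for x1, y1, x2, y2 in boxes:
        x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
        y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
        if x2 > x1 and y2 > y1:
            cleaned.append([x1, y1, x2, y2])
    return cleaned
```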
## Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
  title={GutenOCR: A Grounded Vision-Language Front-End for Documents},
  author={Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew},
  year={2026},
  eprint={2601.14490},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.14490},
}
```