# GutenOCR-3B
GutenOCR-3B is a grounded OCR front-end obtained by fine-tuning Qwen2.5-VL-3B. The resulting single-checkpoint vision-language model exposes reading, detection, and grounding through a unified, prompt-based interface.
## Overview
Trained on business documents, scientific articles, and synthetic grounding data, the model supports full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. On held-out benchmarks, GutenOCR demonstrates substantial improvements in region- and line-level OCR as well as text-detection recall compared to its base model.
Key capabilities:
- Full Text Reading: Transcribe documents with layout preservation.
- Grounded Detection: Locate specific words or lines (returning bounding boxes).
- Localized Reading: Read text within a specific user-provided bounding box.
## Quick Start (Transformers)

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# 1. Load model and processor
model_id = "rootsautomation/GutenOCR-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# 2. Prepare inputs
image = Image.open("document.png")

# Example: Read all text
prompt = "Read all text in {image} and return a single TEXT string, linearized left-to-right/top-to-bottom."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# 3. Process and generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
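The same load-and-generate pipeline serves every task; only the user prompt changes. As a convenience, the message construction can be factored into a small helper. This is a sketch, not part of the released API; the `build_messages` name and signature are assumptions:

```python
def build_messages(image, prompt: str) -> list:
    """Build the single-turn chat structure the processor expects.

    `image` can be a PIL.Image, a file path, or a URL -- anything
    qwen_vl_utils.process_vision_info accepts.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

For example, `build_messages(image, 'Ground "Total Due" in the image.')` swaps the quick-start reading prompt for a conditional-detection query without touching the rest of the pipeline.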
## Task Gallery & Examples
GutenOCR is steered by specific prompt templates. Below are the supported tasks and how to invoke them.
### Full OCR Reading

Extract text from the entire page. Outputs can be plain text, markdown-formatted, or structured JSON.

Prompt Example:

```
Return a layout-sensitive TEXT2D representation of the image.
```

Example Output:

```text
This is the text found in the document.
It preserves line breaks.
```
### Text Detection

Locate regions of text (lines, paragraphs, math) without transcribing them. Returns JSON bounding boxes.

Prompt Example:

```
Highlight all math in the image by returning their bounding boxes as a JSON array.
```

Example Output:

```json
[
  [100, 200, 400, 250],
  [500, 600, 800, 650]
]
```
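Detection output arrives as a JSON array, possibly wrapped in a ```` ```json ```` fence. A small tolerant parser is handy when scripting against the model; this is a sketch under that assumption, not part of the model's tooling:

```python
import json
import re


def parse_boxes(raw: str) -> list:
    """Extract [x1, y1, x2, y2] boxes from a detection response.

    Strips an optional ```json ... ``` fence before parsing, and
    returns [] when the model reports no detections.
    """
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    payload = match.group(1) if match else raw.strip()
    boxes = json.loads(payload) if payload else []
    return [list(map(int, box)) for box in boxes]
```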
### Localized Reading

Read the text contained strictly within a specific bounding box provided in the prompt.

Prompt Example:

```
What does it say in [100, 200, 500, 600] of the image?
```

Example Output:

```text
Content of the specific box.
```
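When scripting localized reads, the box must be rendered into the prompt in the `[x1, y1, x2, y2]` integer-pixel form the system prompt specifies (with `x1 < x2` and `y1 < y2`). A minimal formatter; the function name is an assumption and the phrasing mirrors the example above:

```python
def localized_read_prompt(bbox) -> str:
    """Render a pixel-space [x1, y1, x2, y2] box into a localized-reading prompt."""
    x1, y1, x2, y2 = (int(v) for v in bbox)
    if not (x1 < x2 and y1 < y2):
        raise ValueError("expected x1 < x2 and y1 < y2")
    return f"What does it say in [{x1}, {y1}, {x2}, {y2}] of the image?"
```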
### Conditional Detection (Search)

Find the bounding box locations of a specific query string within the image.

Prompt Example:

```
Ground "Invoice #12345" in the image.
```

Example Output:

```json
[
  [100, 200, 400, 250],
  [500, 600, 800, 650]
]
```
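If the query string itself contains double quotes, it needs escaping before being interpolated into the grounding prompt. A simple builder under that assumption (the escaping convention is not documented by the model card):

```python
def ground_prompt(query: str) -> str:
    """Build a conditional-detection ("search") prompt for a query string."""
    escaped = query.replace('"', '\\"')
    return f'Ground "{escaped}" in the image.'
```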
## System Prompt
This model relies on a specific system prompt to enforce output formats (JSON, bounding box normalization, etc.). This prompt is automatically injected by the chat template, so you generally do not need to set it manually.
Your task is to read and localize text data from documents and images.
GEOMETRY:
- Coordinates: integer pixels; origin (0,0) top-left; [x1,y1,x2,y2] with x1<x2, y1<y2.
- Clip all boxes to the image bounds; drop boxes with zero/negative area.
- Reading order: read text in natural reading order: top-to-bottom, left-to-right.
- Rotated/angled text: return the axis-aligned bounding box of the minimal enclosing rectangle (no rotated boxes).
TASK TYPES:
- reading: a full-text reading task on the entire image.
- localized_reading: read text within a specified bounding box in the image.
- detection: detect text regions in the image without transcription.
- conditional_detection: detect text regions in the image based on a provided text query.
OUTPUT TYPES:
- TEXT: one plain string; collapse multiple spaces to one; preserve line breaks. Non-grounded output only.
- TEXT2D: one plain string; preserve whitespace as layout cue (spaces + `\n` only; no coordinates). Non-grounded output only.
- LINES: JSON array of objects, corresponding to line-by-line OCR: `{"text": string, "bbox": [x1,y1,x2,y2]}`. When locally reading, only return the text: `string`.
- WORDS: JSON array of objects, corresponding to word-by-word OCR: `{"text": string, "bbox": [x1,y1,x2,y2]}`.
- PARAGRAPHS: JSON array of objects, corresponding to paragraph-wise OCR: `{"text": string, "bbox": [x1,y1,x2,y2]}`. When locally reading, only return the text: `string`.
- LATEX: JSON array of objects, corresponding to LaTeX expressions: `{"text": string, "bbox": [x1,y1,x2,y2]}`. When locally reading, only return the latex: `string`.
- BOX: JSON array of bounding boxes only: `[ [x1,y1,x2,y2], ... ]`. For detection and conditional_detection tasks only.
OUTPUT FORMAT
- For non-grounded outputs, return a string:
```text
Recognized text goes here.
```
```text2d
ABSTRACT
Recognition of text in a 2D layout.
```
If the output is empty, return an empty string:
```text
```
```text2d
```
- For grounded outputs, return a JSON array of objects when performing reading tasks.
Each object is expected to have two keys: "text" and "bbox".
The "text" key is the what and the "bbox" key is the where.
```json
[
{"text": "First line of text", "bbox": [100, 200, 400, 250]},
{"text": "Second line of text", "bbox": [100, 500, 400, 600]}
]
```
```json
[
{"text": "\\frac{a}{b}", "bbox": [525, 558, 755, 620]}
]
```
If the output is empty, return an empty JSON array:
```json
[]
```
- For detection tasks, return a JSON array of bounding boxes only.
```json
[
[100, 200, 400, 250],
[100, 500, 400, 600]
]
```
- For localized reading tasks, return the recognized text within the specified bounding box.
```text
Recognized text within the bounding box.
```
If no text is recognized within the bounding box, return an empty string:
```text
```
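The GEOMETRY rules above (clip boxes to the image bounds, drop boxes with zero or negative area) are also easy to re-apply client-side when post-processing grounded output. A sketch of such a sanitizer; the function name is an assumption:

```python
def sanitize_boxes(boxes, width: int, height: int) -> list:
    """Clip [x1, y1, x2, y2] boxes to the image bounds and drop any box
    whose clipped area is zero or negative, mirroring the system
    prompt's GEOMETRY rules."""
    cleaned = []
    for x1, y1, x2, y2 in boxes:
        x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
        y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
        if x2 > x1 and y2 > y1:
            cleaned.append([x1, y1, x2, y2])
    return cleaned
```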
## Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
  title={GutenOCR: A Grounded Vision-Language Front-End for Documents},
  author={Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew},
  year={2026},
  eprint={2601.14490},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.14490},
}
```