---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---
# OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
OCRVerse is the first holistic, end-to-end OCR method that unifies text-centric and vision-centric OCR in a single model. Text-centric OCR recognizes text elements in images and scanned documents. Vision-centric OCR recognizes visual elements in information-dense images such as charts, web pages, and scientific plots. Together they address the growing need to manage and apply large volumes of multimodal data. The model is trained with a two-stage multi-domain method, supervised fine-tuning (SFT) followed by reinforcement learning (RL), to improve cross-domain fusion.
- **Paper:** [OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models](https://huggingface.co/papers/2601.21639)
- **Repository:** [https://github.com/DocTron-hub/OCRVerse](https://github.com/DocTron-hub/OCRVerse)
## Sample Usage
OCRVerse can be used with the `transformers` library. Please ensure you have `transformers >= 4.57.0` installed.
```bash
pip install "transformers>=4.57.0"
```
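If you want to confirm the version requirement programmatically before loading the model, a minimal sketch along these lines may help. The helper names (`release_tuple`, `meets_minimum`, `installed_transformers_ok`) are illustrative assumptions and are not part of OCRVerse or `transformers`. The comparison only looks at the numeric release segment, so pre-release suffixes such as `.dev0` are ignored.

```python
from __future__ import annotations

import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '4.57.0.dev0' -> (4, 57, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "4.57.0") -> bool:
    """Compare release segments, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_transformers_ok(minimum: str = "4.57.0") -> bool:
    """Return False if transformers is missing or older than `minimum`."""
    try:
        return meets_minimum(version("transformers"), minimum)
    except PackageNotFoundError:
        return False


print(installed_transformers_ok())
```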
### Text-Centric Task
This example demonstrates how to use OCRVerse for document parsing tasks.
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "path/to/your/image.jpg"  # Example: "./assets/text_centric_test.jpg"
# We recommend the following prompt for better performance, since it was used throughout training.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
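Because the recommended prompt asks for tables as HTML, it can be handy to pull those tables out of `output_text[0]` as plain Python lists. The following is a minimal sketch using only the standard library. It assumes the model emits well-formed, non-nested `<table>` markup, which is not guaranteed. `TableRows`, `parse_tables`, and `sample_output` are hypothetical names used for illustration, and the sample string is made up rather than real model output.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def parse_tables(markup):
    """Return every table found in `markup` as rows of cell strings."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    return parser.tables


# Made-up stand-in for output_text[0]
sample_output = (
    "# Report\n\n"
    "<table><tr><th>Quarter</th><th>Revenue</th></tr>"
    "<tr><td>Q1</td><td>10</td></tr></table>\n"
)
print(parse_tables(sample_output))
```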
### Vision-Centric Task
Below is an example of how to use OCRVerse for vision-centric tasks, such as generating Python code from a chart image.
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "path/to/your/image.png"  # Example: "./assets/vision_centric_test.png"
# Adjacent string literals are concatenated; "\n" marks the prompt's line breaks.
prompt = (
    "You are an expert Python developer who specializes in writing matplotlib code based on a given picture. "
    "I found a very nice picture in a STEM paper, but there is no corresponding source code available. "
    "I need your help to generate the Python code that can reproduce the picture based on the picture I provide.\n"
    "Note that it is necessary to use figsize=(7.0, 5.0) to set the image size to match the original size.\n"
    "Now, please give me the matplotlib code that reproduces the picture below."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
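The generated answer may wrap the matplotlib code in a Markdown code fence or surround it with explanatory text. The exact output format is an assumption here, not something this card specifies. A small sketch for extracting the code and checking that it at least parses is shown below. `extract_python_code`, `is_valid_python`, and the sample `response` are hypothetical and only illustrate the idea.

```python
import ast
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown

# Matches an opening fence with an optional language tag, then the code, then a closing fence
_FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:python|py)?[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def extract_python_code(response):
    """Return the first fenced code block, or the whole response if unfenced."""
    match = _FENCED_BLOCK.search(response)
    code = match.group(1) if match else response
    return code.strip() + "\n"


def is_valid_python(code):
    """Check syntax only; the code is parsed but never executed."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


# Made-up stand-in for output_text[0]
response = (
    "Here is the code:\n"
    + FENCE + "python\n"
    + "import matplotlib.pyplot as plt\n"
    + "fig, ax = plt.subplots(figsize=(7.0, 5.0))\n"
    + FENCE + "\n"
)
code = extract_python_code(response)
print(code)
print(is_valid_python(code))
```

In the example above you would pass `output_text[0]` as `response`. A syntax check does not make generated code safe, so run it only in an environment you trust or in a sandbox.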
## Citation
```bibtex
@misc{zhong2026ocrverse,
title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
author={Yufeng Zhong and Lei Chen and Xuanle Zhao and Wenkang Han and Liming Zheng and Jing Huang and Deyang Jiang and Yilin Cao and Lin Ma and Zhixiong Zeng},
year={2026},
eprint={2601.21639},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.21639},
}
```