nielsr (HF Staff) committed 893ff51 (verified) · 1 Parent(s): c7735f7

Add model card for OCRVerse

Hi! I'm Niels from the Hugging Face community team.

This PR adds a comprehensive model card for **OCRVerse**. It includes:
- Relevant metadata: `pipeline_tag`, `library_name`, and `license`.
- A link to the paper: [OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models](https://huggingface.co/papers/2601.21639).
- A link to the official GitHub repository: [https://github.com/DocTron-hub/OCRVerse](https://github.com/DocTron-hub/OCRVerse).
- Sample usage code for both text-centric and vision-centric OCR tasks, directly derived from the official GitHub README.

Please review and merge if this looks good to you!

Files changed (1)
  1. README.md +145 -0
README.md ADDED
@@ -0,0 +1,145 @@
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

OCRVerse is the first end-to-end method for holistic OCR, unifying text-centric and vision-centric OCR in a single model. It addresses the demand for managing and applying massive amounts of multimodal data by recognizing both textual elements in images and scanned documents (text-centric OCR) and visual elements in visually information-dense sources such as charts, web pages, and scientific plots (vision-centric OCR). The model is trained with a two-stage SFT-RL multi-domain method for improved cross-domain fusion.

- **Paper:** [OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models](https://huggingface.co/papers/2601.21639)
- **Repository:** [https://github.com/DocTron-hub/OCRVerse](https://github.com/DocTron-hub/OCRVerse)

## Sample Usage

OCRVerse can be used with the `transformers` library. Please ensure you have `transformers >= 4.57.0` installed:

```bash
pip install "transformers>=4.57.0"
```
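
To confirm the installed version meets this requirement before loading the model, a quick sanity check can be run in Python. This is a minimal sketch; the `meets_min` helper is ours for illustration and is not part of `transformers`:

```python
import importlib.metadata

def meets_min(ver: str, minimum=(4, 57, 0)) -> bool:
    # Numerically compare a dotted version string against a minimum tuple,
    # ignoring any non-numeric suffix such as ".dev0".
    parts = []
    for piece in ver.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad short versions like "4.57"
        parts.append(0)
    return tuple(parts) >= minimum

try:
    installed = importlib.metadata.version("transformers")
    print(installed, "OK" if meets_min(installed) else "too old, please upgrade")
except importlib.metadata.PackageNotFoundError:
    print("transformers is not installed")
```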

### Text-Centric Task

This example demonstrates how to use OCRVerse for document-parsing tasks.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load model and processor
model_path = "DocTron/OCRVerse"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "path/to/your/image.jpg"  # Example: "./assets/text_centric_test.jpg"
# We recommend the following prompt for best performance, since it was used throughout training.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

### Vision-Centric Task

Below is an example of how to use OCRVerse for vision-centric tasks, such as generating Python code from a chart image.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load model and processor
model_path = "DocTron/OCRVerse"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "path/to/your/image.png"  # Example: "./assets/vision_centric_test.png"
prompt = (
    "You are an expert Python developer who specializes in writing matplotlib code "
    "based on a given picture. I found a very nice picture in a STEM paper, but there "
    "is no corresponding source code available. I need your help to generate the Python "
    "code that can reproduce the picture based on the picture I provide.\n"
    "Note that it is necessary to use figsize=(7.0, 5.0) to set the image size to match the original size.\n"
    "Now, please give me the matplotlib code that reproduces the picture below."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
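
Since this prompt asks for runnable matplotlib code, the reply typically arrives wrapped in a fenced code block. A small post-processing sketch can pull that block out so it can be saved to a `.py` file and executed; the `extract_python_code` helper below is our own illustration, not part of the OCRVerse repository:

```python
import re

FENCE = chr(96) * 3  # literal triple backtick, built this way to avoid nesting it in markdown

def extract_python_code(text: str) -> str:
    # Return the body of the first fenced python block, or the text unchanged
    # if no fence is found.
    pattern = FENCE + r"python\s*\n(.*?)" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else text

# Hypothetical model reply wrapping code in a fence
reply = (
    "Here is the code:\n"
    + FENCE + "python\n"
    + "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])\n"
    + FENCE
)
print(extract_python_code(reply))
```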

## Citation

```bibtex
@misc{zhong2026ocrverse,
      title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
      author={Yufeng Zhong and Lei Chen and Xuanle Zhao and Wenkang Han and Liming Zheng and Jing Huang and Deyang Jiang and Yilin Cao and Lin Ma and Zhixiong Zeng},
      year={2026},
      eprint={2601.21639},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.21639},
}
```