Add model card and metadata

#1
by nielsr - opened
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

OCRVerse is a holistic OCR model trained end to end that unifies text-centric OCR (e.g., documents, books) and vision-centric OCR (e.g., charts, web pages, scientific plots). It bridges the gap between recognizing text elements and identifying visual elements in information-dense images.

- **Paper:** [OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models](https://huggingface.co/papers/2601.21639)
- **Repository:** [DocTron-hub/OCRVerse](https://github.com/DocTron-hub/OCRVerse)

## Usage Example

The following is a simple example of how to use OCRVerse for document parsing tasks.

### Installation

```shell
pip install "transformers>=4.57.0"
```

### Inference

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load the model and processor
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "YOUR_IMAGE_PATH"
# We recommend the following prompt for best performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

# Trim the prompt tokens from the generated sequence before decoding
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

## Citation

```bibtex
@misc{zhong2026ocrverse,
      title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
      author={Yufeng Zhong and Lei Chen and Xuanle Zhao and Wenkang Han and Liming Zheng and Jing Huang and Deyang Jiang and Yilin Cao and Lin Ma and Zhixiong Zeng},
      year={2026},
      eprint={2601.21639},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.21639},
}
```