---
pipeline_tag: image-text-to-text
language:
- multilingual
tags:
- got
- vision-language
- ocr2.0
- custom_code
license: apache-2.0
---

## Same model as [stepfun-ai/GOT-OCR2_0](https://huggingface.co/stepfun-ai/GOT-OCR2_0), but with customized source code:

1. Removes the `verovio` dependency, since most people don't need to OCR musical notation.
2. Lets users run the model in `float16` if their GPU doesn't support `bfloat16` (see the sketch after this list).
3. Updates the code to support `transformers==4.48.3`, so it no longer emits a batch of warning messages.
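
For example, a minimal sketch of the dtype selection (assuming the standard `torch_dtype` keyword of `from_pretrained` and PyTorch's `torch.cuda.is_bf16_supported()` check; see the repo's custom code for the exact mechanism this fork uses):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Prefer bfloat16 when the GPU supports it; otherwise fall back to float16,
# which is the capability this fork adds.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    "ctranslate2-4you/GOT-OCR2_0-Customized", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "ctranslate2-4you/GOT-OCR2_0-Customized",
    trust_remote_code=True,
    torch_dtype=dtype,  # load the weights in the chosen half-precision dtype
    device_map="cuda",
    use_safetensors=True,
    pad_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>"),
).eval()
```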


<details><summary>ORIGINAL MODEL CARD HERE</summary>


<h1>General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
</h1>

[🔋Online Demo](https://huggingface.co/spaces/ucaslcl/GOT_online) | [🌟GitHub](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/) | [📜Paper](https://arxiv.org/abs/2409.01704)


[Haoran Wei*](https://scholar.google.com/citations?user=J4naK0MAAAAJ&hl=en), Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu,  [Zheng Ge](https://joker316701882.github.io/), Liang Zhao, [Jianjian Sun](https://scholar.google.com/citations?user=MVZrGkYAAAAJ&hl=en), [Yuang Peng](https://scholar.google.com.hk/citations?user=J0ko04IAAAAJ&hl=zh-CN&oi=ao), Chunrui Han, [Xiangyu Zhang](https://scholar.google.com/citations?user=yuB-cfoAAAAJ&hl=en)



![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6653eee7a2d7a882a805ab95/QCEFY-M_YG3Bp5fn1GQ8X.jpeg)



## Usage
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.10:
```
torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0
```


```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
model = AutoModel.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True, low_cpu_mem_usage=True, device_map='cuda', use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
model = model.eval().cuda()


# input your test image
image_file = 'xxx.jpg'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

# format texts OCR:
# res = model.chat(tokenizer, image_file, ocr_type='format')

# fine-grained OCR:
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_color='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_color='')

# multi-crop OCR:
# res = model.chat_crop(tokenizer, image_file, ocr_type='ocr')
# res = model.chat_crop(tokenizer, image_file, ocr_type='format')

# render the formatted OCR results:
# res = model.chat(tokenizer, image_file, ocr_type='format', render=True, save_render_file = './demo.html')

print(res)


```
More details about `ocr_type`, `ocr_box`, `ocr_color`, and `render` can be found in our GitHub repository.
Our training code is available at our [GitHub](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/).



## More Multimodal Projects

👏 Welcome to explore more multimodal projects of our team:

[Vary](https://github.com/Ucas-HaoranWei/Vary) | [Fox](https://github.com/ucaslcl/Fox) | [OneChart](https://github.com/LingyvKong/OneChart)

## Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

```bib
@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{liu2024focus,
  title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
  author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2405.14295},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
```

</details>
<br>

# Example Usage

```python
import fitz
from PIL import Image
from transformers import AutoModel, AutoTokenizer
import torch

# The following two lines are optional - they silence the last remaining logging message from Transformers.
# from transformers import logging as transformers_logging
# transformers_logging.set_verbosity_error()

MODEL_PATH = "ctranslate2-4you/GOT-OCR2_0-Customized"  # Replace with local path if desired
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map='cuda',
    use_safetensors=True,
    pad_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>")
)
model = model.eval().cuda()

def clean_repetitive_lines(text):
    """
    Removes repetitive lines from the OCR output before saving the .txt file. This is necessary
    because the model sometimes emits repeated-line OCR artifacts: any run of more than two
    identical consecutive lines is truncated to two instances.
    """
    lines = text.split('\n')
    cleaned_lines = []
    i = 0
    while i < len(lines):
        cleaned_lines.append(lines[i])
        # Count how many times lines[i] repeats consecutively.
        repeat_count = 1
        j = i + 1
        while j < len(lines) and lines[j] == lines[i]:
            repeat_count += 1
            j += 1
        if repeat_count > 2:
            # Keep a second copy of the line, then skip the rest of the run.
            if i + 1 < len(lines):
                cleaned_lines.append(lines[i + 1])
            i = j
        else:
            i += 1
    return '\n'.join(cleaned_lines)
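
# Example (hypothetical input): a run of three identical consecutive lines
# is truncated to two instances:
#   clean_repetitive_lines("a\na\na\nb")  ->  "a\na\nb"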

@torch.inference_mode()
def process_pdf_for_ocr(tokenizer, model, pdf_path):
    pdf_document = fitz.open(pdf_path)
    full_text = []
    
    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
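        # zoom=2 renders the page at twice PyMuPDF's default 72 dpi (i.e., 144 dpi),
        # which generally improves OCR accuracy on small text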
        zoom = 2
        matrix = fitz.Matrix(zoom, zoom)
        pix = page.get_pixmap(matrix=matrix)
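        # pix.samples holds the rendered page as a raw RGB byte buffer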
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        # gradio_input=True because we pass PIL images rendered from the PDF with PyMuPDF
        # and Pillow, rather than a file path handled by the model's internal loading code
        res = model.chat_crop(tokenizer, img, ocr_type='ocr', gradio_input=True)
        
        if res.strip():
            full_text.append(res)

    complete_text = '\n'.join(full_text)
    cleaned_text = clean_repetitive_lines(complete_text)
    
    with open("extracted_text_got_ocr.txt", "w", encoding="utf-8") as f:
        f.write(cleaned_text)
    
    pdf_document.close()
    print("Results have been saved to extracted_text_got_ocr.txt")

# Example usage
pdf_path = "path/to/your/pdf"
process_pdf_for_ocr(tokenizer, model, pdf_path)
```
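
For single images, `model.chat` from the upstream model card works unchanged with this fork. A minimal sketch (the image path is a placeholder):

```python
# Single-image OCR, reusing the tokenizer and model loaded above.
res = model.chat(tokenizer, "document.jpg", ocr_type="ocr")  # "document.jpg" is a placeholder
print(res)
```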