webdev8710 committed on
Commit
13f6d79
·
verified ·
1 Parent(s): cc2e035

Upload 4 files

Files changed (5)
  1. .gitattributes +1 -0
  2. DeepSeek_OCR_paper.pdf +3 -0
  3. LICENSE +21 -0
  4. README.md +238 -3
  5. requirements.txt +9 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ DeepSeek_OCR_paper.pdf filter=lfs diff=lfs merge=lfs -text
DeepSeek_OCR_paper.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5191ef012b406c86d7ca8cf5a286d24d9e758428a59d149913fe0517fa59c6ac
3
+ size 7591202
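
The three added lines are a Git LFS pointer, not the PDF itself: a spec version, a SHA-256 object ID, and the file size in bytes. A minimal sketch of reading such a pointer, assuming the simple `key value` line layout shown above; `parse_lfs_pointer` is a hypothetical helper name, not part of this repository:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a dict (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5191ef012b406c86d7ca8cf5a286d24d9e758428a59d149913fe0517fa59c6ac
size 7591202
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], info["size_bytes"])  # sha256 7591202
```

The size field suggests the actual PDF is roughly 7.6 MB and is stored in LFS rather than in the Git history.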
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 DeepSeek
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,238 @@
1
- ---
2
- license: mit
3
- ---
1
+ <!-- markdownlint-disable first-line-h1 -->
2
+ <!-- markdownlint-disable html -->
3
+ <!-- markdownlint-disable no-duplicate-header -->
4
+
5
+
6
+ <div align="center">
7
+ <img src="assets/logo.svg" width="60%" alt="DeepSeek AI" />
8
+ </div>
9
+
10
+
11
+ <hr>
12
+ <div align="center">
13
+ <a href="https://www.deepseek.com/" target="_blank">
14
+ <img alt="Homepage" src="assets/badge.svg" />
15
+ </a>
16
+ <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR" target="_blank">
17
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
18
+ </a>
19
+
20
+ </div>
21
+
22
+ <div align="center">
23
+
24
+ <a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
25
+ <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
26
+ </a>
27
+ <a href="https://twitter.com/deepseek_ai" target="_blank">
28
+ <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
29
+ </a>
30
+
31
+ </div>
32
+
33
+
34
+
35
+ <p align="center">
36
+ <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"><b>📥 Model Download</b></a> |
37
+ <a href="https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf"><b>📄 Paper Link</b></a> |
38
+ <a href="https://arxiv.org/abs/2510.18234"><b>📄 arXiv Paper Link</b></a>
39
+ </p>
40
+
41
+ <h2>
42
+ <p align="center">
43
+ <a href="">DeepSeek-OCR: Contexts Optical Compression</a>
44
+ </p>
45
+ </h2>
46
+
47
+ <p align="center">
48
+ <img src="assets/fig1.png" style="width: 1000px" align=center>
49
+ </p>
50
+ <p align="center">
51
+ <a href="">Explore the boundaries of visual-text compression.</a>
52
+ </p>
53
+
54
+ ## Release
55
+ - [2025/10/23]🚀🚀🚀 DeepSeek-OCR is now officially supported in upstream [vLLM](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-OCR.html#installing-vllm). Thanks to the [vLLM](https://github.com/vllm-project/vllm) team for their help.
56
+ - [2025/10/20]🚀🚀🚀 We release DeepSeek-OCR, a model to investigate the role of vision encoders from an LLM-centric viewpoint.
57
+
58
+ ## Contents
59
+ - [Install](#install)
60
+ - [vLLM Inference](#vllm-inference)
61
+ - [Transformers Inference](#transformers-inference)
62
+
63
+
64
+
65
+
66
+
67
+ ## Install
68
+ >Our environment is CUDA 11.8 with torch 2.6.0.
69
+ 1. Clone this repository and navigate to the DeepSeek-OCR folder
70
+ ```bash
71
+ git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
72
+ ```
73
+ 2. Conda
74
+ ```Shell
75
+ conda create -n deepseek-ocr python=3.12.9 -y
76
+ conda activate deepseek-ocr
77
+ ```
78
+ 3. Packages
79
+
80
+ - download the vllm-0.8.5 [whl](https://github.com/vllm-project/vllm/releases/tag/v0.8.5)
81
+ ```Shell
82
+ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
83
+ pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
84
+ pip install -r requirements.txt
85
+ pip install flash-attn==2.7.3 --no-build-isolation
86
+ ```
87
+ **Note:** if you want the vLLM and Transformers code to run in the same environment, you can ignore an installation error such as: vllm 0.8.5+cu118 requires transformers>=4.51.1
88
+
89
+ ## vLLM-Inference
90
+ - VLLM:
91
+ >**Note:** change `INPUT_PATH`/`OUTPUT_PATH` and other settings in `DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py`
92
+ ```Shell
93
+ cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
94
+ ```
95
+ 1. image: streaming output
96
+ ```Shell
97
+ python run_dpsk_ocr_image.py
98
+ ```
99
+ 2. pdf: concurrency ~2500 tokens/s (on an A100-40G)
100
+ ```Shell
101
+ python run_dpsk_ocr_pdf.py
102
+ ```
103
+ 3. batch eval for benchmarks
104
+ ```Shell
105
+ python run_dpsk_ocr_eval_batch.py
106
+ ```
107
+
108
+ **[2025/10/23] The version of upstream [vLLM](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-OCR.html#installing-vllm):**
109
+
110
+ ```shell
111
+ uv venv
112
+ source .venv/bin/activate
113
+ # Until v0.11.1 release, you need to install vLLM from nightly build
114
+ uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
115
+ ```
116
+
117
+ ```python
118
+ from vllm import LLM, SamplingParams
119
+ from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
120
+ from PIL import Image
121
+
122
+ # Create model instance
123
+ llm = LLM(
124
+     model="deepseek-ai/DeepSeek-OCR",
125
+     enable_prefix_caching=False,
126
+     mm_processor_cache_gb=0,
127
+     logits_processors=[NGramPerReqLogitsProcessor]
128
+ )
129
+
130
+ # Prepare batched input with your image file
131
+ image_1 = Image.open("path/to/your/image_1.png").convert("RGB")
132
+ image_2 = Image.open("path/to/your/image_2.png").convert("RGB")
133
+ prompt = "<image>\nFree OCR."
134
+
135
+ model_input = [
136
+     {
137
+         "prompt": prompt,
138
+         "multi_modal_data": {"image": image_1}
139
+     },
140
+     {
141
+         "prompt": prompt,
142
+         "multi_modal_data": {"image": image_2}
143
+     }
144
+ ]
145
+
146
+ sampling_param = SamplingParams(
147
+     temperature=0.0,
148
+     max_tokens=8192,
149
+     # ngram logit processor args
150
+     extra_args=dict(
151
+         ngram_size=30,
152
+         window_size=90,
153
+         whitelist_token_ids={128821, 128822}, # whitelist: <td>, </td>
154
+     ),
155
+     skip_special_tokens=False,
156
+ )
157
+ # Generate output
158
+ model_outputs = llm.generate(model_input, sampling_param)
159
+
160
+ # Print output
161
+ for output in model_outputs:
162
+     print(output.outputs[0].text)
163
+ ```
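
The `extra_args` in the example above configure an n-gram logits processor, but the README does not describe how it works. The sketch below shows one common form of such a rule, assuming it bans any token that would complete an n-gram already present in the last `window_size` tokens, unless the token is whitelisted. `banned_next_tokens` is a hypothetical helper written for illustration; it is not the vLLM API and may differ from the actual `NGramPerReqLogitsProcessor`.

```python
def banned_next_tokens(token_ids, ngram_size, window_size, whitelist=frozenset()):
    """Return token IDs that would repeat an n-gram seen in the recent window.

    Illustrative only: an assumed reading of ngram_size / window_size /
    whitelist_token_ids, not the actual vLLM implementation.
    """
    if len(token_ids) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(token_ids[-(ngram_size - 1):]) if ngram_size > 1 else ()
    start = max(0, len(token_ids) - window_size)
    banned = set()
    for i in range(start, len(token_ids) - ngram_size + 1):
        ngram = tuple(token_ids[i:i + ngram_size])
        # An earlier n-gram with the same prefix marks its last token as a repeat.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    # Whitelisted tokens (e.g. table cell tags) may repeat freely.
    return banned - set(whitelist)


history = [1, 2, 3, 1, 2]
print(banned_next_tokens(history, ngram_size=3, window_size=90))                 # {3}
print(banned_next_tokens(history, ngram_size=3, window_size=90, whitelist={3}))  # set()
```

Under this reading, whitelisting `<td>` and `</td>` would keep long tables with many repeated cells from being cut short by the repetition rule.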
164
+ ## Transformers-Inference
165
+ - Transformers
166
+ ```python
167
+ from transformers import AutoModel, AutoTokenizer
168
+ import torch
169
+ import os
170
+ os.environ["CUDA_VISIBLE_DEVICES"] = '0'
171
+ model_name = 'deepseek-ai/DeepSeek-OCR'
172
+
173
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
174
+ model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
175
+ model = model.eval().cuda().to(torch.bfloat16)
176
+
177
+ # prompt = "<image>\nFree OCR. "
178
+ prompt = "<image>\n<|grounding|>Convert the document to markdown. "
179
+ image_file = 'your_image.jpg'
180
+ output_path = 'your/output/dir'
181
+
182
+ res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path = output_path, base_size = 1024, image_size = 640, crop_mode=True, save_results = True, test_compress = True)
183
+ ```
184
+ Or run the provided script:
185
+ ```Shell
186
+ cd DeepSeek-OCR-master/DeepSeek-OCR-hf
187
+ python run_dpsk_ocr.py
188
+ ```
189
+ ## Support-Modes
190
+ The current open-source model supports the following modes:
191
+ - Native resolution:
192
+ - Tiny: 512×512 (64 vision tokens)✅
193
+ - Small: 640×640 (100 vision tokens)✅
194
+ - Base: 1024×1024 (256 vision tokens)✅
195
+ - Large: 1280×1280 (400 vision tokens)✅
196
+ - Dynamic resolution
197
+ - Gundam: n×640×640 + 1×1024×1024 ✅
198
+
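
The token counts listed for the native modes are all consistent with (side / 64)², for example (640 / 64)² = 100. The sketch below checks that relation against the four listed modes and, assuming Gundam mode simply sums its views (n tiles at 640×640 plus one 1024×1024 view), estimates its token count. The formula and the function names are inferences from the numbers above, not something this README states.

```python
def vision_tokens(side: int) -> int:
    """Vision tokens for a square input, assuming (side / 64) ** 2."""
    return (side // 64) ** 2


# Mode name -> (side length in pixels, token count listed in the README).
NATIVE_MODES = {
    "Tiny": (512, 64),
    "Small": (640, 100),
    "Base": (1024, 256),
    "Large": (1280, 400),
}

for name, (side, listed) in NATIVE_MODES.items():
    assert vision_tokens(side) == listed, name


def gundam_tokens(n_tiles: int) -> int:
    """Assumed total for Gundam mode: n local 640x640 tiles + one 1024x1024 view."""
    return n_tiles * vision_tokens(640) + vision_tokens(1024)


print(gundam_tokens(4))  # 4 * 100 + 256 = 656
```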
199
+ ## Prompt examples
200
+ ```python
201
+ # document: <image>\n<|grounding|>Convert the document to markdown.
202
+ # other image: <image>\n<|grounding|>OCR this image.
203
+ # without layouts: <image>\nFree OCR.
204
+ # figures in document: <image>\nParse the figure.
205
+ # general: <image>\nDescribe this image in detail.
206
+ # rec: <image>\nLocate <|ref|>xxxx<|/ref|> in the image.
207
+ # '先天下之忧而忧'
208
+ ```
209
+
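
The prompt strings above can be kept in one place and selected by task. A minimal sketch, assuming the strings are used exactly as listed; the dictionary keys and the `locate_prompt` helper are hypothetical names chosen for illustration:

```python
# Task name (hypothetical key) -> prompt string as listed in the README.
PROMPTS = {
    "document": "<image>\n<|grounding|>Convert the document to markdown.",
    "other_image": "<image>\n<|grounding|>OCR this image.",
    "free_ocr": "<image>\nFree OCR.",
    "figure": "<image>\nParse the figure.",
    "general": "<image>\nDescribe this image in detail.",
}


def locate_prompt(text: str) -> str:
    """Build the 'rec' prompt that asks the model to locate `text` in the image."""
    return f"<image>\nLocate <|ref|>{text}<|/ref|> in the image."


print(repr(PROMPTS["free_ocr"]))
print(repr(locate_prompt("先天下之忧而忧")))
```

The resulting string can be passed as `prompt` in either the vLLM or the Transformers example.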
210
+
211
+ ## Visualizations
212
+ <table>
213
+ <tr>
214
+ <td><img src="assets/show1.jpg" style="width: 500px"></td>
215
+ <td><img src="assets/show2.jpg" style="width: 500px"></td>
216
+ </tr>
217
+ <tr>
218
+ <td><img src="assets/show3.jpg" style="width: 500px"></td>
219
+ <td><img src="assets/show4.jpg" style="width: 500px"></td>
220
+ </tr>
221
+ </table>
222
+
223
+
224
+ ## Acknowledgement
225
+
226
+ We would like to thank [Vary](https://github.com/Ucas-HaoranWei/Vary/), [GOT-OCR2.0](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/), [MinerU](https://github.com/opendatalab/MinerU), [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [OneChart](https://github.com/LingyvKong/OneChart), and [Slow Perception](https://github.com/Ucas-HaoranWei/Slow-Perception) for their valuable models and ideas.
227
+
228
+ We also appreciate the benchmarks [Fox](https://github.com/ucaslcl/Fox) and [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
229
+
230
+ ## Citation
231
+
232
+ ```bibtex
233
+ @article{wei2025deepseek,
234
+ title={DeepSeek-OCR: Contexts Optical Compression},
235
+ author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
236
+ journal={arXiv preprint arXiv:2510.18234},
237
+ year={2025}
238
+ }
requirements.txt ADDED
@@ -0,0 +1,9 @@
1
+ transformers==4.46.3
2
+ tokenizers==0.20.3
3
+ PyMuPDF
4
+ img2pdf
5
+ einops
6
+ easydict
7
+ addict
8
+ Pillow
9
+ numpy
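
Only `transformers` and `tokenizers` are pinned here, and the README notes that vLLM 0.8.5 asks for a newer `transformers`. A small sketch for checking what is actually installed against these pins, using the standard-library `importlib.metadata`; the helper names `parse_requirements` and `check_environment` are hypothetical, and the parser handles only the plain `name` and `name==version` forms used in this file:

```python
from importlib import metadata

REQUIREMENTS = """\
transformers==4.46.3
tokenizers==0.20.3
PyMuPDF
img2pdf
einops
easydict
addict
Pillow
numpy
"""


def parse_requirements(text: str) -> dict:
    """Map package name -> pinned version, or None when unpinned."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        pins[name] = version if sep else None
    return pins


def check_environment(pins: dict) -> dict:
    """Report 'ok', 'missing', or a mismatch message for each package."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if wanted is None or installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"mismatch: installed {installed}, pinned {wanted}"
    return report


pins = parse_requirements(REQUIREMENTS)
for package, status in check_environment(pins).items():
    print(f"{package}: {status}")
```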