jonathanli committed on
Commit
7175ef2
·
verified ·
1 Parent(s): 92d6668

Upload folder using huggingface_hub

Browse files
README.md ADDED
@@ -0,0 +1,226 @@
---
license: mit
language:
- en
base_model:
- deepseek-ai/DeepSeek-OCR
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- pytorch
- text-generation-inference
- ocr
- document
- vlm
- extraction
- markdown
- layout-detection
- huggingface-hub
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/11mAxaSVGjxm-7eD4Y-EJ.png)

# **DeepSeek-OCR-Latest-BF16.I64**

> **DeepSeek-OCR-Latest-BF16.I64** is an optimized, updated build of the original [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR). It is an open-source vision-language OCR model that extracts text from images and scanned documents, including both digital and handwritten content, and can output results as plain text or Markdown. The model leverages a powerful multimodal backbone (**3B VLM**) to improve reading comprehension and layout understanding for both typed and cursive handwriting, and it excels at preserving document structure such as **headings, tables, and lists** in its outputs.

The **BF16 variant** has been updated and tested with the following environment:

```
transformers: 4.57.1
torch: 2.6.0+cu124 (also verified with torch 2.9.0)
cuda: 12.4
device: NVIDIA H200 MIG 3g.71gb
```

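Before loading the model, it can help to confirm the installed versions match the tested environment. A minimal sketch (the `pkg_version` helper is our own convenience, not part of this repository):

```python
from importlib import metadata

def pkg_version(name: str) -> str:
    # Report an installed package's version, or a placeholder when it is absent.
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

for pkg in ("transformers", "torch"):
    print(f"{pkg}: {pkg_version(pkg)}")
```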
This version allows flexible configuration of the attention implementation, such as `flash_attention_2` or `sdpa`, for performance optimization or standardization. Users can also **opt out** of a specific attention implementation and fall back to the default.

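One way to choose between these at runtime is to probe for the `flash-attn` package and fall back to SDPA otherwise. A hedged sketch; `pick_attn_impl` is a hypothetical helper of ours, not part of this repository:

```python
from importlib.util import find_spec

def pick_attn_impl() -> str:
    # Prefer FlashAttention-2 when the flash-attn package is installed;
    # otherwise fall back to PyTorch's built-in scaled-dot-product attention.
    if find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

# The result can then be passed when loading the model, e.g.:
# model = AutoModel.from_pretrained(model_name, _attn_implementation=pick_attn_impl(), ...)
print(pick_attn_impl())
```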
## Quick Start with Transformers 🤗

### Install the required packages

```
gradio
torch
torchvision
transformers==4.57.1
accelerate
matplotlib
einops
addict
easydict
```

### Run Demo

```py
import gradio as gr
import torch
import os
import tempfile
import re
from PIL import Image, ImageDraw
from transformers import AutoModel, AutoTokenizer

css = """
#main-title h1 {
    font-size: 2.3em !important;
}
#output-title h2 {
    font-size: 2.1em !important;
}
"""

print("Determining device...")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"✅ Using device: {device}")

print("Loading model and tokenizer...")
model_name = "prithivMLmods/DeepSeek-OCR-Latest-BF16.I64"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_name,
    # _attn_implementation="flash_attention_2",  # optional; uncomment to use FlashAttention-2
    trust_remote_code=True,
    use_safetensors=True,
).to(device).eval()  # move to device and set to eval mode

if device.type == "cuda":
    model = model.to(torch.bfloat16)

print("✅ Model loaded successfully and set to eval mode.")

def find_result_image(path):
    for filename in os.listdir(path):
        if "grounding" in filename or "result" in filename:
            try:
                image_path = os.path.join(path, filename)
                return Image.open(image_path)
            except Exception as e:
                print(f"Error opening result image '{filename}': {e}")
    return None

def process_ocr_task(image, model_size, task_type, ref_text):
    """Processes an image with DeepSeek-OCR. The model is already on the correct device."""
    if image is None:
        return "Please upload an image first.", None

    with tempfile.TemporaryDirectory() as output_path:
        # Build the prompt for the selected task
        if task_type == "Free OCR":
            prompt = "<image>\nFree OCR."
        elif task_type == "Convert to Markdown":
            prompt = "<image>\n<|grounding|>Convert the document to markdown."
        elif task_type == "Parse Figure":
            prompt = "<image>\nParse the figure."
        elif task_type == "Locate Object by Reference":
            if not ref_text or ref_text.strip() == "":
                raise gr.Error("For the 'Locate' task, you must provide the reference text to find!")
            prompt = f"<image>\nLocate <|ref|>{ref_text.strip()}<|/ref|> in the image."
        else:
            prompt = "<image>\nFree OCR."

        temp_image_path = os.path.join(output_path, "temp_image.png")
        image.save(temp_image_path)

        size_configs = {
            "Tiny": {"base_size": 512, "image_size": 512, "crop_mode": False},
            "Small": {"base_size": 640, "image_size": 640, "crop_mode": False},
            "Base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
            "Large": {"base_size": 1280, "image_size": 1280, "crop_mode": False},
            "Gundam (Recommended)": {"base_size": 1024, "image_size": 640, "crop_mode": True},
        }
        config = size_configs.get(model_size, size_configs["Gundam (Recommended)"])

        print(f"🏃 Running inference with prompt: {prompt}")
        text_result = model.infer(
            tokenizer,
            prompt=prompt,
            image_file=temp_image_path,
            output_path=output_path,
            base_size=config["base_size"],
            image_size=config["image_size"],
            crop_mode=config["crop_mode"],
            save_results=True,
            test_compress=True,
            eval_mode=True,
        )

        print(f"====\n📄 Text Result: {text_result}\n====")

        # Draw any <|det|> bounding boxes (normalized to 0-1000) onto the original image
        result_image_pil = None
        pattern = re.compile(r"<\|det\|>\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]<\|/det\|>")
        matches = list(pattern.finditer(text_result))

        if matches:
            print(f"✅ Found {len(matches)} bounding box(es). Drawing on the original image.")
            image_with_bboxes = image.copy()
            draw = ImageDraw.Draw(image_with_bboxes)
            w, h = image.size

            for match in matches:
                x1_norm, y1_norm, x2_norm, y2_norm = (int(c) for c in match.groups())
                x1 = int(x1_norm / 1000 * w)
                y1 = int(y1_norm / 1000 * h)
                x2 = int(x2_norm / 1000 * w)
                y2 = int(y2_norm / 1000 * h)
                draw.rectangle([x1, y1, x2, y2], outline="red", width=3)

            result_image_pil = image_with_bboxes
        else:
            print("⚠️ No bounding box coordinates found in the text result. Falling back to a saved result image, if any.")
            result_image_pil = find_result_image(output_path)

        return text_result, result_image_pil

with gr.Blocks(css=css) as demo:
    gr.Markdown("# **DeepSeek OCR [exp]**", elem_id="main-title")

    with gr.Row():
        with gr.Column(scale=1):
            image_input = gr.Image(type="pil", label="Upload Image", sources=["upload", "clipboard"])
            model_size = gr.Dropdown(choices=["Tiny", "Small", "Base", "Large", "Gundam (Recommended)"], value="Large", label="Resolution Size")
            task_type = gr.Dropdown(choices=["Free OCR", "Convert to Markdown", "Parse Figure", "Locate Object by Reference"], value="Convert to Markdown", label="Task Type")
            ref_text_input = gr.Textbox(label="Reference Text (for Locate task)", placeholder="e.g., the teacher, 20-10, a red car...", visible=False)
            submit_btn = gr.Button("Process Image", variant="primary")

        with gr.Column(scale=2):
            output_text = gr.Textbox(label="Output (OCR)", lines=8, show_copy_button=True)
            output_image = gr.Image(label="Layout Detection (If Any)", type="pil")

    with gr.Accordion("Note", open=False):
        gr.Markdown("Inference uses Hugging Face Transformers on NVIDIA GPUs. This app runs with transformers 4.57.1 and torch 2.6.0.")

    def toggle_ref_text_visibility(task):
        return gr.Textbox(visible=True) if task == "Locate Object by Reference" else gr.Textbox(visible=False)

    task_type.change(fn=toggle_ref_text_visibility, inputs=task_type, outputs=ref_text_input)
    submit_btn.click(fn=process_ocr_task, inputs=[image_input, model_size, task_type, ref_text_input], outputs=[output_text, output_image])

if __name__ == "__main__":
    demo.queue(max_size=20).launch(share=True, mcp_server=True, ssr_mode=False)
```
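For reference, the grounding tags in the model output encode boxes on a 0-1000 normalized scale. The demo's rescaling step can be isolated into a small standalone helper (`det_boxes_to_pixels` is our name for it, not part of the model's API):

```python
import re

DET_PATTERN = re.compile(r"<\|det\|>\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]<\|/det\|>")

def det_boxes_to_pixels(text, width, height):
    """Convert 0-1000 normalized <|det|> boxes into pixel coordinates."""
    boxes = []
    for match in DET_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(c) for c in match.groups())
        boxes.append((
            int(x1 / 1000 * width),
            int(y1 / 1000 * height),
            int(x2 / 1000 * width),
            int(y2 / 1000 * height),
        ))
    return boxes

sample = "<|det|>[[100, 200, 500, 800]]<|/det|>"
print(det_boxes_to_pixels(sample, 2000, 1000))
# [(200, 200, 1000, 800)]
```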

## Model and Resource Links

| Resource Type | Description | Link |
|----------------|--------------|------|
| Original Model Card | Official DeepSeek-OCR release by deepseek-ai | [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) |
| Test Model (StrangerZone HF) | Community test deployment (experimental) | [strangervisionhf/deepseek-ocr-latest-transformers](https://huggingface.co/strangervisionhf/deepseek-ocr-latest-transformers) |
| Standard Model Card | Optimized version supporting Transformers v4.57.1 (BF16 precision) | [DeepSeek-OCR-Latest-BF16.I64](https://huggingface.co/prithivMLmods/DeepSeek-OCR-Latest-BF16.I64) |
| Research Paper | DeepSeek-OCR: Contexts Optical Compression | [arXiv:2510.18234](https://huggingface.co/papers/2510.18234) |
| Demo Space | Interactive demo hosted on Hugging Face Spaces | [DeepSeek-OCR Experimental Demo](https://huggingface.co/spaces/prithivMLmods/DeepSeek-OCR-experimental) |
config.json ADDED
@@ -0,0 +1,123 @@
{
    "architectures": [
        "DeepseekOCRForCausalLMWithImages"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "aux_loss_alpha": 0.001,
    "bos_token_id": 0,
    "candidate_resolutions": [
        [
            1024,
            1024
        ]
    ],
    "dtype": "bfloat16",
    "eos_token_id": 1,
    "first_k_dense_replace": 1,
    "global_view_pos": "head",
    "head_dim": 128,
    "hidden_act": "silu",
    "hidden_size": 1280,
    "initializer_range": 0.02,
    "intermediate_size": 6848,
    "kv_lora_rank": null,
    "language_config": {
        "architectures": [
            "DeepseekV2ForCausalLM"
        ],
        "bos_token_id": 0,
        "eos_token_id": 1,
        "first_k_dense_replace": 1,
        "hidden_size": 1280,
        "intermediate_size": 6848,
        "kv_lora_rank": null,
        "lm_head": true,
        "max_position_embeddings": 8192,
        "moe_intermediate_size": 896,
        "n_group": 1,
        "n_routed_experts": 64,
        "n_shared_experts": 2,
        "num_attention_heads": 10,
        "num_experts_per_tok": 6,
        "num_hidden_layers": 12,
        "num_key_value_heads": 10,
        "q_lora_rank": null,
        "qk_nope_head_dim": 0,
        "qk_rope_head_dim": 0,
        "rm_head": false,
        "topk_group": 1,
        "topk_method": "greedy",
        "torch_dtype": "bfloat16",
        "use_mla": false,
        "v_head_dim": 0,
        "vocab_size": 129280
    },
    "lm_head": true,
    "max_position_embeddings": 8192,
    "model_type": "deepseek_ocr",
    "mlp_bias": false,
    "moe_intermediate_size": 896,
    "n_group": 1,
    "n_routed_experts": 64,
    "n_shared_experts": 2,
    "norm_topk_prob": false,
    "num_attention_heads": 10,
    "num_experts_per_tok": 6,
    "num_hidden_layers": 12,
    "num_key_value_heads": 10,
    "projector_config": {
        "input_dim": 2048,
        "model_type": "mlp_projector",
        "n_embed": 1280,
        "projector_type": "linear"
    },
    "q_lora_rank": null,
    "qk_nope_head_dim": 0,
    "qk_rope_head_dim": 0,
    "rm_head": false,
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 10000.0,
    "routed_scaling_factor": 1.0,
    "seq_aux": true,
    "tie_word_embeddings": false,
    "tile_tag": "2D",
    "topk_group": 1,
    "topk_method": "greedy",
    "transformers_version": "4.57.1",
    "use_cache": true,
    "use_mla": false,
    "v_head_dim": 0,
    "vision_config": {
        "image_size": 1024,
        "mlp_ratio": 3.7362,
        "model_name": "deeplip_b_l",
        "model_type": "vision",
        "width": {
            "clip-l-14-224": {
                "heads": 16,
                "image_size": 224,
                "layers": 24,
                "patch_size": 14,
                "width": 1024
            },
            "sam_vit_b": {
                "downsample_channels": [
                    512,
                    1024
                ],
                "global_attn_indexes": [
                    2,
                    5,
                    8,
                    11
                ],
                "heads": 12,
                "layers": 12,
                "width": 768
            }
        }
    },
    "vocab_size": 129280
}
generation_config.json ADDED
@@ -0,0 +1,6 @@
{
    "_from_model_config": true,
    "bos_token_id": 0,
    "eos_token_id": 1,
    "transformers_version": "4.57.1"
}
model-00001-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:434bc1cedd423e845da7330bf6318c9cac93005b3e87c56b389e10258037542c
size 1498396488
model-00002-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6057c19818470d2266e8bf97dc71a06cc4e9ad54431afbdb4e0533099faa6eaa
size 1498738600
model-00003-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7dbac95949f58f750d35fadb7279da6d08cb4d56a7140d21c35040959dee74e6
size 1498738600
model-00004-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:845d6b1d20f7f7ffc5f3286bf241ebf6628ee3d013f38b87e34ac8980d22618f
size 1496146096
model-00005-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:78f470c3bfa68a500d5797ae9e2e8dc2adb0949ef42ab3140e9c9c22796fc968
size 680526088
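As a quick sanity check, summing the shard sizes from the LFS pointers above and dividing by two bytes per BF16 weight lands at roughly 3.3B parameters, consistent with the 3B backbone described in the README:

```python
# Shard sizes in bytes, as listed in the safetensors LFS pointers above.
shard_sizes = [
    1_498_396_488,
    1_498_738_600,
    1_498_738_600,
    1_496_146_096,
    680_526_088,
]

total_bytes = sum(shard_sizes)
approx_params = total_bytes // 2  # BF16 stores each parameter in 2 bytes

print(f"total: {total_bytes / 1e9:.2f} GB, ~{approx_params / 1e9:.2f}B parameters")
# total: 6.67 GB, ~3.34B parameters
```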
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,27 @@
{
    "additional_special_tokens": [
        "<|User|>",
        "<|Assistant|>"
    ],
    "bos_token": {
        "content": "<|begin▁of▁sentence|>",
        "lstrip": false,
        "normalized": false,
        "rstrip": false,
        "single_word": false
    },
    "eos_token": {
        "content": "<|end▁of▁sentence|>",
        "lstrip": false,
        "normalized": false,
        "rstrip": false,
        "single_word": false
    },
    "pad_token": {
        "content": "<|▁pad▁|>",
        "lstrip": false,
        "normalized": false,
        "rstrip": false,
        "single_word": false
    }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff