kristaller486 committed · verified
Commit dc3f34c · 1 parent: 5f7c324

update readme

Files changed (1): README.md (+739, −737)
---
license: mit
library_name: dots_ocr_1_5
pipeline_tag: image-text-to-text
tags:
- image-to-text
- ocr
- document-parse
- layout
- table
- formula
- transformers
- custom_code
language:
- en
- zh
- multilingual
---

## The model was removed from Hugging Face, so I re-uploaded it here from the ModelScope repo (the MIT license allows this).

<div align="center">

<p align="center">
 <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/logo.png" width="300"/>
</p>

<h1 align="center">
dots.ocr-1.5: Recognize Any Human Scripts and Symbols
</h1>

[![HuggingFace](https://img.shields.io/badge/HuggingFace%20Weights-black.svg?logo=HuggingFace)](https://huggingface.co/rednote-hilab/dots.ocr-1.5)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/rednote-hilab/dots.ocr)

<div align="center">
  <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> |
  <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> |
  <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a>
</div>

</div>


## Introduction

We present **dots.ocr-1.5**, a 3B-parameter multimodal model composed of a 1.2B vision encoder and a 1.7B language model. Designed for universal accessibility, it can recognize virtually any human script. Beyond achieving state-of-the-art (SOTA) performance in standard multilingual document parsing among models of comparable size, dots.ocr-1.5 excels at converting structured graphics (e.g., charts and diagrams) directly into SVG code, parsing web screens, and spotting scene text. Furthermore, the model demonstrates competitive performance on general OCR, object grounding, and counting tasks.

1. **Stronger Document Parsing Performance:** dots.ocr-1.5 maintains SOTA performance among the latest OCR models, particularly on **multilingual documents**. To address the significant bias inherent in the detection & matching rules of certain benchmarks, which often fail to accurately reflect a model's true capabilities, we adopted an **Elo score** evaluation system. Under this metric, the performance landscape shifts significantly, highlighting the superior robustness of our model compared to conventional rankings.
2. **Unified Vision-Language Parsing**: Visual languages (e.g., charts, graphics, chemical formulas, logos) encapsulate dense human knowledge, akin to natural language. dots.ocr-1.5 unifies the interpretation of these elements by parsing them directly into SVG code. We have validated the effectiveness of this approach, demonstrating impressive results in structural and semantic recognition.
3. **Broader and More General Capabilities**: Compared to dots.ocr, dots.ocr-1.5 supports a significantly wider array of tasks. It extends beyond standard OCR to handle web screen parsing, scene text spotting, object grounding & counting, and other general OCR QA tasks.

## Evaluation

### 1. Document Parsing

#### 1.1 Elo scores of recent models across benchmarks
<table>
<thead>
<tr>
<th>Models</th>
<th>olmOCR-Bench</th>
<th>OmniDocBench (v1.5)</th>
<th>XDocParse</th>
</tr>
</thead>
<tbody>
<tr>
<td>GLM-OCR</td>
<td>859.9</td>
<td>937.5</td>
<td>742.1</td>
</tr>
<tr>
<td>PaddleOCR-VL-1.5</td>
<td>873.6</td>
<td>965.6</td>
<td>797.6</td>
</tr>
<tr>
<td>HunyuanOCR</td>
<td>978.9</td>
<td>974.4</td>
<td>895.9</td>
</tr>
<tr>
<td>dots.ocr</td>
<td>1027.4</td>
<td>994.7</td>
<td>1133.4</td>
</tr>
<!-- Highlighting dots.ocr-1.5 row with bold tags -->
<tr>
<td><strong>dots.ocr-1.5</strong></td>
<td><strong>1089.0</strong></td>
<td><strong>1025.8</strong></td>
<td><strong>1157.1</strong></td>
</tr>
<tr>
<td>Gemini 3 Pro</td>
<td>1171.2</td>
<td>1102.1</td>
<td>1273.9</td>
</tr>
</tbody>
</table>


> **Notes:**
> - Results for Gemini 3 Pro, PaddleOCR-VL-1.5, and GLM-OCR were obtained via APIs, while HunyuanOCR results were generated using local inference.
> - The Elo score evaluation was conducted using Gemini 3 Flash. The prompt can be found at: [Elo Score Prompt](https://github.com/rednote-hilab/dots.ocr/blob/master/tools/elo_score_prompt.py). These results are consistent with the findings on [OCR Arena](https://www.ocrarena.ai/battle).

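For intuition, pairwise Elo scoring of this kind can be sketched as below. This is an illustrative stand-in, not the authors' exact pipeline: a judge model compares two parses of the same page, and the winner's rating rises. The model names, initial rating of 1000, and K-factor of 32 are hypothetical; the update rule itself is standard Elo.

```python
def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, a: str, b: str, winner: str, k: float = 32.0) -> None:
    """Apply one pairwise judgment in place; winner is 'a', 'b', or 'tie'."""
    e_a = expected(ratings[a], ratings[b])
    s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[a] += k * (s_a - e_a)          # winner gains what the loser loses
    ratings[b] += k * ((1.0 - s_a) - (1.0 - e_a))

ratings = {"model_x": 1000.0, "model_y": 1000.0}
for _ in range(20):            # model_x wins every head-to-head judgment
    update(ratings, "model_x", "model_y", "a")
```

Because each update is zero-sum, the rating pool is conserved; only the relative ordering of models carries meaning, which is why Elo is less sensitive to any single benchmark's matching rules.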
#### 1.2 olmOCR-bench
<table>
<thead>
<tr>
<th>Model</th>
<th>ArXiv</th>
<th>Old scans math</th>
<th>Tables</th>
<th>Old scans</th>
<th>Headers & footers</th>
<th>Multi column</th>
<th>Long tiny text</th>
<th>Base</th>
<th>Overall</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mistral OCR API</td>
<td>77.2</td>
<td>67.5</td>
<td>60.6</td>
<td>29.3</td>
<td>93.6</td>
<td>71.3</td>
<td>77.1</td>
<td>99.4</td>
<td>72.0±1.1</td>
</tr>
<tr>
<td>Marker 1.10.1</td>
<td>83.8</td>
<td>66.8</td>
<td>72.9</td>
<td>33.5</td>
<td>86.6</td>
<td>80.0</td>
<td>85.7</td>
<td>99.3</td>
<td>76.1±1.1</td>
</tr>
<tr>
<td>MinerU 2.5.4*</td>
<td>76.6</td>
<td>54.6</td>
<td>84.9</td>
<td>33.7</td>
<td>96.6</td>
<td>78.2</td>
<td>83.5</td>
<td>93.7</td>
<td>75.2±1.1</td>
</tr>
<tr>
<td>DeepSeek-OCR</td>
<td>77.2</td>
<td>73.6</td>
<td>80.2</td>
<td>33.3</td>
<td>96.1</td>
<td>66.4</td>
<td>79.4</td>
<td>99.8</td>
<td>75.7±1.0</td>
</tr>
<tr>
<td>Nanonets-OCR2-3B</td>
<td>75.4</td>
<td>46.1</td>
<td>86.8</td>
<td>40.9</td>
<td>32.1</td>
<td>81.9</td>
<td>93.0</td>
<td>99.6</td>
<td>69.5±1.1</td>
</tr>
<tr>
<td>PaddleOCR-VL*</td>
<td>85.7</td>
<td>71.0</td>
<td>84.1</td>
<td>37.8</td>
<td>97.0</td>
<td>79.9</td>
<td>85.7</td>
<td>98.5</td>
<td>80.0±1.0</td>
</tr>
<tr>
<td>Infinity-Parser 7B*</td>
<td>84.4</td>
<td>83.8</td>
<td>85.0</td>
<td>47.9</td>
<td>88.7</td>
<td>84.2</td>
<td>86.4</td>
<td>99.8</td>
<td>82.5±?</td>
</tr>
<tr>
<td>olmOCR v0.4.0</td>
<td>83.0</td>
<td>82.3</td>
<td>84.9</td>
<td>47.7</td>
<td>96.1</td>
<td>83.7</td>
<td>81.9</td>
<td>99.7</td>
<td>82.4±1.1</td>
</tr>
<tr>
<td>Chandra OCR 0.1.0*</td>
<td>82.2</td>
<td>80.3</td>
<td>88.0</td>
<td>50.4</td>
<td>90.8</td>
<td>81.2</td>
<td>92.3</td>
<td>99.9</td>
<td>83.1±0.9</td>
</tr>
<tr>
<td>dots.ocr</td>
<td>82.1</td>
<td>64.2</td>
<td>88.3</td>
<td>40.9</td>
<td>94.1</td>
<td>82.4</td>
<td>81.2</td>
<td>99.5</td>
<td>79.1±1.0</td>
</tr>
<tr>
<td><strong>dots.ocr-1.5</strong></td>
<td><strong>85.9</strong></td>
<td><strong>85.5</strong></td>
<td><strong>90.7</strong></td>
<td>48.2</td>
<td>94.0</td>
<td><strong>85.3</strong></td>
<td>81.6</td>
<td>99.7</td>
<td><strong>83.9±0.9</strong></td>
</tr>
</tbody>
</table>


> **Note:**
> - The metrics are from [olmOCR](https://github.com/allenai/olmocr) and our own internal evaluations.
> - We remove the Page-header and Page-footer cells from the result markdown.

#### 1.3 Other Benchmarks

<table>
<thead>
<tr>
<th>Model Type</th>
<th>Methods</th>
<th>Size</th>
<th>OmniDocBench(v1.5)<br>Text Edit↓</th>
<th>OmniDocBench(v1.5)<br>Reading Order Edit↓</th>
<th>pdf-parse-bench</th>
</tr>
</thead>
<tbody>
<!-- GeneralVLMs Group (Reversed Order, 3 rows) -->
<tr>
<td rowspan="3"><strong>GeneralVLMs</strong></td>
<td>Gemini-2.5 Pro</td>
<td>-</td>
<td>0.075</td>
<td>0.097</td>
<td>9.06</td>
</tr>
<tr>
<td>Qwen3-VL-235B-A22B-Instruct</td>
<td>235B</td>
<td>0.069</td>
<td>0.068</td>
<td><strong>9.71</strong></td>
</tr>
<tr>
<td>Gemini 3 Pro</td>
<td>-</td>
<td>0.066</td>
<td>0.079</td>
<td>9.68</td>
</tr>
<!-- SpecializedVLMs Group (Reversed Order, 12 rows) -->
<tr>
<td rowspan="12"><strong>SpecializedVLMs</strong></td>
<td>Mistral OCR</td>
<td>-</td>
<td>0.164</td>
<td>0.144</td>
<td>8.84</td>
</tr>
<tr>
<td>DeepSeek-OCR</td>
<td>3B</td>
<td>0.073</td>
<td>0.086</td>
<td>8.26</td>
</tr>
<tr>
<td>MonkeyOCR-3B</td>
<td>3B</td>
<td>0.075</td>
<td>0.129</td>
<td>9.27</td>
</tr>
<tr>
<td>OCRVerse</td>
<td>4B</td>
<td>0.058</td>
<td>0.071</td>
<td>-</td>
</tr>
<tr>
<td>MonkeyOCR-pro-3B</td>
<td>3B</td>
<td>0.075</td>
<td>0.128</td>
<td>-</td>
</tr>
<tr>
<td>MinerU2.5</td>
<td>1.2B</td>
<td>0.047</td>
<td>0.044</td>
<td>-</td>
</tr>
<tr>
<td>PaddleOCR-VL</td>
<td>0.9B</td>
<td>0.035</td>
<td>0.043</td>
<td>9.51</td>
</tr>
<tr>
<td>HunyuanOCR</td>
<td>0.9B</td>
<td>0.042</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>PaddleOCR-VL-1.5</td>
<td>0.9B</td>
<td>0.035</td>
<td>0.042</td>
<td>-</td>
</tr>
<tr>
<td>GLM-OCR</td>
<td>0.9B</td>
<td>0.04</td>
<td>0.043</td>
<td>-</td>
</tr>
<tr>
<td>dots.ocr</td>
<td>3B</td>
<td>0.048</td>
<td>0.053</td>
<td>9.29</td>
</tr>
<tr>
<td><u><strong>dots.ocr-1.5</strong></u></td>
<td>3B</td>
<td><strong>0.031</strong></td>
<td><strong>0.029</strong></td>
<td>9.54</td>
</tr>
</tbody>
</table>

> **Note:**
> - Metrics are sourced from [OmniDocBench](https://github.com/opendatalab/OmniDocBench) and other model publications. [pdf-parse-bench](https://github.com/phorn1/pdf-parse-bench) results were reproduced with Qwen3-VL-235B-A22B-Instruct.
> - Formula and Table metrics for OmniDocBench (v1.5) are omitted due to their high sensitivity to detection and matching protocols.

### 2. Vision-Language Parsing
Visual languages (e.g., charts, graphics, chemical formulas, logos) encapsulate dense human knowledge. **dots.ocr-1.5** unifies the interpretation of these elements by parsing them directly into **SVG code**.

<table>
<thead>
<tr>
<th rowspan="2" style="text-align: left;">Methods</th>
<th colspan="3">UniSVG</th>
<th rowspan="2">ChartMimic</th>
<th rowspan="2">Design2Code</th>
<th rowspan="2">GenExam</th>
<th rowspan="2">SciGen</th>
<th rowspan="2">ChemDraw</th>
</tr>
<tr>
<th>Low-Level</th>
<th>High-Level</th>
<th>Score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">OCRVerse</td>
<td>0.632</td>
<td>0.852</td>
<td>0.763</td>
<td>0.799</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>0.881</td>
</tr>
<tr>
<td style="text-align: left;">Gemini 3 Pro</td>
<td>0.563</td>
<td>0.850</td>
<td>0.735</td>
<td>0.788</td>
<td>0.760</td>
<td>0.756</td>
<td>0.783</td>
<td>0.839</td>
</tr>
<tr>
<td style="text-align: left;">dots.ocr-1.5</td>
<td>0.850</td>
<td>0.923</td>
<td>0.894</td>
<td>0.772</td>
<td>0.801</td>
<td>0.664</td>
<td>0.660</td>
<td>0.790</td>
</tr>
<tr>
<td style="text-align: left;"><strong>dots.ocr-1.5-svg</strong></td>
<td><strong>0.860</strong></td>
<td><strong>0.931</strong></td>
<td><strong>0.902</strong></td>
<td><strong>0.905</strong></td>
<td><strong>0.834</strong></td>
<td><strong>0.800</strong></td>
<td><strong>0.797</strong></td>
<td><strong>0.901</strong></td>
</tr>
</tbody>
</table>


> **Note:**
> - We use the ISVGEN metric from [UniSVG](https://ryanlijinke.github.io/) to evaluate parsing results. For benchmarks that do not natively support image parsing, we use the original images as input and calculate the ISVGEN score between the rendered output and the original image.
> - [OCRVerse](https://github.com/DocTron-hub/OCRVerse) results are derived from various code formats (e.g., SVG, Python), whereas results for Gemini 3 Pro and dots.ocr-1.5 are based specifically on SVG code.
> - Due to the capacity constraints of a 3B-parameter VLM, dots.ocr-1.5 may not yet excel at every task, such as SVG generation. To complement it, we are simultaneously releasing dots.ocr-1.5-svg. We plan to further address these limitations in future updates.

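To illustrate the render-and-compare idea in the first note (this is not the actual ISVGEN implementation): the generated SVG is rasterized to the source image's size with an SVG renderer, then scored against the original pixels. The sketch below shows only a simple pixel-agreement step on plain grayscale grids, standing in for the real metric.

```python
def pixel_similarity(img_a, img_b):
    """Mean per-pixel agreement between two equally sized grayscale
    grids (values 0-255), in [0, 1]; 1.0 means identical images."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same dimensions")
    diff = sum(abs(a - b) for a, b in zip(flat_a, flat_b))
    return 1.0 - diff / (255.0 * len(flat_a))

original = [[0, 255], [255, 0]]   # tiny checkerboard stand-in for the source image
rendered = [[0, 255], [255, 0]]   # rasterized model SVG; here a perfect reconstruction
score = pixel_similarity(original, rendered)
```

In practice one would rasterize with a library such as cairosvg before comparing, and the published ISVGEN score combines structural and semantic terms rather than raw pixel agreement.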
### 3. General Vision Tasks

<table>
<thead>
<tr>
<th>Model</th>
<th>CharXiv_descriptive</th>
<th>CharXiv_reasoning</th>
<th>OCR_Reasoning</th>
<th>InfoVQA</th>
<th>DocVQA</th>
<th>ChartQA</th>
<th>OCRBench</th>
<th>AI2D</th>
<th>CountBenchQA</th>
<th>RefCOCO</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3-VL-2B-Instruct</td>
<td>62.3</td>
<td>26.8</td>
<td>-</td>
<td>72.4</td>
<td>93.3</td>
<td>-</td>
<td>85.8</td>
<td>76.9</td>
<td>88.4</td>
<td>-</td>
</tr>
<tr>
<td><strong>dots.ocr-1.5</strong></td>
<td>77.4</td>
<td>55.3</td>
<td>22.85</td>
<td>73.76</td>
<td>91.85</td>
<td>83.2</td>
<td>86.0</td>
<td>82.16</td>
<td>94.46</td>
<td>80.03</td>
</tr>
</tbody>
</table>



# Quick Start
## 1. Installation
### Install dots.ocr-1.5
```shell
conda create -n dots_ocr python=3.12
conda activate dots_ocr

git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# Install PyTorch; see https://pytorch.org/get-started/previous-versions/ for your CUDA version
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -e .
```

If you have trouble with the installation, try our [Docker image](https://hub.docker.com/r/rednotehilab/dots.ocr) for an easier setup, and follow these steps:
```shell
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr
pip install -e .
```


### Download Model Weights
> 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR_1_5` instead of `dots.ocr-1.5`) for the model save path. This is a temporary workaround pending our integration with Transformers.
```shell
python3 tools/download_model.py
```

## 2. Deployment
### vLLM inference
We highly recommend using vLLM for deployment and inference.

```shell
# launch vllm server
## dots.ocr-1.5
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.ocr-1.5 --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

## dots.ocr-1.5-svg
CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.ocr-1.5-svg --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code

# vllm api demo
## document parsing
python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
## web parsing
python3 ./demo/demo_vllm.py --prompt_mode prompt_web_parsing --image_path ./assets/showcase_dots_ocr_1_5/origin/webpage_1.png
## scene spotting
python3 ./demo/demo_vllm.py --prompt_mode prompt_scene_spotting --image_path ./assets/showcase_dots_ocr_1_5/origin/scene_1.jpg
## image parsing with svg code
python3 ./demo/demo_vllm_svg.py --prompt_mode prompt_image_to_svg
## general qa
python3 ./demo/demo_vllm_general.py
```

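Since the server launched above exposes vLLM's OpenAI-compatible API (with `--served-model-name model`), you can also call it without the demo scripts. The sketch below only builds the JSON request body in the OpenAI chat format with a base64 data URL for the image; the endpoint URL, prompt text, and image bytes are placeholders, not values from this repo.

```python
import base64
import json

def build_chat_request(image_bytes: bytes, prompt: str, model: str = "model") -> dict:
    """Build the JSON body for one image+text chat completion."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": model,  # must match --served-model-name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

payload = build_chat_request(b"\xff\xd8placeholder-jpeg-bytes", "Parse the document layout.")
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via requests or urllib).
```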
### Hugging Face inference
```shell
python3 demo/demo_hf.py
```

<details>
<summary><b>Hugging Face inference details</b></summary>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info
from dots_ocr.utils import dict_promptmode_to_prompt  # built-in prompt templates

model_path = "./weights/DotsOCR_1_5"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "demo/demo_image1.jpg"
prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]

2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].

3. Text Extraction & Formatting Rules:
    - Picture: For the 'Picture' category, the text field should be omitted.
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others (Text, Title, etc.): Format their text as Markdown.

4. Constraints:
    - The output text must be the original text from the image, with no translation.
    - All layout elements must be sorted according to human reading order.

5. Final Output: The entire output must be a single JSON object.
"""

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path
            },
            {"type": "text", "text": prompt}
        ]
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=24000)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

</details>

## 3. Document Parse
**Based on the vLLM server**, you can parse an image or a PDF file using the following commands:
```bash
# Parse all layout info, both detection and recognition
# Parse a single image
python3 dots_ocr/parser.py demo/demo_image1.jpg
# Parse a single PDF
python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_thread 64  # try a larger --num_thread for PDFs with many pages

# Layout detection only
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

# Parse text only, except Page-header and Page-footer
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr
```

<details>
<summary><b>Output Results</b></summary>

1. **Structured Layout Data** (`demo_image1.json`): A JSON file containing the detected layout elements, including their bounding boxes, categories, and extracted text.
2. **Processed Markdown File** (`demo_image1.md`): A Markdown file generated from the concatenated text of all detected cells.
    * An additional version, `demo_image1_nohf.md`, is also provided, which excludes page headers and footers for compatibility with benchmarks like OmniDocBench and olmOCR-bench.
3. **Layout Visualization** (`demo_image1.jpg`): The original image with the detected layout bounding boxes drawn on it.

</details>

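The JSON-to-Markdown concatenation described above can be sketched as follows. The cell schema (`bbox`, `category`, `text`) is inferred from the parsing prompt and the output description, so treat the field names and the helper itself as assumptions rather than the parser's confirmed API.

```python
def cells_to_markdown(cells, keep_header_footer=True):
    """Concatenate cell texts in the given (reading) order, skipping
    Picture cells and, optionally, page headers/footers as the
    *_nohf.md variant does. `cells` is a list of dicts with "bbox",
    "category", and (except for pictures) "text" keys."""
    skipped = {"Picture"}
    if not keep_header_footer:
        skipped |= {"Page-header", "Page-footer"}
    parts = [c["text"] for c in cells
             if c["category"] not in skipped and c.get("text")]
    return "\n\n".join(parts)

# Hypothetical parser output for one page, already in reading order
cells = [
    {"bbox": [0, 0, 500, 40], "category": "Page-header", "text": "Journal"},
    {"bbox": [0, 60, 500, 120], "category": "Title", "text": "# A Title"},
    {"bbox": [0, 140, 500, 300], "category": "Picture"},
]
markdown_nohf = cells_to_markdown(cells, keep_header_footer=False)
```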
## 4. Demo
Have fun with the [live demo](https://dotsocr.xiaohongshu.com/).


### Examples for document parsing
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula1.png" alt="formula1.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table3.png" alt="table3.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/Tibetan.png" alt="Tibetan.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/tradition_zh.png" alt="tradition_zh.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/nl.png" alt="nl.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/kannada.png" alt="kannada.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/russian.png" alt="russian.png" border="0" />


### Examples for image parsing
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_1.png" alt="svg_1.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_2.png" alt="svg_2.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_4.png" alt="svg_4.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_5.png" alt="svg_5.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_6.png" alt="svg_6.png" border="0" />

> **Note:**
> - Results generated by dots.ocr-1.5-svg.

### Examples for web parsing
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/webpage_1.png" alt="webpage_1.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/webpage_2.png" alt="webpage_2.png" border="0" />

### Examples for scene spotting
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/scene_1.png" alt="scene_1.png" border="0" />
<img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/scene_2.png" alt="scene_2.png" border="0" />

733
-
734
- - **Complex Document Elements:**
735
- - **Table&Formula**: The extraction of complex tables and mathematical formulas persists as a difficult task given the model's compact architecture.
736
- - **Picture**: We have adopted an SVG code representation for parsing structured graphics; however, the performance has yet to achieve the desired level of robustness.
737
-
 
 
738
  - **Parsing Failures:** While we have reduced the rate of parsing failures compared to the previous version, these issues may still occur occasionally. We remain committed to further resolving these edge cases in future updates.
 
294
+ <td>9.06</td>
295
+ </tr>
296
+ <tr>
297
+ <td>Qwen3-VL-235B-A22B-Instruct</td>
298
+ <td>235B</td>
299
+ <td>0.069</td>
300
+ <td>0.068</td>
301
+ <td><strong>9.71</strong></td>
302
+ </tr>
303
+ <tr>
304
+ <td>gemini3pro</td>
305
+ <td>-</td>
306
+ <td>0.066</td>
307
+ <td>0.079</td>
308
+ <td>9.68</td>
309
+ </tr>
310
+ <!-- SpecializedVLMs Group (Reversed Order, 12 rows) -->
311
+ <tr>
312
+ <td rowspan="12"><strong>SpecializedVLMs</strong></td>
313
+ <td>Mistral OCR</td>
314
+ <td>-</td>
315
+ <td>0.164</td>
316
+ <td>0.144</td>
317
+ <td>8.84</td>
318
+ </tr>
319
+ <tr>
320
+ <td>Deepseek-OCR</td>
321
+ <td>3B</td>
322
+ <td>0.073</td>
323
+ <td>0.086</td>
324
+ <td>8.26</td>
325
+ </tr>
326
+ <tr>
327
+ <td>MonkeyOCR-3B</td>
328
+ <td>3B</td>
329
+ <td>0.075</td>
330
+ <td>0.129</td>
331
+ <td>9.27</td>
332
+ </tr>
333
+ <tr>
334
+ <td>OCRVerse</td>
335
+ <td>4B</td>
336
+ <td>0.058</td>
337
+ <td>0.071</td>
338
+ <td>--</td>
339
+ </tr>
340
+ <tr>
341
+ <td>MonkeyOCR-pro-3B</td>
342
+ <td>3B</td>
343
+ <td>0.075</td>
344
+ <td>0.128</td>
345
+ <td>-</td>
346
+ </tr>
347
+ <tr>
348
+ <td>MinerU2.5</td>
349
+ <td>1.2B</td>
350
+ <td>0.047</td>
351
+ <td>0.044</td>
352
+ <td>-</td>
353
+ </tr>
354
+ <tr>
355
+ <td>PaddleOCR-VL</td>
356
+ <td>0.9B</td>
357
+ <td>0.035</td>
358
+ <td>0.043</td>
359
+ <td>9.51</td>
360
+ </tr>
361
+ <tr>
362
+ <td>HunyuanOCR</td>
363
+ <td>0.9B</td>
364
+ <td>0.042</td>
365
+ <td>-</td>
366
+ <td>-</td>
367
+ </tr>
368
+ <tr>
369
+ <td>PaddleOCR-VL1.5</td>
370
+ <td>0.9B</td>
371
+ <td>0.035</td>
372
+ <td>0.042</td>
373
+ <td>-</td>
374
+ </tr>
375
+ <tr>
376
+ <td>GLMOCR</td>
377
+ <td>0.9B</td>
378
+ <td>0.04</td>
379
+ <td>0.043</td>
380
+ <td>-</td>
381
+ </tr>
382
+ <tr>
383
+ <td>dots.ocr</td>
384
+ <td>3B</td>
385
+ <td>0.048</td>
386
+ <td>0.053</td>
387
+ <td>9.29</td>
388
+ </tr>
389
+ <tr>
390
+ <td><u><strong>dots.ocr-1.5</strong></u></td>
391
+ <td>3B</td>
392
+ <td><strong>0.031</strong></td>
393
+ <td><strong>0.029</strong></td>
394
+ <td>9.54</td>
395
+ </tr>
396
+ </tbody>
397
+ </table>
398
+
399
+ > **Note:**
400
+ > - Metrics are sourced from [OmniDocBench](https://github.com/opendatalab/OmniDocBench) and other model publications. [pdf-parse-bench](https://github.com/phorn1/pdf-parse-bench) results are reproduced by Qwen3-VL-235B-A22B-Instruct.
401
+ > - Formula and Table metrics for OmniDocBench1.5 are omitted due to their high sensitivity to detection and matching protocols.
402
+
403
+
404
+ ### 2. Vision-Language Parsing
405
+ Visual languages (e.g., charts, graphics, chemical formulas, logos) encapsulate dense human knowledge. **dots.ocr-1.5** unifies the interpretation of these elements by parsing them directly into **SVG code**.
406
+
407
+ <table>
408
+ <thead>
409
+ <tr>
410
+ <th rowspan="2" style="text-align: left;">Methods</th>
411
+ <th colspan="3">Unisvg</th>
412
+ <th rowspan="2">Chartmimic</th>
413
+ <th rowspan="2">Design2Code</th>
414
+ <th rowspan="2">Genexam</th>
415
+ <th rowspan="2">SciGen</th>
416
+ <th rowspan="2">ChemDraw</th>
417
+ </tr>
418
+ <tr>
419
+ <th>Low-Level</th>
420
+ <th>High-Level</th>
421
+ <th>Score</th>
422
+ </tr>
423
+ </thead>
424
+ <tbody>
425
+ <tr>
426
+ <td style="text-align: left;">OCRVerse</td>
427
+ <td>0.632</td>
428
+ <td>0.852</td>
429
+ <td>0.763</td>
430
+ <td>0.799</td>
431
+ <td>-</td>
432
+ <td>-</td>
433
+ <td>-</td>
434
+ <td>0.881</td>
435
+ </tr>
436
+ <tr>
437
+ <td style="text-align: left;">Gemini 3 Pro</td>
438
+ <td>0.563</td>
439
+ <td>0.850</td>
440
+ <td>0.735</td>
441
+ <td>0.788</td>
442
+ <td>0.760</td>
443
+ <td>0.756</td>
444
+ <td>0.783</td>
445
+ <td>0.839</td>
446
+ </tr>
447
+ <tr>
448
+ <td style="text-align: left;">dots.ocr-1.5</td>
449
+ <td>0.850</td>
450
+ <td>0.923</td>
451
+ <td>0.894</td>
452
+ <td>0.772</td>
453
+ <td>0.801</td>
454
+ <td>0.664</td>
455
+ <td>0.660</td>
456
+ <td>0.790</td>
457
+ </tr>
458
+ <tr>
459
+ <td style="text-align: left;"><strong>dots.ocr-1.5-svg</strong></td>
460
+ <td><strong>0.860</strong></td>
461
+ <td><strong>0.931</strong></td>
462
+ <td><strong>0.902</strong></td>
463
+ <td><strong>0.905</strong></td>
464
+ <td><strong>0.834</strong></td>
465
+ <td><strong>0.8</strong></td>
466
+ <td><strong>0.797</strong></td>
467
+ <td><strong>0.901</strong></td>
468
+ </tr>
469
+ </tbody>
470
+ </table>
471
+
472
+
473
+ > **Note:**
474
+ > - We use the ISVGEN metric from [UniSVG](https://ryanlijinke.github.io/) to evaluate the parsing result. For benchmarks that do not natively support image parsing, we use the original images as input, and calculate the ISVGEN score between the rendered output and the original image.
475
+ > - [OCRVerse](https://github.com/DocTron-hub/OCRVerse) results are derived from various code formats (e.g., SVG, Python), whereas results for Gemini 3 Pro and dots.ocr-1.5 are based specifically on SVG code.
476
+ > - Due to the capacity constraints of a 3B-parameter VLM, dots.ocr-1.5 may not excel in all tasks yet like svg. To complement this, we are simultaneously releasing dots.ocr-1.5-svg. We plan to further address these limitations in future updates.
477
+
478
+
479
+ ### 3. General Vision Tasks
480
+
481
+ <table>
482
+ <thead>
483
+ <tr>
484
+ <th>Model</th>
485
+ <th>CharXiv_descriptive</th>
486
+ <th>CharXiv_reasoning</th>
487
+ <th>OCR_Reasoning</th>
488
+ <th>infovqa</th>
489
+ <th>docvqa</th>
490
+ <th>ChartQA</th>
491
+ <th>OCRBench</th>
492
+ <th>AI2D</th>
493
+ <th>CountBenchQA</th>
494
+ <th>refcoco</th>
495
+ </tr>
496
+ </thead>
497
+ <tbody>
498
+ <tr>
499
+ <td>Qwen3vl-2b-instruct</td>
500
+ <td>62.3</td>
501
+ <td>26.8</td>
502
+ <td>-</td>
503
+ <td>72.4</td>
504
+ <td>93.3</td>
505
+ <td>-</td>
506
+ <td>85.8</td>
507
+ <td>76.9</td>
508
+ <td>88.4</td>
509
+ <td>-</td>
510
+ </tr>
511
+ <tr>
512
+ <td><strong>dots.ocr-1.5</strong></td>
513
+ <td>77.4</td>
514
+ <td>55.3</td>
515
+ <td>22.85</td>
516
+ <td>73.76</td>
517
+ <td>91.85</td>
518
+ <td>83.2</td>
519
+ <td>86.0</td>
520
+ <td>82.16</td>
521
+ <td>94.46</td>
522
+ <td>80.03</td>
523
+ </tr>
524
+ </tbody>
525
+ </table>
526
+
527
+
528
+
529
+ # Quick Start
530
+ ## 1. Installation
531
+ ### Install dots.ocr-1.5
532
+ ```shell
533
+ conda create -n dots_ocr python=3.12
534
+ conda activate dots_ocr
535
+
536
+ git clone https://github.com/rednote-hilab/dots.ocr.git
537
+ cd dots.ocr
538
+
539
+ # Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
540
+ pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
541
+ pip install -e .
542
+ ```
543
+
544
+ If you have trouble with the installation, try our [Docker Image](https://hub.docker.com/r/rednotehilab/dots.ocr) for an easier setup, and follow these steps:
545
+ ```shell
546
+ git clone https://github.com/rednote-hilab/dots.ocr.git
547
+ cd dots.ocr
548
+ pip install -e .
549
+ ```
550
+
551
+
552
+ ### Download Model Weights
553
+ > 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR_1_5` instead of `dots.ocr-1.5`) for the model save path. This is a temporary workaround pending our integration with Transformers.
554
+ ```shell
555
+ python3 tools/download_model.py
556
+ ```
557
+
558
+
559
+ ## 2. Deployment
560
+ ### vLLM inference
561
+ We highly recommend using vllm for deployment and inference.
562
+
563
+ ```shell
564
+ # launch vllm server
565
+ ## dots.ocr-1.5
566
+ CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.ocr-1.5 --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code
567
+
568
+ ## dots.ocr-1.5-svg
569
+ CUDA_VISIBLE_DEVICES=0 vllm serve rednote-hilab/dots.ocr-1.5-svg --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --chat-template-content-format string --served-model-name model --trust-remote-code
570
+
571
+ # vllm api demo
572
+ ## document parsing
573
+ python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
574
+ ## web parsing
575
+ python3 ./demo/demo_vllm.py --prompt_mode prompt_web_parsing --image_path ./assets/showcase_dots_ocr_1_5/origin/webpage_1.png
576
+ ## scene spoting
577
+ python3 ./demo/demo_vllm.py --prompt_mode prompt_scene_spotting --image_path ./assets/showcase_dots_ocr_1_5/origin/scene_1.jpg
578
+ ## image parsing with svg code
579
+ python3 ./demo/demo_vllm_svg.py --prompt_mode prompt_image_to_svg
580
+ ## general qa
581
+ python3 ./demo/demo_vllm_general.py
582
+ ```
583
+
584
+ ### Hugginface inference
585
+ ```shell
586
+ python3 demo/demo_hf.py
587
+ ```
588
+
589
+ <details>
590
+ <summary><b>Hugginface inference details</b></summary>
591
+
592
+ ```python
593
+ import torch
594
+ from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
595
+ from qwen_vl_utils import process_vision_info
596
+ from dots_ocr.utils import dict_promptmode_to_prompt
597
+
598
+ model_path = "./weights/DotsOCR_1_5"
599
+ model = AutoModelForCausalLM.from_pretrained(
600
+ model_path,
601
+ attn_implementation="flash_attention_2",
602
+ torch_dtype=torch.bfloat16,
603
+ device_map="auto",
604
+ trust_remote_code=True
605
+ )
606
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
607
+
608
+ image_path = "demo/demo_image1.jpg"
609
+ prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
610
+
611
+ 1. Bbox format: [x1, y1, x2, y2]
612
+
613
+ 2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].
614
+
615
+ 3. Text Extraction & Formatting Rules:
616
+ - Picture: For the 'Picture' category, the text field should be omitted.
617
+ - Formula: Format its text as LaTeX.
618
+ - Table: Format its text as HTML.
619
+ - All Others (Text, Title, etc.): Format their text as Markdown.
620
+
621
+ 4. Constraints:
622
+ - The output text must be the original text from the image, with no translation.
623
+ - All layout elements must be sorted according to human reading order.
624
+
625
+ 5. Final Output: The entire output must be a single JSON object.
626
+ """
627
+
628
+ messages = [
629
+ {
630
+ "role": "user",
631
+ "content": [
632
+ {
633
+ "type": "image",
634
+ "image": image_path
635
+ },
636
+ {"type": "text", "text": prompt}
637
+ ]
638
+ }
639
+ ]
640
+
641
+ # Preparation for inference
642
+ text = processor.apply_chat_template(
643
+ messages,
644
+ tokenize=False,
645
+ add_generation_prompt=True
646
+ )
647
+ image_inputs, video_inputs = process_vision_info(messages)
648
+ inputs = processor(
649
+ text=[text],
650
+ images=image_inputs,
651
+ videos=video_inputs,
652
+ padding=True,
653
+ return_tensors="pt",
654
+ )
655
+
656
+ inputs = inputs.to("cuda")
657
+
658
+ # Inference: Generation of the output
659
+ generated_ids = model.generate(**inputs, max_new_tokens=24000)
660
+ generated_ids_trimmed = [
661
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
662
+ ]
663
+ output_text = processor.batch_decode(
664
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
665
+ )
666
+ print(output_text)
667
+
668
+ ```
669
+
670
+ </details>
671
+
672
+ ## 3. Document Parse
673
+ **Based on vLLM server**, you can parse an image or a pdf file using the following commands:
674
+ ```bash
675
+
676
+ # Parse all layout info, both detection and recognition
677
+ # Parse a single image
678
+ python3 dots_ocr/parser.py demo/demo_image1.jpg
679
+ # Parse a single PDF
680
+ python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_thread 64 # try bigger num_threads for pdf with a large number of pages
681
+
682
+ # Layout detection only
683
+ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en
684
+
685
+ # Parse text only, except Page-header and Page-footer
686
+ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr
687
+
688
+
689
+ ```
690
+
691
+ <details>
692
+ <summary><b>Output Results</b></summary>
693
+
694
+ 1. **Structured Layout Data** (`demo_image1.json`): A JSON file containing the detected layout elements, including their bounding boxes, categories, and extracted text.
695
+ 2. **Processed Markdown File** (`demo_image1.md`): A Markdown file generated from the concatenated text of all detected cells.
696
+ * An additional version, `demo_image1_nohf.md`, is also provided, which excludes page headers and footers for compatibility with benchmarks like Omnidocbench and olmOCR-bench.
697
+ 3. **Layout Visualization** (`demo_image1.jpg`): The original image with the detected layout bounding boxes drawn on it.
698
+
699
+ </details>
700
+
701
+ ## 4. Demo
702
+ Have fun with the [live demo](https://dotsocr.xiaohongshu.com/).
703
+
704
+
705
+ ### Examples for document parsing
706
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula1.png" alt="formula1.png" border="0" />
707
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table3.png" alt="table3.png" border="0" />
708
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/Tibetan.png" alt="Tibetan.png" border="0" />
709
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/tradition_zh.png" alt="tradition_zh.png" border="0" />
710
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/nl.png" alt="nl.png" border="0" />
711
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/kannada.png" alt="kannada.png" border="0" />
712
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/russian.png" alt="russian.png" border="0" />
713
+
714
+
715
+ ### Examples for image parsing
716
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_1.png" alt="svg_1.png" border="0" />
717
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_2.png" alt="svg_2.png" border="0" />
718
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_4.png" alt="svg_4.png" border="0" />
719
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_5.png" alt="svg_5.png" border="0" />
720
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/svg_6.png" alt="svg_6.png" border="0" />
721
+
722
+ > **Note:**
723
+ > - Inferenced by dots.ocr-1.5-svg
724
+
725
+ ### Example for web parsing
726
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/webpage_1.png" alt="webpage_1.png" border="0" />
727
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/webpage_2.png" alt="webpage_2.png" border="0" />
728
+
729
+ ### Examples for scene spotting
730
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/scene_1.png" alt="scene_1.png" border="0" />
731
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase_dots_ocr_1_5/result/scene_2.png" alt="scene_2.png" border="0" />
732
+
733
+
734
+ ## Limitation & Future Work
735
+
736
+ - **Complex Document Elements:**
737
+ - **Table&Formula**: The extraction of complex tables and mathematical formulas persists as a difficult task given the model's compact architecture.
738
+ - **Picture**: We have adopted an SVG code representation for parsing structured graphics; however, the performance has yet to achieve the desired level of robustness.
739
+
740
  - **Parsing Failures:** While we have reduced the rate of parsing failures compared to the previous version, these issues may still occur occasionally. We remain committed to further resolving these edge cases in future updates.