davanstrien HF Staff committed on
Commit 282151e · verified · 1 Parent(s): 826373f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +232 -0
README.md ADDED
@@ -0,0 +1,232 @@
1
+ ---
2
+ license: other
3
+ license_name: dots-ocr-license
4
+ license_link: https://huggingface.co/davanstrien/dots.ocr-1.5/blob/main/dots.ocr-1.5%20LICENSE%20AGREEMENT
5
+ library_name: transformers
6
+ pipeline_tag: image-text-to-text
7
+ tags:
8
+ - image-to-text
9
+ - ocr
10
+ - document-parse
11
+ - layout
12
+ - table
13
+ - formula
14
+ - custom_code
15
+ language:
16
+ - en
17
+ - zh
18
+ - multilingual
19
+ ---
20
+
21
+ > **Unofficial mirror.** This is a copy of [dots.ocr-1.5 from ModelScope](https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5), uploaded to Hugging Face for easier access. All credit goes to the original authors at **rednote-hilab (Xiaohongshu)**. The original v1 model is at [rednote-hilab/dots.ocr](https://huggingface.co/rednote-hilab/dots.ocr) on HF. If the authors publish an official HF release of v1.5, please use that instead.
22
+ >
23
+ > Source: [ModelScope](https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5) | [GitHub](https://github.com/rednote-hilab/dots.ocr)
24
+
25
+ # dots.ocr-1.5: Recognize Any Human Scripts and Symbols
26
+
27
+ A **3B-parameter** multimodal OCR model (1.2B vision encoder + 1.7B language model) from rednote-hilab. Designed for universal accessibility, it can recognize virtually any human script and achieves SOTA performance in multilingual document parsing among models of comparable size.
28
+
29
+ ## Key Capabilities
30
+
31
+ 1. **Multilingual Document Parsing** — SOTA on standard benchmarks among specialized OCR models, particularly strong on multilingual documents
32
+ 2. **Structured Graphics to SVG** — Converts charts, diagrams, chemical formulas, and logos directly into SVG code
33
+ 3. **Web Screen Parsing & Scene Text Spotting** — Handles web screenshots and scene text
34
+ 4. **Object Grounding & Counting** — General vision tasks beyond pure OCR
35
+ 5. **General OCR & Visual QA** — DocVQA 91.85, ChartQA 83.2, OCRBench 86.0
36
+
37
+ ## Quick Start with UV Scripts
38
+
39
+ Process any HF dataset with a single command using [uv-scripts/ocr](https://huggingface.co/datasets/uv-scripts/ocr):
40
+
+ ```bash
+ # Basic OCR
+ hf jobs uv run --flavor l4x1 -s HF_TOKEN \
+     https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr-1.5.py \
+     your-input-dataset your-output-dataset \
+     --model davanstrien/dots.ocr-1.5
+
+ # Layout analysis with bounding boxes
+ hf jobs uv run --flavor l4x1 -s HF_TOKEN \
+     https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr-1.5.py \
+     your-input-dataset your-output-dataset \
+     --model davanstrien/dots.ocr-1.5 \
+     --prompt-mode layout-all
+ ```
55
+
56
+ ## Benchmarks
57
+
58
+ ### Document Parsing (Elo Score)
59
+
60
+ | Model | olmOCR-Bench | OmniDocBench v1.5 | XDocParse |
61
+ |-------|:---:|:---:|:---:|
62
+ | GLM-OCR | 859.9 | 937.5 | 742.1 |
63
+ | PaddleOCR-VL-1.5 | 873.6 | 965.6 | 797.6 |
64
+ | HunyuanOCR | 978.9 | 974.4 | 895.9 |
65
+ | dots.ocr | 1027.4 | 994.7 | 1133.4 |
66
+ | **dots.ocr-1.5** | **1089.0** | **1025.8** | **1157.1** |
67
+ | Gemini 3 Pro | 1171.2 | 1102.1 | 1273.9 |
68
+
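To help read the Elo gaps in the table above, the conventional Elo expected-score formula can be applied. This is a minimal sketch that assumes the scores follow the standard Elo scale (base 10, divisor 400). The source does not state how the ratings were computed, so the result is illustrative only. The function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (assumed scale: 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# olmOCR-Bench column from the table: dots.ocr-1.5 vs. dots.ocr
p = expected_win_rate(1089.0, 1027.4)
print(f"Expected pairwise win rate under the Elo assumption: {p:.3f}")
```

Under that assumption, a gap of roughly 60 points corresponds to winning somewhat more than half of pairwise comparisons, not to a large absolute accuracy difference.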
69
+ ### olmOCR-bench (detailed)
70
+
71
+ | Model | ArXiv | Old scans math | Tables | Overall |
72
+ |-------|:---:|:---:|:---:|:---:|
73
+ | olmOCR v0.4.0 | 83.0 | 82.3 | 84.9 | 82.4±1.1 |
74
+ | Chandra OCR 0.1.0 | 82.2 | 80.3 | 88.0 | 83.1±0.9 |
75
+ | **dots.ocr-1.5** | **85.9** | **85.5** | **90.7** | **83.9±0.9** |
76
+
77
+ ### General Vision Tasks
78
+
79
+ | DocVQA | ChartQA | OCRBench | AI2D | CharXiv Descriptive | RefCOCO |
80
+ |:---:|:---:|:---:|:---:|:---:|:---:|
81
+ | 91.85 | 83.2 | 86.0 | 82.16 | 77.4 | 80.03 |
82
+
83
+ ## Usage
84
+
85
+ ### vLLM (recommended)
86
+
87
+ **Important:** When using `llm.chat()`, you must pass `chat_template_content_format="string"`. The model's tokenizer chat template expects string content, not OpenAI-format lists. Without this, the model produces empty output.
88
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="davanstrien/dots.ocr-1.5",
+     trust_remote_code=True,
+     max_model_len=24000,
+     gpu_memory_utilization=0.9,
+ )
+
+ sampling_params = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=24000)
+
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
+         {"type": "text", "text": "Extract the text content from this image."},
+     ],
+ }]
+
+ outputs = llm.chat(
+     [messages],
+     sampling_params,
+     chat_template_content_format="string",  # Required!
+ )
+ print(outputs[0].outputs[0].text)
+ ```
116
+
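The `data:image/png;base64,...` URL in the example above is elided. The following is a minimal sketch of one way to build such a URL from raw image bytes with the standard library only. The helper name `to_data_url` and the file name `page.png` are hypothetical and not part of vLLM or dots.ocr.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str = "page.png") -> str:
    """Encode raw image bytes as a data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension; fall back to PNG.
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        mime = "image/png"
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{payload}"


# Demo with the 8-byte PNG signature; in practice, read the bytes from a file:
#   with open("page.png", "rb") as f:
#       url = to_data_url(f.read(), "page.png")
url = to_data_url(b"\x89PNG\r\n\x1a\n", "page.png")
print(url)
```

The resulting string can be placed in the `"url"` field of the `image_url` entry in `messages`.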
117
+ ### vLLM Server
118
+
+ ```bash
+ vllm serve davanstrien/dots.ocr-1.5 \
+     --tensor-parallel-size 1 \
+     --gpu-memory-utilization 0.9 \
+     --chat-template-content-format string \
+     --trust-remote-code
+ ```
126
+
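Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the standard library. It assumes the server listens on `http://localhost:8000` (vLLM's default port) and reuses the sampling values from the offline example. The helper names `build_chat_request` and `send_chat_request` are hypothetical.

```python
import json
import urllib.request

# Assumption: local server on vLLM's default port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(image_url: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completions body with one image and one prompt."""
    return {
        "model": "davanstrien/dots.ocr-1.5",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.1,
        "top_p": 0.9,
    }


def send_chat_request(body: dict, base_url: str = BASE_URL) -> str:
    """POST the body to the server and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


body = build_chat_request(
    "data:image/png;base64,...",  # elided, as in the offline example
    "Extract the text content from this image.",
)
# print(send_chat_request(body))  # uncomment when a server is running
```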
127
+ ### Transformers
128
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "davanstrien/dots.ocr-1.5",
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained("davanstrien/dots.ocr-1.5", trust_remote_code=True)
+
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "image": "document.jpg"},
+         {"type": "text", "text": "Extract the text content from this image."},
+     ],
+ }]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
+
+ generated_ids = model.generate(**inputs, max_new_tokens=24000)
+ output = processor.batch_decode(
+     [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
+     skip_special_tokens=True,
+ )[0]
+ print(output)
+ ```
162
+
163
+ ## Prompt Modes
164
+
165
+ | Mode | Description | Output |
166
+ |------|-------------|--------|
167
+ | `ocr` | Text extraction (default) | Markdown |
168
+ | `layout-all` | Layout + bboxes + categories + text | JSON |
169
+ | `layout-only` | Layout + bboxes + categories (no text) | JSON |
170
+ | `web-parsing` | Webpage layout analysis | JSON |
171
+ | `scene-spotting` | Scene text detection | Text |
172
+ | `grounding-ocr` | Text from bounding box region | Text |
173
+ | `general` | Free-form (custom prompt) | Varies |
174
+
175
+ ### Bbox Coordinate System (layout modes)
176
+
177
+ Bounding boxes are in the **resized image coordinate space**, not original image coordinates. The model uses `Qwen2VLImageProcessor` which resizes images so that `width × height ≤ 11,289,600` pixels, with dimensions rounded to multiples of 28.
178
+
179
+ To map bboxes back to original coordinates:
180
+
+ ```python
+ import math
+
+ def smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=11289600):
+     """Return the (height, width) an image is resized to before inference."""
+     # Round each side to the nearest multiple of `factor` (at least one factor).
+     h_bar = max(factor, round(height / factor) * factor)
+     w_bar = max(factor, round(width / factor) * factor)
+     if h_bar * w_bar > max_pixels:
+         # Too large: shrink both sides by the same ratio, rounding down.
+         beta = math.sqrt((height * width) / max_pixels)
+         h_bar = math.floor(height / beta / factor) * factor
+         w_bar = math.floor(width / beta / factor) * factor
+     elif h_bar * w_bar < min_pixels:
+         # Too small: enlarge both sides by the same ratio, rounding up.
+         beta = math.sqrt(min_pixels / (height * width))
+         h_bar = math.ceil(height * beta / factor) * factor
+         w_bar = math.ceil(width * beta / factor) * factor
+     return h_bar, w_bar
+
+ # orig_h, orig_w: height and width of the original image, in pixels
+ resized_h, resized_w = smart_resize(orig_h, orig_w)
+ scale_x, scale_y = orig_w / resized_w, orig_h / resized_h
+ # orig_x = bbox_x * scale_x, orig_y = bbox_y * scale_y
+ ```
201
+
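As a worked example of the mapping above, the sketch below wraps the scaling step in a helper and applies it to a 5000 × 4000 image. It assumes each bbox is given as `[x1, y1, x2, y2]` in resized-image pixels, which the text above does not spell out. The helper name `bbox_to_original` and the sample image size are hypothetical.

```python
import math


def smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=11289600):
    # Same resize rule as in the README snippet above.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def bbox_to_original(bbox, orig_w, orig_h):
    """Map an [x1, y1, x2, y2] box (assumed format) back to original pixels."""
    resized_h, resized_w = smart_resize(orig_h, orig_w)
    scale_x, scale_y = orig_w / resized_w, orig_h / resized_h
    x1, y1, x2, y2 = bbox
    return [round(x1 * scale_x), round(y1 * scale_y),
            round(x2 * scale_x), round(y2 * scale_y)]


# A 5000 x 4000 (width x height) scan exceeds the pixel budget and is shrunk.
resized_h, resized_w = smart_resize(4000, 5000)
print(resized_h, resized_w)

# A box covering the whole resized image maps back to the full original image.
print(bbox_to_original([0, 0, resized_w, resized_h], orig_w=5000, orig_h=4000))
```

Because the two sides are rounded to multiples of 28 independently, `scale_x` and `scale_y` generally differ slightly, so each axis should be scaled with its own factor.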
202
+ ## Model Details
203
+
204
+ - **Architecture:** DotsOCRForCausalLM (custom code, `trust_remote_code=True` required)
205
+ - **Parameters:** 3B total (1.2B vision encoder, 1.7B language model)
206
+ - **Precision:** BF16
207
+ - **Max context:** 131,072 tokens
208
+ - **Vision:** Patch size 14, spatial merge size 2, flash_attention_2
209
+ - **Languages:** English, Chinese (simplified + traditional), multilingual (Tibetan, Kannada, Russian, Dutch, and more)
210
+
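The vision settings above connect to the resize factor of 28 used for bounding boxes: patch size 14 times spatial merge size 2. The sketch below estimates how many visual tokens an image contributes. It assumes one token per merged 28 × 28 block, as in Qwen2-VL-style processors, and ignores any special tokens around the image. The function name `visual_tokens` is hypothetical.

```python
PATCH_SIZE = 14
SPATIAL_MERGE_SIZE = 2
FACTOR = PATCH_SIZE * SPATIAL_MERGE_SIZE  # 28, the resize multiple
MAX_PIXELS = 11_289_600                   # pixel budget quoted in this README


def visual_tokens(resized_h: int, resized_w: int) -> int:
    """Estimate visual tokens for an already-resized image (assumed rule)."""
    if resized_h % FACTOR or resized_w % FACTOR:
        raise ValueError("resized dimensions must be multiples of 28")
    return (resized_h // FACTOR) * (resized_w // FACTOR)


# Upper bound on visual tokens for a single image at the pixel budget.
max_visual_tokens = MAX_PIXELS // (FACTOR * FACTOR)
print(max_visual_tokens)

# Example: an image resized to 2996 x 3752 (height x width).
print(visual_tokens(2996, 3752))
```

Under this assumption, a single full-budget image can take up a large share of a 24,000-token `max_model_len`, which is worth keeping in mind when choosing `max_tokens` for long documents.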
211
+ ## Limitations
212
+
213
+ - Complex table and formula extraction remains challenging for the compact 3B architecture
214
+ - SVG parsing for pictures needs further robustness improvements
215
+ - Occasional parsing failures on edge cases
216
+
217
+ ## License
218
+
219
+ This model is released under the [dots.ocr License Agreement](dots.ocr-1.5%20LICENSE%20AGREEMENT), which is based on the MIT License with supplementary terms covering responsible use, attribution, and data governance. Per the license: *"If Licensee distributes modified weights or fine-tuned models based on the Model Materials, Licensee must prominently display the following statement: 'Built with dots.ocr.'"*
220
+
221
+ ## Citation
222
+
+ ```bibtex
+ @misc{dots_ocr_1_5,
+   title={dots.ocr-1.5: Recognize Any Human Scripts and Symbols},
+   author={rednote-hilab},
+   year={2025},
+   url={https://github.com/rednote-hilab/dots.ocr}
+ }
+ ```
231
+
232
+ Built with dots.ocr.