---
license: other
license_name: dots-ocr-license
license_link: https://huggingface.co/davanstrien/dots.ocr-1.5/blob/main/dots.ocr-1.5%20LICENSE%20AGREEMENT
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- image-to-text
- ocr
- document-parse
- layout
- table
- formula
- custom_code
language:
- en
- zh
- multilingual
---

> **Unofficial mirror.** This is a copy of [dots.ocr-1.5 from ModelScope](https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5), uploaded to Hugging Face for easier access. All credit goes to the original authors at **rednote-hilab (Xiaohongshu)**. The original v1 model is at [rednote-hilab/dots.ocr](https://huggingface.co/rednote-hilab/dots.ocr) on HF. If the authors publish an official HF release of v1.5, please use that instead.
>
> Source: [ModelScope](https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5) | [GitHub](https://github.com/rednote-hilab/dots.ocr)

# dots.ocr-1.5: Recognize Any Human Scripts and Symbols

A **3B-parameter** multimodal OCR model (1.2B vision encoder + 1.7B language model) from rednote-hilab. Designed for universal accessibility, it can recognize virtually any human script and achieves SOTA performance in multilingual document parsing among models of comparable size.

## Key Capabilities

1. **Multilingual Document Parsing** — SOTA on standard benchmarks among specialized OCR models, particularly strong on multilingual documents
2. **Structured Graphics to SVG** — Converts charts, diagrams, chemical formulas, and logos directly into SVG code
3. **Web Screen Parsing & Scene Text Spotting** — Handles web screenshots and scene text
4. **Object Grounding & Counting** — General vision tasks beyond pure OCR
5. **General OCR & Visual QA** — DocVQA 91.85, ChartQA 83.2, OCRBench 86.0

## Quick Start with UV Scripts

Process any HF dataset with a single command using [uv-scripts/ocr](https://huggingface.co/datasets/uv-scripts/ocr):

```bash
# Basic OCR
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
    https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr-1.5.py \
    your-input-dataset your-output-dataset \
    --model davanstrien/dots.ocr-1.5

# Layout analysis with bounding boxes
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
    https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr-1.5.py \
    your-input-dataset your-output-dataset \
    --model davanstrien/dots.ocr-1.5 \
    --prompt-mode layout-all
```

## Benchmarks

### Document Parsing (Elo Score)

| Model | olmOCR-Bench | OmniDocBench v1.5 | XDocParse |
|-------|:---:|:---:|:---:|
| GLM-OCR | 859.9 | 937.5 | 742.1 |
| PaddleOCR-VL-1.5 | 873.6 | 965.6 | 797.6 |
| HunyuanOCR | 978.9 | 974.4 | 895.9 |
| dots.ocr | 1027.4 | 994.7 | 1133.4 |
| **dots.ocr-1.5** | **1089.0** | **1025.8** | **1157.1** |
| Gemini 3 Pro | 1171.2 | 1102.1 | 1273.9 |

### olmOCR-bench (detailed)

| Model | ArXiv | Old scans math | Tables | Overall |
|-------|:---:|:---:|:---:|:---:|
| olmOCR v0.4.0 | 83.0 | 82.3 | 84.9 | 82.4±1.1 |
| Chandra OCR 0.1.0 | 82.2 | 80.3 | 88.0 | 83.1±0.9 |
| **dots.ocr-1.5** | **85.9** | **85.5** | **90.7** | **83.9±0.9** |

### General Vision Tasks

| DocVQA | ChartQA | OCRBench | AI2D | CharXiv Descriptive | RefCOCO |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 91.85 | 83.2 | 86.0 | 82.16 | 77.4 | 80.03 |

## Usage

### vLLM (recommended)

**Important:** When using `llm.chat()`, you must pass `chat_template_content_format="string"`. The model's tokenizer chat template expects string content, not OpenAI-format lists. Without this, the model produces empty output.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="davanstrien/dots.ocr-1.5",
    trust_remote_code=True,
    max_model_len=24000,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=24000)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text content from this image."},
    ],
}]

outputs = llm.chat(
    [messages],
    sampling_params,
    chat_template_content_format="string",  # Required!
)
print(outputs[0].outputs[0].text)
```
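The `url` value in the example above is elided. As a minimal sketch, assuming the image is a local file, a data URL of that shape can be built with the standard library. The helper name `image_to_data_url` is hypothetical and is not part of vLLM or dots.ocr.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension; fall back to PNG if unknown.
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "image/png"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can then be used as the `url` field of the `image_url` entry in `messages`.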

### vLLM Server

```bash
vllm serve davanstrien/dots.ocr-1.5 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --chat-template-content-format string \
    --trust-remote-code
```
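Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below assumes the default address `http://localhost:8000/v1` and reuses the sampling values from the offline example. The function names `build_chat_request` and `send_chat_request` are hypothetical, and the message layout mirrors the `llm.chat()` example rather than anything confirmed by the original authors.

```python
import json
import urllib.request


def build_chat_request(image_url: str, prompt: str,
                       model: str = "davanstrien/dots.ocr-1.5") -> dict:
    """Build an OpenAI-style chat payload with one image and one text prompt."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        # Sampling values copied from the offline vLLM example (an assumption
        # that they are also sensible for the server).
        "temperature": 0.1,
        "top_p": 0.9,
    }


def send_chat_request(payload: dict,
                      base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Keeping payload construction separate from the network call makes the request easy to inspect or log before it is sent.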

### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

model = AutoModelForCausalLM.from_pretrained(
    "davanstrien/dots.ocr-1.5",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("davanstrien/dots.ocr-1.5", trust_remote_code=True)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "document.jpg"},
        {"type": "text", "text": "Extract the text content from this image."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)  # follow the device chosen by device_map="auto"

generated_ids = model.generate(**inputs, max_new_tokens=24000)
output = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]
print(output)
```

## Prompt Modes

| Mode | Description | Output |
|------|-------------|--------|
| `ocr` | Text extraction (default) | Markdown |
| `layout-all` | Layout + bboxes + categories + text | JSON |
| `layout-only` | Layout + bboxes + categories (no text) | JSON |
| `web-parsing` | Webpage layout analysis | JSON |
| `scene-spotting` | Scene text detection | Text |
| `grounding-ocr` | Text from bounding box region | Text |
| `general` | Free-form (custom prompt) | Varies |

### Bbox Coordinate System (layout modes)

Bounding boxes are in the **resized image coordinate space**, not original image coordinates. The model uses `Qwen2VLImageProcessor`, which resizes images so that `width × height ≤ 11,289,600` pixels and rounds each dimension to a multiple of 28 (patch size 14 × spatial merge size 2).

To map bboxes back to original coordinates:

```python
import math

def smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=11289600):
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

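# orig_h and orig_w are the original image's height and width in pixels
# (for example, `orig_w, orig_h = image.size` for a PIL image).
resized_h, resized_w = smart_resize(orig_h, orig_w)
scale_x, scale_y = orig_w / resized_w, orig_h / resized_h
# orig_x = bbox_x * scale_x, orig_y = bbox_y * scale_y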
```
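The scaling step can be wrapped in a small helper. This is a sketch under two assumptions: that each bbox is a list of `[x1, y1, x2, y2]` pixel values, and that the resized dimensions come from `smart_resize` as in the snippet above. The name `bbox_to_original` is hypothetical.

```python
def bbox_to_original(bbox, orig_size, resized_size):
    """Map an [x1, y1, x2, y2] box from resized-image space to original pixels.

    orig_size and resized_size are (width, height) tuples.
    The [x1, y1, x2, y2] layout is an assumption about the model output.
    """
    orig_w, orig_h = orig_size
    resized_w, resized_h = resized_size
    scale_x = orig_w / resized_w
    scale_y = orig_h / resized_h
    x1, y1, x2, y2 = bbox
    # Round to whole pixels and clamp so boxes never leave the original image.
    return [
        min(max(round(x1 * scale_x), 0), orig_w),
        min(max(round(y1 * scale_y), 0), orig_h),
        min(max(round(x2 * scale_x), 0), orig_w),
        min(max(round(y2 * scale_y), 0), orig_h),
    ]


# Example: a 2000x1000 image is resized to 1988x1008 (both multiples of 28).
box = bbox_to_original([994, 504, 1988, 1008], (2000, 1000), (1988, 1008))
```

Clamping is a defensive choice for boxes that land slightly outside the image after rounding; drop it if you need the raw scaled values.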

## Model Details

- **Architecture:** DotsOCRForCausalLM (custom code, `trust_remote_code=True` required)
- **Parameters:** 3B total (1.2B vision encoder, 1.7B language model)
- **Precision:** BF16
- **Max context:** 131,072 tokens
- **Vision:** Patch size 14, spatial merge size 2, flash_attention_2
- **Languages:** English, Chinese (simplified + traditional), multilingual (Tibetan, Kannada, Russian, Dutch, and more)

## Limitations

- Complex table and formula extraction remains challenging for the compact 3B architecture
- SVG parsing for pictures needs further robustness improvements
- Occasional parsing failures on edge cases

## License

This model is released under the [dots.ocr License Agreement](dots.ocr-1.5%20LICENSE%20AGREEMENT), which is based on the MIT License with supplementary terms covering responsible use, attribution, and data governance. Per the license: *"If Licensee distributes modified weights or fine-tuned models based on the Model Materials, Licensee must prominently display the following statement: 'Built with dots.ocr.'"*

## Citation

```bibtex
@misc{dots_ocr_1_5,
  title={dots.ocr-1.5: Recognize Any Human Scripts and Symbols},
  author={rednote-hilab},
  year={2025},
  url={https://github.com/rednote-hilab/dots.ocr}
}
```

Built with dots.ocr.