dots.ocr-1.5 / README.md

metadata

license: other
license_name: dots-ocr-license
license_link: >-
  https://huggingface.co/davanstrien/dots.ocr-1.5/blob/main/dots.ocr-1.5%20LICENSE%20AGREEMENT
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - image-to-text
  - ocr
  - document-parse
  - layout
  - table
  - formula
  - custom_code
language:
  - en
  - zh
  - multilingual

Unofficial mirror. This is a copy of dots.ocr-1.5 from ModelScope, uploaded to Hugging Face for easier access. All credit goes to the original authors at rednote-hilab (Xiaohongshu). The original v1 model is at rednote-hilab/dots.ocr on HF. If the authors publish an official HF release of v1.5, please use that instead.

Source: ModelScope | GitHub

dots.ocr-1.5: Recognize Any Human Scripts and Symbols

A 3B-parameter multimodal OCR model (1.2B vision encoder + 1.7B language model) from rednote-hilab. Designed for universal accessibility, it can recognize virtually any human script and reports state-of-the-art multilingual document parsing among models of comparable size (see Benchmarks).

Key Capabilities

Multilingual Document Parsing — SOTA on standard benchmarks among specialized OCR models, particularly strong on multilingual documents
Structured Graphics to SVG — Converts charts, diagrams, chemical formulas, and logos directly into SVG code
Web Screen Parsing & Scene Text Spotting — Handles web screenshots and scene text
Object Grounding & Counting — General vision tasks beyond pure OCR
General OCR & Visual QA — DocVQA 91.85, ChartQA 83.2, OCRBench 86.0

Quick Start with UV Scripts

Process any HF dataset with a single command using uv-scripts/ocr:

# Basic OCR
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
    https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr-1.5.py \
    your-input-dataset your-output-dataset \
    --model davanstrien/dots.ocr-1.5

# Layout analysis with bounding boxes
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
    https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr-1.5.py \
    your-input-dataset your-output-dataset \
    --model davanstrien/dots.ocr-1.5 \
    --prompt-mode layout-all

Benchmarks

Document Parsing (Elo Score)

Model	olmOCR-Bench	OmniDocBench v1.5	XDocParse
GLM-OCR	859.9	937.5	742.1
PaddleOCR-VL-1.5	873.6	965.6	797.6
HunyuanOCR	978.9	974.4	895.9
dots.ocr	1027.4	994.7	1133.4
dots.ocr-1.5	1089.0	1025.8	1157.1
Gemini 3 Pro	1171.2	1102.1	1273.9

olmOCR-bench (detailed)

Model	ArXiv	Old scans math	Tables	Overall
olmOCR v0.4.0	83.0	82.3	84.9	82.4±1.1
Chandra OCR 0.1.0	82.2	80.3	88.0	83.1±0.9
dots.ocr-1.5	85.9	85.5	90.7	83.9±0.9

General Vision Tasks

DocVQA	ChartQA	OCRBench	AI2D	CharXiv Descriptive	RefCOCO
91.85	83.2	86.0	82.16	77.4	80.03

Usage

vLLM (recommended)

Important: When using llm.chat(), you must pass chat_template_content_format="string". The model's tokenizer chat template expects string content, not OpenAI-format lists. Without this, the model produces empty output.

from vllm import LLM, SamplingParams

llm = LLM(
    model="davanstrien/dots.ocr-1.5",
    trust_remote_code=True,
    max_model_len=24000,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=24000)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text content from this image."},
    ],
}]

outputs = llm.chat(
    [messages],
    sampling_params,
    chat_template_content_format="string",  # Required!
)
print(outputs[0].outputs[0].text)
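
The `url` value in the message above is elided. As a minimal sketch, assuming you have the image as raw bytes, a base64 data URL of that shape can be built with the standard library alone. The helper name `to_data_url` is hypothetical and not part of vLLM.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The 8-byte PNG signature stands in for a real file here; in practice
# you would pass the bytes read from your own image file.
url = to_data_url(b"\x89PNG\r\n\x1a\n")
print(url)
```

Pass a matching `mime_type` (for example `"image/jpeg"`) when the source image is not a PNG.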

vLLM Server

vllm serve davanstrien/dots.ocr-1.5 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --chat-template-content-format string \
    --trust-remote-code
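
Once the server is running, it can be queried through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses only the standard library and assumes the server listens on vLLM's default `http://localhost:8000`. The function names `build_chat_request` and `send_chat_request` and the example image URL are illustrative assumptions, and the sampling values simply mirror the offline example above.

```python
import json
import urllib.request


def build_chat_request(image_url: str, prompt: str,
                       model: str = "davanstrien/dots.ocr-1.5") -> dict:
    """Build an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.1,
        "top_p": 0.9,
    }


def send_chat_request(payload: dict,
                      base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the server and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Placeholder image URL; a base64 data URL works in the same field.
payload = build_chat_request("https://example.com/page.png",
                             "Extract the text content from this image.")
# print(send_chat_request(payload))  # needs the server above to be running
```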

Transformers

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

model = AutoModelForCausalLM.from_pretrained(
    "davanstrien/dots.ocr-1.5",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("davanstrien/dots.ocr-1.5", trust_remote_code=True)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "document.jpg"},
        {"type": "text", "text": "Extract the text content from this image."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=24000)
output = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]
print(output)

Prompt Modes

Mode	Description	Output
`ocr`	Text extraction (default)	Markdown
`layout-all`	Layout + bboxes + categories + text	JSON
`layout-only`	Layout + bboxes + categories (no text)	JSON
`web-parsing`	Webpage layout analysis	JSON
`scene-spotting`	Scene text detection	Text
`grounding-ocr`	Text from bounding box region	Text
`general`	Free-form (custom prompt)	Varies

Bbox Coordinate System (layout modes)

Bounding boxes are in the resized image coordinate space, not original image coordinates. The model uses Qwen2VLImageProcessor which resizes images so that width × height ≤ 11,289,600 pixels, with dimensions rounded to multiples of 28.

To map bboxes back to original coordinates:

import math

def smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=11289600):
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

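# orig_h, orig_w: height and width of the original image in pixels
# (with PIL: orig_w, orig_h = image.size)
resized_h, resized_w = smart_resize(orig_h, orig_w)
scale_x, scale_y = orig_w / resized_w, orig_h / resized_h
# For each bbox coordinate returned by the model:
# orig_x = bbox_x * scale_x, orig_y = bbox_y * scale_y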
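
The scaling step can be wrapped in a small helper. This is a sketch that assumes each bounding box is a list `[x1, y1, x2, y2]` in resized-image pixels; that format is an assumption here, so check it against your actual output. The name `bbox_to_original` is hypothetical, and the resized dimensions are passed in so the helper stays independent of how they were computed.

```python
def bbox_to_original(bbox, orig_w, orig_h, resized_w, resized_h):
    """Scale an [x1, y1, x2, y2] box from resized-image to original-image pixels."""
    x1, y1, x2, y2 = bbox  # assumed format: [x1, y1, x2, y2]
    return [
        round(x1 * orig_w / resized_w),
        round(y1 * orig_h / resized_h),
        round(x2 * orig_w / resized_w),
        round(y2 * orig_h / resized_h),
    ]


# A 2000x1000 original; smart_resize(1000, 2000) above yields 1008x1988 (h, w).
print(bbox_to_original([994, 504, 1988, 1008],
                       orig_w=2000, orig_h=1000,
                       resized_w=1988, resized_h=1008))
```

Rounding to integers is a convenience for drawing or cropping; keep the floats if you need sub-pixel precision.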

Model Details

Architecture: DotsOCRForCausalLM (custom code, trust_remote_code=True required)
Parameters: 3B total (1.2B vision encoder, 1.7B language model)
Precision: BF16
Max context: 131,072 tokens
Vision: Patch size 14, spatial merge size 2, flash_attention_2
Languages: English, Chinese (simplified + traditional), multilingual (Tibetan, Kannada, Russian, Dutch, and more)
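
The patch size and spatial merge size suggest a rough way to estimate how many visual tokens an image consumes. The sketch below assumes Qwen2-VL-style merging, where each 2 × 2 group of 14-pixel patches becomes one token, so one token covers a 28 × 28 pixel block. That assumption is not confirmed by this card, and `estimate_visual_tokens` is a hypothetical helper, so treat the result as an estimate only.

```python
PATCH_SIZE = 14          # from the model details above
SPATIAL_MERGE_SIZE = 2   # from the model details above


def estimate_visual_tokens(resized_h: int, resized_w: int) -> int:
    """Estimate visual tokens, assuming one token per merged 28x28 pixel block."""
    block = PATCH_SIZE * SPATIAL_MERGE_SIZE
    if resized_h % block or resized_w % block:
        raise ValueError("pass the resized dimensions (multiples of 28)")
    return (resized_h // block) * (resized_w // block)


# A 1008x1988 (h, w) resized image spans 36 x 71 blocks.
print(estimate_visual_tokens(1008, 1988))
```

Under the same assumption, the 11,289,600-pixel resize limit corresponds to at most 14,400 visual tokens per image, which may help when choosing `max_model_len`.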

Limitations

Complex table and formula extraction remains challenging for the compact 3B architecture
SVG parsing for pictures needs further robustness improvements
Occasional parsing failures on edge cases

License

This model is released under the dots.ocr License Agreement, which is based on the MIT License with supplementary terms covering responsible use, attribution, and data governance. Per the license: "If Licensee distributes modified weights or fine-tuned models based on the Model Materials, Licensee must prominently display the following statement: 'Built with dots.ocr.'"

Citation

@misc{dots_ocr_1_5,
  title={dots.ocr-1.5: Recognize Any Human Scripts and Symbols},
  author={rednote-hilab},
  year={2025},
  url={https://github.com/rednote-hilab/dots.ocr}
}

Built with dots.ocr.