Instructions for using AIDC-AI/OvisOCR with libraries and local apps.

Libraries

How to use AIDC-AI/OvisOCR with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AIDC-AI/OvisOCR", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/OvisOCR", trust_remote_code=True, dtype="auto")

vLLM

How to use AIDC-AI/OvisOCR with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AIDC-AI/OvisOCR"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AIDC-AI/OvisOCR",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
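
The same request can also be sent from Python. The following is a minimal sketch that uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. The helper names `build_chat_request` and `post_chat` are illustrative and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(model: str, image_url: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 300.0) -> str:
    """POST the payload to /v1/chat/completions and return the first reply's text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the generated text under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `post_chat("http://localhost:8000", build_chat_request("AIDC-AI/OvisOCR", image_url, prompt))` should return the model's reply as a string. The 300-second timeout is an arbitrary choice for long pages, not a recommended value.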

Use Docker

docker model run hf.co/AIDC-AI/OvisOCR

SGLang

How to use AIDC-AI/OvisOCR with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AIDC-AI/OvisOCR" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AIDC-AI/OvisOCR",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
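
The requests above point at a remote image URL. For local scans, OpenAI-compatible servers commonly accept a base64 data URL in the `image_url` field. The sketch below builds one; treat data-URL support as an assumption to verify against your server version. The helper name `image_to_data_url` is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    # Infer the MIME type from the file extension (e.g. .png -> image/png).
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be placed in the `"url"` field of the `image_url` content part in place of the `https://` link.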

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AIDC-AI/OvisOCR" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request as in the pip-based SGLang setup above (port 30000).

Docker Model Runner
How to use AIDC-AI/OvisOCR with Docker Model Runner:
```
docker model run hf.co/AIDC-AI/OvisOCR
```

OvisOCR

Introduction

We introduce OvisOCR, a lightweight end-to-end multimodal large language model (MLLM) tailored for high-fidelity document parsing. Unlike conventional Crop-OCR-Merge systems that rely on layout detection, localized cropping, specialized recognizers, and heuristic merging, OvisOCR directly maps full-page document images into structured Markdown outputs.

OvisOCR is designed for information-dense documents containing natural language text, tables, mathematical formulas, figures, and complex layouts. It preserves fine-grained textual fidelity while maintaining global document structure and human reading order. With only 1.3B parameters, OvisOCR achieves outstanding overall performance on OmniDocBench v1.5.

[Figure: benchmark results]

Key Features

Strictly End-to-End Document Parsing
OvisOCR directly maps full-page visual signals to structured Markdown without localized slicing, layout-dependent recognition, or post-hoc merging. This streamlined paradigm reduces error propagation and improves global serialization consistency.

Synergistic Data Construction
Our data construction pipeline builds high-quality supervision by combining the strengths of a specialized OCR engine and a general-purpose MLLM. The specialized perceiver supplies dense local evidence, while the general reasoner checks for hallucinations, content completeness, table validity, formula syntax, and logical reading order.

Multi-Granularity Alignment
OvisOCR uses element-aware optimization for heterogeneous document constituents. Text, tables, and formulas are optimized with tailored reward signals, including edit-distance-based text fidelity, TEDS-based table similarity, and CDM-based formula visual correctness.

Strong Document Parsing Capability with Compact Scale
With only 1.3B parameters, OvisOCR achieves outstanding performance on OmniDocBench v1.5, surpassing strong specialized parsers, large general MLLMs, and traditional pipeline tools.
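
To make the idea of an edit-distance-based text fidelity signal concrete, here is a small sketch that scores a prediction against a reference using normalized Levenshtein distance. It illustrates the general technique only; the exact reward used to train OvisOCR is not specified here, and the names `levenshtein` and `text_fidelity` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def text_fidelity(prediction: str, reference: str) -> float:
    """Score in [0, 1]: 1.0 for an exact match, lower as more edits are needed."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings match exactly
    return 1.0 - levenshtein(prediction, reference) / longest
```

Dividing by the longer string's length keeps the score comparable across short and long text blocks, which is one common normalization choice among several.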

Inference

pip install "vllm==0.18.1" pillow

from PIL import Image
from vllm import LLM, SamplingParams


class OvisOCRParser:
    def __init__(self, model_name_or_path: str):
        self.model = LLM(
            model=model_name_or_path,
            tensor_parallel_size=1,
            trust_remote_code=True,
            gpu_memory_utilization=0.8,
        )

        prompt = 'Extract all readable content from the image in natural human reading order and output the result as a single Markdown document. For charts or images, represent them using an HTML image tag: <' + 'img src="images/bbox_{left}_{top}_{right}_{bottom}.jpg" />, where left, top, right, bottom are bounding box coordinates scaled to [0, 1000). Format formulas as LaTeX. Format tables as HTML: <table>...</table>. Transcribe all other text as standard Markdown. Preserve the original text without translation or paraphrasing.'
        self.prompt = self.model.get_tokenizer().apply_chat_template(
            [{"role": "user", "content": f"<image>\n{prompt}"}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False
        )

        self.sampling_params = SamplingParams(
            max_tokens=16384,
            temperature=0.0,
        )

    def _clean_truncated_repeats(
        self,
        text: str,
        min_text_len: int = 8000,
        max_period: int = 200,
        min_period: int = 1,
        min_repeat_chars: int = 100,
        min_repeat_times: int = 5
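    ) -> str:
        """Collapse a repeated tail in long outputs.

        If the text ends with a unit of min_period..max_period characters that
        repeats at least min_repeat_times times and spans at least
        min_repeat_chars characters, keep one copy of the unit plus any trailing
        partial copy. Texts shorter than min_text_len are returned unchanged.
        """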
        n = len(text)
        if n < min_text_len:
            return text

        max_period = min(max_period, n - 1)
        for unit_len in range(min_period, max_period + 1):
            if text[n - 1] != text[n - 1 - unit_len]:
                continue

            match_len = 1
            idx = n - 2
            while idx >= unit_len and text[idx] == text[idx - unit_len]:
                match_len += 1
                idx -= 1

            total_len = match_len + unit_len
            repeat_times = total_len // unit_len
            tail_len = total_len % unit_len

            if repeat_times >= min_repeat_times and total_len >= min_repeat_chars:
                return text[: n - total_len + unit_len] + text[n - tail_len:]

        return text

    def parse(self, images: list[Image.Image], filter_imgtags: bool = True) -> list[str]:
        vllm_inputs = [
            {
                "prompt": self.prompt,
                "multi_modal_data": {"image": image},
                "mm_processor_kwargs": {
                    "images_kwargs": {
                        "min_pixels": 448 * 448,
                        "max_pixels": 2880 * 2880,
                    }
                }
            }
            for image in images
        ]

        outputs = self.model.generate(vllm_inputs, self.sampling_params)

        markdowns = []
        for output in outputs:
            text = output.outputs[0].text.strip()
            if filter_imgtags:
                text = "\n\n".join(
                    block
                    for block in text.split("\n\n")
                    if not block.strip().startswith('<img src="images/bbox_')
                )
            markdowns.append(self._clean_truncated_repeats(text))

        return markdowns


if __name__ == "__main__":
    parser = OvisOCRParser("AIDC-AI/OvisOCR")
    images = [Image.open("test1.jpg"), Image.open("test2.jpg")]
    markdowns = parser.parse(images)
    print(markdowns[0])
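
When `filter_imgtags=False`, the output keeps the image tags requested by the prompt, whose coordinates are scaled to [0, 1000). The sketch below converts those tags back to pixel boxes. It assumes that left and right scale with the image width and top and bottom with the image height; the helper name `extract_pixel_boxes` is hypothetical.

```python
import re

# Matches the image tags requested by the prompt, e.g.
# <img src="images/bbox_100_200_500_800.jpg" />
IMG_TAG = re.compile(r'<img src="images/bbox_(\d+)_(\d+)_(\d+)_(\d+)\.jpg"\s*/?>')


def extract_pixel_boxes(markdown: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) pixel boxes for every bbox image tag."""
    boxes = []
    for match in IMG_TAG.finditer(markdown):
        left, top, right, bottom = (int(value) for value in match.groups())
        # Assumption: x coordinates are relative to width, y coordinates to height.
        boxes.append((
            round(left * width / 1000),
            round(top * height / 1000),
            round(right * width / 1000),
            round(bottom * height / 1000),
        ))
    return boxes
```

Each returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so a figure region can be cut out with `image.crop(box)` using the page's `image.width` and `image.height`.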

Citation

If you find OvisOCR useful, please consider citing our paper:

@inproceedings{jiang2026ovisocr,
  title = {{OvisOCR}: End-to-End Document Parsing via Aligning Specialized Perception with General Reasoning},
  author = {Jiang, Jun-Peng and Lu, Shiyin and Ji, An-Yang and Li, Yinglun and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and Ye, Han-Jia},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series = {Proceedings of Machine Learning Research},
  volume = {306},
  address = {Seoul, South Korea},
  publisher = {PMLR},
  year = {2026}
}

License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

Disclaimer

We used automated filtering and quality-assurance procedures during data construction to reduce parsing errors such as repeated hallucinations, incomplete content, invalid table/formula structures, and reading-order inconsistencies. Due to the diversity and complexity of real-world documents, OvisOCR may still produce incorrect or incomplete outputs. Please manually verify results in critical applications.

Safetensors model size: 1B params. Tensor type: BF16.
