OvisOCR

Introduction

We introduce OvisOCR, a lightweight end-to-end multimodal large language model (MLLM) tailored for high-fidelity document parsing. Unlike conventional Crop-OCR-Merge systems, which rely on layout detection, localized cropping, specialized recognizers, and heuristic merging, OvisOCR maps full-page document images directly to structured Markdown.

OvisOCR is designed for information-dense documents containing natural language text, tables, mathematical formulas, figures, and complex layouts. It preserves fine-grained textual fidelity while maintaining global document structure and human reading order. With only 1.3B parameters, OvisOCR achieves outstanding overall performance on OmniDocBench v1.5.

[Figure: benchmark]

Key Features

  • Strictly End-to-End Document Parsing
    OvisOCR directly maps full-page visual signals to structured Markdown without localized slicing, layout-dependent recognition, or post-hoc merging. This streamlined paradigm reduces error propagation and improves global serialization consistency.

  • Synergistic Data Construction
    Our data construction pipeline builds high-quality supervision by combining the strengths of a specialized OCR engine and a general-purpose MLLM. The specialized perceiver supplies dense local evidence, while the general reasoner checks for hallucinations, content completeness, table validity, formula syntax, and logical reading order.

  • Multi-Granularity Alignment
    OvisOCR uses element-aware optimization for heterogeneous document constituents. Text, tables, and formulas are optimized with tailored reward signals, including edit-distance-based text fidelity, TEDS-based table similarity, and CDM-based formula visual correctness.

  • Strong Document Parsing Capability with Compact Scale
    With only 1.3B parameters, OvisOCR achieves outstanding performance on OmniDocBench v1.5, surpassing strong specialized parsers, large general MLLMs, and traditional pipeline tools.
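As a rough illustration of the edit-distance-based text fidelity signal mentioned under Multi-Granularity Alignment, the sketch below scores a predicted string against a reference with a normalized Levenshtein distance. The function names `levenshtein` and `text_fidelity_reward`, and the choice to normalize by the longer string, are assumptions made for this example; the exact reward formulation used to train OvisOCR is not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with a rolling DP row."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def text_fidelity_reward(prediction: str, reference: str) -> float:
    """Hypothetical reward in [0, 1]; 1.0 means an exact match.

    Assumption: the distance is normalized by the longer of the two strings.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest
```

Under this normalization, identical strings score 1.0 and strings with no characters in common at any aligned position score 0.0. Table and formula signals (TEDS, CDM) need structure-aware comparison and are not covered by this sketch.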

Inference

pip install "vllm==0.18.1" pillow

from PIL import Image
from vllm import LLM, SamplingParams


class OvisOCRParser:
    def __init__(self, model_name_or_path: str):
        self.model = LLM(
            model=model_name_or_path,
            tensor_parallel_size=1,
            trust_remote_code=True,
            gpu_memory_utilization=0.8,
        )

        prompt = 'Extract all readable content from the image in natural human reading order and output the result as a single Markdown document. For charts or images, represent them using an HTML image tag: <' + 'img src="images/bbox_{left}_{top}_{right}_{bottom}.jpg" />, where left, top, right, bottom are bounding box coordinates scaled to [0, 1000). Format formulas as LaTeX. Format tables as HTML: <table>...</table>. Transcribe all other text as standard Markdown. Preserve the original text without translation or paraphrasing.'
        self.prompt = self.model.get_tokenizer().apply_chat_template(
            [{"role": "user", "content": f"<image>\n{prompt}"}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False
        )

        self.sampling_params = SamplingParams(
            max_tokens=16384,
            temperature=0.0,
        )

    def _clean_truncated_repeats(
        self,
        text: str,
        min_text_len: int = 8000,
        max_period: int = 200,
        min_period: int = 1,
        min_repeat_chars: int = 100,
        min_repeat_times: int = 5
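    ) -> str:
        """Collapse a repeating tail, e.g. a generation loop cut off at max_tokens.

        If the text ends with a unit of min_period..max_period characters that
        repeats at least min_repeat_times times and spans at least
        min_repeat_chars characters, keep one copy of the unit plus any trailing
        partial copy. Texts shorter than min_text_len are returned unchanged.
        """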
        n = len(text)
        if n < min_text_len:
            return text

        max_period = min(max_period, n - 1)
        for unit_len in range(min_period, max_period + 1):
            if text[n - 1] != text[n - 1 - unit_len]:
                continue

            match_len = 1
            idx = n - 2
            while idx >= unit_len and text[idx] == text[idx - unit_len]:
                match_len += 1
                idx -= 1

            total_len = match_len + unit_len
            repeat_times = total_len // unit_len
            tail_len = total_len % unit_len

            if repeat_times >= min_repeat_times and total_len >= min_repeat_chars:
                return text[: n - total_len + unit_len] + text[n - tail_len:]

        return text

    def parse(self, images: list[Image.Image], filter_imgtags: bool = True) -> list[str]:
        vllm_inputs = [
            {
                "prompt": self.prompt,
                "multi_modal_data": {"image": image},
                "mm_processor_kwargs": {
                    "images_kwargs": {
                        "min_pixels": 448 * 448,
                        "max_pixels": 2880 * 2880,
                    }
                }
            }
            for image in images
        ]

        outputs = self.model.generate(vllm_inputs, self.sampling_params)

        markdowns = []
        for output in outputs:
            text = output.outputs[0].text.strip()
            if filter_imgtags:
                text = "\n\n".join(
                    block
                    for block in text.split("\n\n")
                    if not block.strip().startswith('<img src="images/bbox_')
                )
            markdowns.append(self._clean_truncated_repeats(text))

        return markdowns


if __name__ == "__main__":
    parser = OvisOCRParser("AIDC-AI/OvisOCR")
    images = [Image.open("test1.jpg"), Image.open("test2.jpg")]
    markdowns = parser.parse(images)
    print(markdowns[0])
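The prompt above asks the model to emit figure placeholders as `<img src="images/bbox_{left}_{top}_{right}_{bottom}.jpg" />` with coordinates scaled to [0, 1000). A minimal sketch of mapping those coordinates back to pixels follows. The helper name `extract_pixel_boxes` is hypothetical, and it assumes that left/right scale with the image width and top/bottom with the image height. Tags are only present in the output when `parse` is called with `filter_imgtags=False`.

```python
import re

# Matches the placeholder tag format requested in the prompt.
IMG_TAG = re.compile(
    r'<img src="images/bbox_(\d+)_(\d+)_(\d+)_(\d+)\.jpg"\s*/>'
)


def extract_pixel_boxes(
    markdown: str, width: int, height: int
) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) pixel boxes for each image tag.

    Assumption: x coordinates are scaled by width / 1000 and
    y coordinates by height / 1000.
    """
    boxes = []
    for match in IMG_TAG.finditer(markdown):
        left, top, right, bottom = (int(value) for value in match.groups())
        boxes.append(
            (
                left * width // 1000,
                top * height // 1000,
                right * width // 1000,
                bottom * height // 1000,
            )
        )
    return boxes
```

Each returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so a figure region could be saved with `image.crop(box)` using `image.width` and `image.height` as the page dimensions.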

Citation

If you find OvisOCR useful, please consider citing our paper:

@inproceedings{jiang2026ovisocr,
  title = {{OvisOCR}: End-to-End Document Parsing via Aligning Specialized Perception with General Reasoning},
  author = {Jiang, Jun-Peng and Lu, Shiyin and Ji, An-Yang and Li, Yinglun and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and Ye, Han-Jia},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series = {Proceedings of Machine Learning Research},
  volume = {306},
  address = {Seoul, South Korea},
  publisher = {PMLR},
  year = {2026}
}

License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

Disclaimer

We used automated filtering and quality-assurance procedures during data construction to reduce parsing errors such as repeated hallucinations, incomplete content, invalid table/formula structures, and reading-order inconsistencies. Due to the diversity and complexity of real-world documents, OvisOCR may still produce incorrect or incomplete outputs. Please manually verify results in critical applications.
