OvisOCR
Introduction
We introduce OvisOCR, a lightweight end-to-end multimodal large language model (MLLM) tailored for high-fidelity document parsing. Unlike conventional Crop-OCR-Merge systems that rely on layout detection, localized cropping, specialized recognizers, and heuristic merging, OvisOCR directly maps full-page document images into structured Markdown outputs.
OvisOCR is designed for information-dense documents containing natural language text, tables, mathematical formulas, figures, and complex layouts. It preserves fine-grained textual fidelity while maintaining global document structure and human reading order. With only 1.3B parameters, OvisOCR achieves outstanding overall performance on OmniDocBench v1.5.
Key Features
Strictly End-to-End Document Parsing
OvisOCR directly maps full-page visual signals to structured Markdown without localized slicing, layout-dependent recognition, or post-hoc merging. This streamlined paradigm reduces error propagation and improves global serialization consistency.
Synergistic Data Construction
Our data construction pipeline builds high-quality supervision by combining the strengths of a specialized OCR engine and a general-purpose MLLM. The specialized perceiver supplies dense local evidence, while the general reasoner checks for hallucinations, content completeness, table validity, formula syntax, and logical reading order.
Multi-Granularity Alignment
OvisOCR uses element-aware optimization for heterogeneous document constituents. Text, tables, and formulas are optimized with tailored reward signals, including edit-distance-based text fidelity, TEDS-based table similarity, and CDM-based formula visual correctness.
Strong Document Parsing Capability with Compact Scale
With only 1.3B parameters, OvisOCR achieves outstanding performance on OmniDocBench v1.5, surpassing strong specialized parsers, large general MLLMs, and traditional pipeline tools.
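To make the edit-distance-based text fidelity signal mentioned under Multi-Granularity Alignment more concrete, here is a minimal sketch of a normalized edit-distance score. It is an illustration only: the function names `levenshtein` and `text_fidelity_score` and the normalization by the longer string are assumptions, not the reward used to train OvisOCR.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string as the inner loop to bound memory by min(len).
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def text_fidelity_score(prediction: str, reference: str) -> float:
    """Hypothetical score in [0, 1]: 1.0 means an exact match.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest
```

A score of this shape rises as the predicted text gets closer to the reference transcription, which is the property a text fidelity reward needs.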
Inference
pip install "vllm==0.18.1" pillow
from PIL import Image
from vllm import LLM, SamplingParams


class OvisOCRParser:
    def __init__(self, model_name_or_path: str):
        self.model = LLM(
            model=model_name_or_path,
            tensor_parallel_size=1,
            trust_remote_code=True,
            gpu_memory_utilization=0.8,
        )
        prompt = 'Extract all readable content from the image in natural human reading order and output the result as a single Markdown document. For charts or images, represent them using an HTML image tag: <' + 'img src="images/bbox_{left}_{top}_{right}_{bottom}.jpg" />, where left, top, right, bottom are bounding box coordinates scaled to [0, 1000). Format formulas as LaTeX. Format tables as HTML: <table>...</table>. Transcribe all other text as standard Markdown. Preserve the original text without translation or paraphrasing.'
        # Render the chat template once; the same prompt string is reused for every page.
        self.prompt = self.model.get_tokenizer().apply_chat_template(
            [{"role": "user", "content": f"<image>\n{prompt}"}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )
        # temperature=0.0 selects greedy decoding.
        self.sampling_params = SamplingParams(
            max_tokens=16384,
            temperature=0.0,
        )

    def _clean_truncated_repeats(
        self,
        text: str,
        min_text_len: int = 8000,
        max_period: int = 200,
        min_period: int = 1,
        min_repeat_chars: int = 100,
        min_repeat_times: int = 5,
    ) -> str:
        # Collapse a periodic run at the end of a long output to a single copy
        # of the repeated unit plus any trailing partial unit.
        n = len(text)
        if n < min_text_len:
            return text
        max_period = min(max_period, n - 1)
        for unit_len in range(min_period, max_period + 1):
            # A tail with period unit_len must end in two characters unit_len apart that match.
            if text[n - 1] != text[n - 1 - unit_len]:
                continue
            # Walk backwards while each character equals the one unit_len before it.
            match_len = 1
            idx = n - 2
            while idx >= unit_len and text[idx] == text[idx - unit_len]:
                match_len += 1
                idx -= 1
            total_len = match_len + unit_len  # length of the whole periodic tail
            repeat_times = total_len // unit_len
            tail_len = total_len % unit_len  # length of the trailing partial unit
            if repeat_times >= min_repeat_times and total_len >= min_repeat_chars:
                return text[: n - total_len + unit_len] + text[n - tail_len:]
        return text

    def parse(self, images: list[Image.Image], filter_imgtags: bool = True) -> list[str]:
        vllm_inputs = [
            {
                "prompt": self.prompt,
                "multi_modal_data": {"image": image},
                "mm_processor_kwargs": {
                    "images_kwargs": {
                        "min_pixels": 448 * 448,
                        "max_pixels": 2880 * 2880,
                    }
                },
            }
            for image in images
        ]
        outputs = self.model.generate(vllm_inputs, self.sampling_params)
        markdowns = []
        for output in outputs:
            text = output.outputs[0].text.strip()
            if filter_imgtags:
                # Drop blocks that are only figure placeholders (bbox image tags).
                text = "\n\n".join(
                    block
                    for block in text.split("\n\n")
                    if not block.strip().startswith('<img src="images/bbox_')
                )
            markdowns.append(self._clean_truncated_repeats(text))
        return markdowns


if __name__ == "__main__":
    parser = OvisOCRParser("AIDC-AI/OvisOCR")
    images = [Image.open("test1.jpg"), Image.open("test2.jpg")]
    markdowns = parser.parse(images)
    print(markdowns[0])
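When `parse` is called with `filter_imgtags=False`, the figure placeholders requested by the prompt stay in the Markdown. The sketch below shows one way to recover pixel-space boxes from those tags. The helper name `extract_bboxes` and the floor rounding are assumptions for illustration; only the tag format and the [0, 1000) coordinate scale come from the prompt above.

```python
import re

# Matches the placeholder format requested in the prompt:
# <img src="images/bbox_{left}_{top}_{right}_{bottom}.jpg" />
BBOX_TAG = re.compile(r'<img src="images/bbox_(\d+)_(\d+)_(\d+)_(\d+)\.jpg"\s*/>')


def extract_bboxes(markdown: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Hypothetical helper: return (left, top, right, bottom) boxes in pixels.

    Coordinates in the tags are scaled to [0, 1000); they are mapped back
    to the page size with integer floor division (an assumed convention).
    """
    boxes = []
    for match in BBOX_TAG.finditer(markdown):
        left, top, right, bottom = (int(value) for value in match.groups())
        boxes.append((
            left * width // 1000,
            top * height // 1000,
            right * width // 1000,
            bottom * height // 1000,
        ))
    return boxes
```

Each returned tuple has the shape Pillow's `Image.crop` expects, so the figures could be cut from the source page with `image.crop(box)` using `image.size` for `width` and `height`.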
Citation
If you find OvisOCR useful, please consider citing our paper:
@inproceedings{jiang2026ovisocr,
title = {{OvisOCR}: End-to-End Document Parsing via Aligning Specialized Perception with General Reasoning},
author = {Jiang, Jun-Peng and Lu, Shiyin and Ji, An-Yang and Li, Yinglun and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and Ye, Han-Jia},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
volume = {306},
address = {Seoul, South Korea},
publisher = {PMLR},
year = {2026}
}
License
This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).
Disclaimer
We used automated filtering and quality-assurance procedures during data construction to reduce parsing errors such as repeated hallucinations, incomplete content, invalid table/formula structures, and reading-order inconsistencies. Due to the diversity and complexity of real-world documents, OvisOCR may still produce incorrect or incomplete outputs. Please manually verify results in critical applications.