# Infinity-Parser2-Pro

<p align="center">
<img src="assets/logo.png" width="400"/>
</p>
<p align="center">
💻 <a href="https://github.com/infly-ai/INF-MLLM">GitHub</a> |
📊 <a>Dataset (coming soon...)</a> |
📄 <a>Paper (coming soon...)</a> |
🚀 <a>Demo (coming soon...)</a>
</p>

## News

- [2026-04-14] We uploaded the quick start guide for Infinity-Parser2. Feel free to contact us if you have any questions.
- [2026-04-11] We released Infinity-Parser2-Pro, our flagship document parsing model, now available as a preview. Stay tuned: the official release, the lightweight Infinity-Parser2-Flash, and our multimodal parsing dataset Infinity-Doc2-10M are coming soon.

## Introduction

We are excited to release Infinity-Parser2-Pro, our latest flagship document understanding model. It sets a new state of the art on olmOCR-Bench with a score of 86.7%, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr. Building on our previous model, Infinity-Parser-7B, we have significantly enhanced our data engine and our multi-task reinforcement learning approach. These improvements let the model consolidate robust multimodal parsing capabilities into a unified architecture and deliver new zero-shot capabilities for diverse real-world business scenarios.

### Key Features

- **Upgraded Data Engine**: We comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. It generates over 1 million diverse full-text samples covering a wide range of document layouts. Combined with a dynamic adaptive sampling strategy, this keeps multi-task learning balanced and robust across document types.
- **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support joint reinforcement learning (RL), enabling simultaneous co-optimization of multiple complex tasks, including doc2json and doc2markdown.
- **Breakthrough Parsing Performance**: The model substantially outperforms our previous 7B model, achieving 86.7% on olmOCR-Bench and surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr.
- **Inference Acceleration**: By adopting an efficient mixture-of-experts (MoE) architecture, inference throughput increased by 21% (from 441 to 534 tokens/sec), reducing deployment latency and cost.
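
The 21% figure follows directly from the two throughput numbers quoted above. The short sketch below reproduces the arithmetic; the helper name `relative_gain` and the 10,000-token workload are illustrative assumptions, not measurements from this project.

```python
def relative_gain(before: float, after: float) -> float:
    """Return the relative change from `before` to `after` as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100.0


# Throughput figures quoted in the feature list, in tokens/sec.
baseline_tps = 441
moe_tps = 534

gain = relative_gain(baseline_tps, moe_tps)
print(f"Throughput gain: {gain:.1f}%")

# Generation time for a hypothetical 10,000-token output at each rate.
# This workload size is an assumption chosen only to illustrate the effect.
tokens = 10_000
print(f"{tokens / baseline_tps:.1f}s -> {tokens / moe_tps:.1f}s")
```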

## Performance

<p align="left">
<img src="assets/document_parsing_performance_evaluation.png" width="1200"/>
</p>

## Quick Start

### Installation

#### Pre-requisites

```bash
# Create a Conda environment (Optional)
conda create -n infinity_parser2 python=3.12
conda activate infinity_parser2

# Install PyTorch (CUDA). Find the proper version at https://pytorch.org/get-started/previous-versions based on your CUDA version.
pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128

# Install FlashAttention (FlashAttention-2 is recommended by default)
# Standard install (compiles from source, ~10-30 min):
pip install flash-attn==2.8.3 --no-build-isolation
# Faster install: download wheel from https://github.com/Dao-AILab/flash-attention/releases. Then run: pip install /path/to/<wheel_filename>.whl
# For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See: https://github.com/Dao-AILab/flash-attention
# NOTE: The code will prioritize detecting FlashAttention-3. If not found, it falls back to FlashAttention-2.

# Install vLLM
# NOTE: you may need to run the command below to resolve triton and numpy conflicts before installing vllm.
# pip uninstall -y pytorch-triton opencv-python opencv-python-headless numpy && rm -rf "$(python -c 'import site; print(site.getsitepackages()[0])')/cv2"
pip install vllm==0.17.1
```

#### Install infinity_parser2

Install from PyPI

```bash
pip install infinity_parser2
```

Install from source

```bash
git clone https://github.com/infly-ai/INF-MLLM.git
cd INF-MLLM/Infinity-Parser2
pip install -e .
```

### Usage

#### Command Line

The `parser` command is the fastest way to get started.

```bash
# NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.

# Parse a PDF (outputs Markdown by default)
parser demo_data/demo.pdf

# Parse an image
parser demo_data/demo.png

# Batch parse multiple files
parser demo_data/demo.pdf demo_data/demo.png -o ./output

# Parse an entire directory
parser demo_data -o ./output

# Output raw JSON with layout bboxes
parser demo_data/demo.pdf --output-format json

# Convert to Markdown directly
parser demo_data/demo.png --task doc2md
```

```bash
# View all options
parser --help
```
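
If you need to call the CLI from a script, the commands above can be assembled programmatically. The following is a minimal sketch, assuming only the flags that appear in the examples (`-o`, `--output-format`, `--task`). The helper names `build_parser_command` and `run_parser` are hypothetical and not part of the package; run `parser --help` to confirm the options available in your installed version.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence


def build_parser_command(
    inputs: Sequence[str],
    output_dir: Optional[str] = None,
    output_format: Optional[str] = None,
    task: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the `parser` CLI (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one input path is required")
    cmd = ["parser", *inputs]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if output_format is not None:
        cmd += ["--output-format", output_format]
    if task is not None:
        cmd += ["--task", task]
    return cmd


def run_parser(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs `parser` on PATH)."""
    return subprocess.run(cmd, check=False).returncode


cmd = build_parser_command(
    ["demo_data/demo.pdf", "demo_data/demo.png"],
    output_dir="./output",
)
# Print the shell-quoted command instead of running it, so the sketch
# works even where the package is not installed.
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.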

#### Python API

```python
# NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
from infinity_parser2 import InfinityParser2

parser = InfinityParser2()

# Parse a single file (returns Markdown)
result = parser.parse("demo_data/demo.pdf")
print(result)

# Parse multiple files (returns list)
results = parser.parse(["demo_data/demo.pdf", "demo_data/demo.png"])

# Parse a directory (returns dict)
results = parser.parse("demo_data")
```
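
Because `parse` returns a string, a list, or a dict depending on the input, downstream code may want one uniform shape. The sketch below shows one possible way to normalize the three cases into a dict. It assumes that a returned list follows the order of the input list and that a returned dict maps a file identifier to its output; neither detail is confirmed here, and `as_mapping` is a hypothetical helper, not part of `infinity_parser2`. The sample values stand in for real model output.

```python
from typing import Dict, List, Sequence, Union

ParseResult = Union[str, List[str], Dict[str, str]]


def as_mapping(source: Union[str, Sequence[str]], result: ParseResult) -> Dict[str, str]:
    """Normalize a parse result into {input identifier: output text}."""
    if isinstance(result, str):
        # Single file: one output for one input path.
        return {str(source): result}
    if isinstance(result, dict):
        # Directory: assumed to be keyed by file already; copy it.
        return dict(result)
    if isinstance(result, list):
        # Multiple files: assumed to follow the order of the input list.
        if isinstance(source, str) or len(source) != len(result):
            raise ValueError("list results need an input list of equal length")
        return dict(zip(source, result))
    raise TypeError(f"unsupported result type: {type(result).__name__}")


# Stand-in values that mimic the three documented return shapes.
single = as_mapping("demo_data/demo.pdf", "# Title")
multi = as_mapping(["demo_data/demo.pdf", "demo_data/demo.png"], ["# A", "# B"])
folder = as_mapping("demo_data", {"demo.pdf": "# A", "demo.png": "# B"})
print(sorted(multi))
```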

**Task types and output formats:**

| task_type  | Description                                   | Default Output   |
|------------|-----------------------------------------------|------------------|
| `doc2json` | Extract layout elements with bboxes (default) | Markdown         |
| `doc2md`   | Directly convert to Markdown                  | Markdown         |
| `custom`   | Use your own prompt                           | Raw model output |

```python
# doc2json: get raw JSON with bbox coordinates
result = parser.parse("demo_data/demo.pdf", output_format="json")

# doc2md: direct Markdown conversion
result = parser.parse("demo_data/demo.pdf", task_type="doc2md")

# Custom prompt
result = parser.parse("demo_data/demo.pdf", task_type="custom",
                      custom_prompt="Please transform the document's contents into Markdown format.")

# Batch processing with custom batch size (a directory input returns a dict)
results = parser.parse("demo_data", batch_size=8)

# Save results to directory
parser.parse("demo_data/demo.pdf", output_dir="./output")
```

**Backends:**

Infinity-Parser2 supports three inference backends. By default it uses the **vLLM Engine** (offline batch inference).

```python
# vLLM Engine (default): offline batch inference
parser = InfinityParser2(
    model_name="infly/Infinity-Parser2-Pro",
    backend="vllm-engine",  # default
    tensor_parallel_size=2,
)

# Transformers: local single-GPU inference
parser = InfinityParser2(
    model_name="infly/Infinity-Parser2-Pro",
    backend="transformers",
    device="cuda",
    torch_dtype="bfloat16",  # "float16" or "bfloat16"
)

# vLLM Server: online HTTP API (start server first)
parser = InfinityParser2(
    model_name="infly/Infinity-Parser2-Pro",
    backend="vllm-server",
    api_url="http://localhost:8000/v1/chat/completions",
    api_key="EMPTY",
)
```

To start a vLLM server:

```bash
vllm serve infly/Infinity-Parser2-Pro \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --enable-prefix-caching
```
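
Once the server is running, it can also be queried directly through vLLM's OpenAI-compatible chat completions endpoint, without the `infinity_parser2` package. The sketch below builds such a request with an inline base64 image using only the standard library. The helper names `build_chat_payload` and `post_chat` are hypothetical, the prompt is the custom-prompt example from this guide rather than the package's built-in task prompt, and the image bytes are a placeholder.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_chat_payload(
    image_bytes: bytes,
    prompt: str,
    model: str = "infly/Infinity-Parser2-Pro",
    mime_type: str = "image/png",
) -> Dict[str, Any]:
    """Build an OpenAI-style chat request carrying one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


def post_chat(
    payload: Dict[str, Any],
    api_url: str = "http://localhost:8000/v1/chat/completions",
    api_key: str = "EMPTY",
) -> str:
    """Send the request and return the text of the first choice."""
    request = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder bytes; in practice read a real page image from disk.
payload = build_chat_payload(
    b"placeholder image bytes",
    "Please transform the document's contents into Markdown format.",
)
print(payload["messages"][0]["content"][1]["text"])
# With a live server: print(post_chat(payload))
```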

For more details, please refer to the [official guide](https://github.com/infly-ai/INF-MLLM/blob/main/Infinity-Parser2).

## Acknowledgments

We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmOCR-Bench](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), and [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing datasets, code, and models.

## License

This model is licensed under Apache-2.0.