zuminghuang committed
Commit 81ff221 · verified · 1 Parent(s): 221731e

Update README.md

Files changed (1):
1. README.md +499 -37
README.md CHANGED
@@ -6,68 +6,530 @@
 
 <p align="center">
 💻 <a href="https://github.com/infly-ai/INF-MLLM">Github</a> |
- 📊 <a href="https://huggingface.co/datasets/infly/Infinity-Doc2-xxx">Dataset</a> |
 📄 <a>Paper (coming soon...)</a> |
- 🚀 <a>Demo (coming soon...)</a>
 </p>
 
- # Introduction
 
- We are delighted to release Infinity-Parser2-2B, our latest state-of-the-art document understanding model. Compared to our prior model, Infinity-Parser-7B, we have deeply optimized our data engine and multi-task reinforcement learning. We have successfully condensed robust multi-modal parsing capabilities into a highly efficient 2B-parameter model, offering massive speedups and brand-new zero-shot capabilities for real-world business scenarios.
 
- ## Key Features
 
- - **Upgraded Data Engine**: We comprehensively upgraded our data engine by adding over 1 million diverse full-text samples, 170K synthetic financial tables, 900K formulas, and targeted negative samples to mitigate hallucinations. Combined with a dynamic adaptive sampling strategy, this ensures highly balanced and robust multi-task learning across various document types.
- - **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling the model to seamlessly and simultaneously co-optimize multiple complex tasks, including full-text parsing, table and formula extraction, layout analysis, and document VQA.
- - **Breakthrough Parsing Performance**: Despite its compact 2B size, it significantly outperforms our previous 7B model. It achieves State-of-the-Art (SOTA) results on both in-house financial benchmarks (`FinDocBench`, `FinTabBench`)—surpassing frontier models like DeepSeek-OCR2 and GLM-OCR—and public sets like `olmOCR-Bench` and `PubTabNet`, while maintaining highly competitive general multimodal capabilities.
- - **Massive Inference Acceleration (3.68x Faster)**: By transitioning to the highly efficient Qwen3-VL-2B architecture, our inference throughput has surged by **3.68x** (jumping from 441 to 1,624 tokens/sec), dramatically slashing deployment latency and costs without compromising core parsing accuracy.
- - **Expanded Capabilities**: We have unlocked entirely new zero-shot skills in this release, achieving strong benchmark results in chart parsing (`Chart2Table`), chemical structure recognition (including our new `ChemDraw-198`), and layout analysis, where it successfully matches the performance of dedicated specialized models like DocLayout-YOLO.
 
- # Architecture
 
- todo
 
- # Performance
 
- ## Document Parsing
- ![image](assets/xxx.png)
 
- ## Table Parsing
- ![image](assets/xxx.png)
 
- ## Math Formula Parsing
- ![image](assets/xxx.png)
 
- ## Chart Parsing
- ![image](assets/xxx.png)
 
- ## Chemical Formula Parsing
- ![image](assets/xxx.png)
 
- ## General Multimodal Understanding
- ![image](assets/xxx.png)
 
- # Quick Start
 
- todo
 
- # Visualization
 
- ## Comparison Examples
- ![image](assets/xxx.jpeg)
 
- # Limitation & Future Work
 
- ## Limitations
 
- ## Future Work
 
- # Acknowledgments
- We would like to thank [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [ms-swift](https://github.com/modelscope/ms-swift), [verl](https://github.com/verl-project/verl), [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) and [OmniDocBench](https://github.com/opendatalab/OmniDocBench) for providing data, code and models.
 
- # Citation
 
- Coming soon...
 
 # License
 
 <p align="center">
 💻 <a href="https://github.com/infly-ai/INF-MLLM">Github</a> |
+ 📊 <a href="https://huggingface.co/datasets/infly/Infinity-Doc2-5M">Dataset</a> |
 📄 <a>Paper (coming soon...)</a> |
+ 🚀 <a href="https://huggingface.co/spaces/infly/Infinity-Parser2-Demo">Demo</a>
 </p>
 
+ ## Introduction
 
+ We are excited to release Infinity-Parser2, our latest flagship document understanding model. It ships in two variants that target different deployment constraints. Infinity-Parser2-Pro, optimized for maximum accuracy in precision-critical tasks, achieves state-of-the-art results on olmOCR-Bench (87.6%) and ParseBench (74.3%), surpassing frontier models including DeepSeek-OCR-2, PaddleOCR-VL-1.5, and MinerU-2.5. Infinity-Parser2-Flash, engineered for low-latency inference, delivers a 3.68x speedup over our previous Infinity-Parser-7B model. Backed by significant upgrades to both our data engine and our multi-task reinforcement learning recipe, Infinity-Parser2 consolidates robust multi-modal parsing capabilities into a unified architecture and unlocks brand-new zero-shot capabilities across a wide range of real-world business scenarios.
 
+ ### Key Features
 
+ - **Upgraded Data Engine**: We comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. We curated nearly 5 million diverse document parsing samples spanning a wide range of layouts and pair them with a dynamic adaptive sampling strategy, ensuring balanced and robust multi-task learning across document types.
+ - **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling seamless and simultaneous co-optimization of multiple complex tasks, including document parsing, element parsing, chart parsing, chemical formula parsing, document VQA, and general multimodal understanding (a simplified sketch of such a reward follows this list).
+ - **Breakthrough Parsing Performance**: Infinity-Parser2-Pro substantially outperforms our previous 7B model, achieving 87.6% on olmOCR-Bench and 74.3% on ParseBench and surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and MinerU-2.5.
+ - **Inference Acceleration**: Infinity-Parser2-Flash delivers significantly higher efficiency than Infinity-Parser-7B, with inference throughput increased by 3.68x (from 441 to 1,624 tokens/sec), reducing both deployment latency and costs.
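+
+ As an illustration only: a reward is "verifiable" when it can be computed programmatically from the model output and a ground-truth target. The sketch below scores a predicted parse by normalized edit similarity; the function name and the `rapidfuzz` dependency are our assumptions here, and this is not the exact reward system used in training.
+
+ ```python
+ # Simplified, illustrative verifiable reward for a parsing task:
+ # normalized edit similarity between prediction and ground truth.
+ from rapidfuzz.distance import Levenshtein
+
+ def parsing_reward(prediction: str, reference: str) -> float:
+     """Reward in [0, 1]: 1.0 for an exact match, approaching 0.0 as edits accumulate."""
+     if not reference:
+         return 1.0 if not prediction else 0.0
+     dist = Levenshtein.distance(prediction, reference)
+     return max(0.0, 1.0 - dist / max(len(prediction), len(reference)))
+ ```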
 
+ ## Performance
 
+ <p align="left">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/olmocr_bench_perf.png" width="1200"/>
+ </p>
+
+ <p align="left">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/parsebench_perf.png" width="1200"/>
+ </p>
 
+ <table align="center" style="table-layout:fixed;width:100%;font-size:10px">
+ <thead>
+ <tr>
+ <th>Task</th>
+ <th>Infinity-Parser2-Pro</th>
+ <th>Infinity-Parser2-Flash</th>
+ <th>PaddleOCR-VL-1.5</th>
+ <th>DeepSeek-OCR-2</th>
+ <th>MinerU-2.5</th>
+ <th>Gemini-3-Pro</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td colspan=7><b>Document Parsing</b></td>
+ </tr>
+ <tr>
+ <td>olmOCR-bench</td>
+ <td><b>87.6</b></td>
+ <td>86.0</td>
+ <td>80.0†</td>
+ <td>76.3</td>
+ <td>75.2</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td>ParseBench</td>
+ <td><b>74.3</b></td>
+ <td>72.2</td>
+ <td>40.9†</td>
+ <td>41.2</td>
+ <td>45.9</td>
+ <td>69.1‡</td>
+ </tr>
+ <tr>
+ <td>OmniDocBench-v1.6</td>
+ <td>93.95</td>
+ <td>91.98</td>
+ <td><b>94.87</b></td>
+ <td>90.17</td>
+ <td>92.98</td>
+ <td>92.85</td>
+ </tr>
+ <tr>
+ <td colspan=7><b>Layout Analysis (mIoU)</b></td>
+ </tr>
+ <tr>
+ <td>DocLayNet</td>
+ <td>64.93*</td>
+ <td>64.97*</td>
+ <td><b>71.05*</b></td>
+ <td>45.62*</td>
+ <td>67.74*</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td>D4LA</td>
+ <td><b>52.41*</b></td>
+ <td>46.05*</td>
+ <td>50.21*</td>
+ <td>33.03*</td>
+ <td>51.62*</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td>OmniDocBench-v1.5-Layout</td>
+ <td>74.56*</td>
+ <td>73.07*</td>
+ <td>74.80*</td>
+ <td>55.28*</td>
+ <td><b>76.28*</b></td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td colspan=7><b>Element Parsing</b></td>
+ </tr>
+ <tr>
+ <td>OmniDocBench-v1.5-TextBlock</td>
+ <td>93.66</td>
+ <td>93.53</td>
+ <td><b>94.97*</b></td>
+ <td>84.13*</td>
+ <td>86.00</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td>PubTabNet (val)</td>
+ <td><b>94.76</b></td>
+ <td>92.41</td>
+ <td>84.60</td>
+ <td>89.53*</td>
+ <td>89.07</td>
+ <td>91.40</td>
+ </tr>
+ <tr>
+ <td>UniMERNet</td>
+ <td><b>97.7</b></td>
+ <td>96.5</td>
+ <td>95.8*</td>
+ <td>79.8*</td>
+ <td>96.5</td>
+ <td>96.4</td>
+ </tr>
+ <tr>
+ <td colspan=7><b>Chart Parsing</b></td>
+ </tr>
+ <tr>
+ <td>Chart2Table</td>
+ <td>80.45</td>
+ <td>80.49</td>
+ <td><b>86.2*</b></td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td>Chart2Json</td>
+ <td><b>73.69</b></td>
+ <td>67.66</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td colspan=7><b>Chemical Formula Parsing</b></td>
+ </tr>
+ <tr>
+ <td>CoSyn_Chemical</td>
+ <td><b>71.48</b></td>
+ <td>62.08</td>
+ <td>-</td>
+ <td>52.16*</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td colspan=7><b>Document VQA</b></td>
+ </tr>
+ <tr>
+ <td>DocVQA (val)</td>
+ <td><b>96.43</b></td>
+ <td>93.16</td>
+ <td>-</td>
+ <td>43.42*</td>
+ <td>-</td>
+ <td>93.68*</td>
+ </tr>
+ <tr>
+ <td>InfoVQA (val)</td>
+ <td><b>86.26</b></td>
+ <td>75.94</td>
+ <td>-</td>
+ <td>22.07*</td>
+ <td>-</td>
+ <td>85.24*</td>
+ </tr>
+ <tr>
+ <td colspan=7><b>General Multimodal Understanding</b></td>
+ </tr>
+ <tr>
+ <td>AI2D</td>
+ <td>88.89</td>
+ <td>79.53</td>
+ <td>-</td>
+ <td>37.66*</td>
+ <td>-</td>
+ <td><b>91.87*</b></td>
+ </tr>
+ <tr>
+ <td>MathVista (testmini)</td>
+ <td>71.4</td>
+ <td>59.5</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td><b>81.8*</b></td>
+ </tr>
+ <tr>
+ <td>MMBench-EN (dev)</td>
+ <td>87.54</td>
+ <td>77.92</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td><b>90.29*</b></td>
+ </tr>
+ <tr>
+ <td>MMBench-CN (dev)</td>
+ <td>86.43</td>
+ <td>75.77</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td><b>90.98*</b></td>
+ </tr>
+ <tr>
+ <td>MMMU (val)</td>
+ <td><b>61.89</b></td>
+ <td>45.89</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>56.00*</td>
+ </tr>
+ <tr>
+ <td>MMStar</td>
+ <td>69.66</td>
+ <td>57.13</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td><b>83.78*</b></td>
+ </tr>
+ <tr>
+ <td>OCRBench</td>
+ <td>86.20</td>
+ <td>81.60</td>
+ <td>-</td>
+ <td>47.20*</td>
+ <td>-</td>
+ <td><b>89.30*</b></td>
+ </tr>
+ </tbody>
+ </table>
 
+ Note: '*' denotes results evaluated with our internal evaluation tools; '†' marks results reported by PaddleOCR-VL; '‡' marks results obtained with Gemini-3.1-Pro.
 
+ ## Quick Start
 
+ ### 1. Minimal "Hello World" (Native Transformers)
 
+ If you are looking for a minimal script that parses a single image into structured layout JSON using the native `transformers` library, here is a simple snippet:
+ ```python
+ from PIL import Image
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor
+ model = AutoModelForImageTextToText.from_pretrained(
+     "infly/Infinity-Parser2-Pro",
+     torch_dtype="float16",
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("infly/Infinity-Parser2-Pro")
+
+ # Build the messages for the model
+ pil_image = Image.open("demo_data/demo.png").convert("RGB")
+ min_pixels = 2048  # 32 * 64
+ max_pixels = 16777216  # 4096 * 4096
+ prompt = """
+ Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+ 1. Bbox format: [x1, y1, x2, y2]
+ 2. Layout Categories: The possible categories are ['header', 'title', 'text', 'figure', 'table', 'formula', 'figure_caption', 'table_caption', 'formula_caption', 'figure_footnote', 'table_footnote', 'page_footnote', 'footer'].
+ 3. Text Extraction & Formatting Rules:
+ - Figure: For the 'figure' category, the text field should be empty string.
+ - Formula: Format its text as LaTeX.
+ - Table: Format its text as HTML.
+ - All Others (Text, Title, etc.): Format their text as Markdown.
+ 4. Constraints:
+ - The output text must be the original text from the image, with no translation.
+ - All layout elements must be sorted according to human reading order.
+ 5. Final Output: The entire output must be a single JSON object.
+ """
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": pil_image,
+                 "min_pixels": min_pixels,
+                 "max_pixels": max_pixels,
+             },
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ chat_template_kwargs = {"enable_thinking": False}
+
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True, **chat_template_kwargs
+ )
+ image_inputs, _ = process_vision_info(messages, image_patch_size=16)
+
+ inputs = processor(
+     text=text,
+     images=image_inputs,
+     do_resize=False,
+     padding=True,
+     return_tensors="pt",
+ )
+
+ # Move all tensors to the same device as the model
+ inputs = {
+     k: v.to(model.device) if isinstance(v, torch.Tensor) else v
+     for k, v in inputs.items()
+ }
+
+ # Generate the response
+ generated_ids = model.generate(
+     **inputs,
+     max_new_tokens=32768,
+     temperature=0.0,
+     top_p=1.0,
+ )
+
+ # Strip input tokens, keeping only the newly generated response
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):]
+     for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
356
+
357
+ ### 2. Advanced Pipeline (infinity_parser2)
358
+
359
+ For bulk processing, advanced features, or an end-to-end PDF parsing pipeline, we recommend using our infinity_parser2 wrapper.
360
+
361
+ #### Pre-requisites
362
+
363
+ ```bash
364
+ # Create a Conda environment (Optional)
365
+ conda create -n infinity_parser2 python=3.12
366
+ conda activate infinity_parser2
367
+
368
+ # Install PyTorch (CUDA). Find the proper version at https://pytorch.org/get-started/previous-versions based on your CUDA version.
369
+ pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128
370
+
371
+ # Install FlashAttention (FlashAttention-2 is recommended by default)
372
+ # Standard install (compiles from source, ~10-30 min):
373
+ pip install flash-attn==2.8.3 --no-build-isolation
374
+ # Faster install: download wheel from https://github.com/Dao-AILab/flash-attention/releases. Then run: pip install /path/to/<wheel_filename>.whl
375
+ # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See: https://github.com/Dao-AILab/flash-attention
376
+ # NOTE: The code will prioritize detecting FlashAttention-3. If not found, it falls back to FlashAttention-2.
377
+
378
+ # Install vLLM
379
+ # NOTE: you may need to run the command below to resolve triton and numpy conflicts before installing vllm.
380
+ # pip uninstall -y pytorch-triton opencv-python opencv-python-headless numpy && rm -rf "$(python -c 'import site; print(site.getsitepackages()[0])')/cv2"
381
+ pip install vllm==0.17.1
382
+ ```
383
+
384
+ #### Install infinity_parser2
385
+
386
+ Install from PyPI
387
+
388
+ ```bash
389
+ pip install infinity_parser2
390
+ ```
391
+
392
+ Install from source code
393
+
394
+ ```bash
395
+ git clone https://github.com/infly-ai/INF-MLLM.git
396
+ cd INF-MLLM/Infinity-Parser2
397
+ pip install -e .
398
+ ```
399
+
400
+ #### Usage
401
+
402
+ ##### Command Line
403
+
404
+ The `parser` command is the fastest way to get started.
405
+
406
+ ```bash
407
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
408
+
409
+ # Parse a PDF (outputs Markdown by default)
410
+ parser demo_data/demo.pdf
411
+
412
+ # Parse an image
413
+ parser demo_data/demo.png
414
+
415
+ # Batch parse multiple files
416
+ parser demo_data/demo.pdf demo_data/demo.png -o ./output
417
+
418
+ # Parse an entire directory
419
+ parser demo_data -o ./output
420
 
421
+ # Output raw JSON with layout bboxes
422
+ parser demo_data/demo.pdf --output-format json
423
+
424
+ # Convert to Markdown directly
425
+ parser demo_data/demo.png --task doc2md
426
+ ```
427
+
428
+ ```bash
429
+ # View all options
430
+ parser --help
431
+ ```
432
+
+ ##### Python API
+
+ ```python
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
+ from infinity_parser2 import InfinityParser2
+
+ parser = InfinityParser2()
+
+ # Parse a single file (returns Markdown)
+ result = parser.parse("demo_data/demo.pdf")
+ print(result)
+
+ # Parse multiple files (returns list)
+ results = parser.parse(["demo_data/demo.pdf", "demo_data/demo.png"])
+
+ # Parse a directory (returns dict)
+ results = parser.parse("demo_data")
+ ```
+
+ **Output formats:**
+
+ | task_type | Description | Default Output |
+ |-----------|-------------|----------------|
+ | `doc2json` | Extract layout elements with bboxes (default) | Markdown |
+ | `doc2md` | Directly convert to Markdown | Markdown |
+ | `custom` | Use your own prompt | Raw model output |
+
+ ```python
+ # doc2json: get raw JSON with bbox coordinates
+ result = parser.parse("demo_data/demo.pdf", output_format="json")
+
+ # doc2md: direct Markdown conversion
+ result = parser.parse("demo_data/demo.pdf", task_type="doc2md")
+
+ # Custom prompt
+ result = parser.parse(
+     "demo_data/demo.pdf",
+     task_type="custom",
+     custom_prompt="Please transform the document's contents into Markdown format.",
+ )
+
+ # Batch processing with custom batch size
+ result = parser.parse("demo_data", batch_size=8)
+
+ # Save results to directory
+ parser.parse("demo_data/demo.pdf", output_dir="./output")
+ ```
+
+ **Backends:**
+
+ Infinity-Parser2 supports three inference backends. By default it uses the **vLLM Engine** (offline batch inference).
+
+ ```python
+ # vLLM Engine (default) — offline batch inference
+ parser = InfinityParser2(
+     model_name="infly/Infinity-Parser2-Pro",
+     backend="vllm-engine",  # default
+     tensor_parallel_size=2,
+ )
+
+ # Transformers — local single-GPU inference
+ parser = InfinityParser2(
+     model_name="infly/Infinity-Parser2-Pro",
+     backend="transformers",
+     device="cuda",
+     torch_dtype="bfloat16",  # "float16" or "bfloat16"
+ )
+
+ # vLLM Server — online HTTP API (start server first)
+ parser = InfinityParser2(
+     model_name="infly/Infinity-Parser2-Pro",
+     backend="vllm-server",
+     api_url="http://localhost:8000/v1/chat/completions",
+     api_key="EMPTY",
+ )
+ ```
+
+ To start a vLLM server:
+
+ ```bash
+ vllm serve infly/Infinity-Parser2-Pro \
+     --trust-remote-code \
+     --reasoning-parser qwen3 \
+     --host 0.0.0.0 \
+     --port 8000 \
+     --tensor-parallel-size 2 \
+     --gpu-memory-utilization 0.85 \
+     --max-model-len 65536 \
+     --mm-encoder-tp-mode data \
+     --mm-processor-cache-type shm \
+     --enable-prefix-caching
+ ```
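+
+ Once the server is up, any OpenAI-compatible client can query it. The snippet below is a minimal example of ours; the image path and prompt are placeholders (reuse the doc2json prompt from the Quick Start for layout-aware output):
+
+ ```python
+ # Query the vLLM server through its OpenAI-compatible chat API.
+ import base64
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ # Encode the page image as a base64 data URL
+ with open("demo_data/demo.png", "rb") as f:
+     image_b64 = base64.b64encode(f.read()).decode()
+
+ response = client.chat.completions.create(
+     model="infly/Infinity-Parser2-Pro",
+     messages=[{
+         "role": "user",
+         "content": [
+             {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
+             {"type": "text", "text": "Please transform the document's contents into Markdown format."},
+         ],
+     }],
+     temperature=0.0,
+     max_tokens=32768,
+ )
+ print(response.choices[0].message.content)
+ ```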
+
+ For more details, please refer to the [official guide](https://github.com/infly-ai/INF-MLLM/blob/main/Infinity-Parser2).
+
+ ## Limitations
 
+ Infinity-Parser2 has several known limitations. It primarily supports English and Chinese documents, and performance degrades on multilingual content. Accuracy may also drop on charts with complex layouts and on documents containing multi-oriented elements, such as tables rotated at varying angles. In addition, the model does not capture fine-grained text formatting (e.g., bold, italic, strikethrough), and its multimodal instruction following is still limited: it may not reliably execute complex multi-step visual instructions.
 
+ ## Acknowledgments
 
+ We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmOCR](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), and [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing datasets, code, and models.
 
 # License