DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
This repository hosts DOCR-Inspector-7B, a Vision-Language Model (VLM) for fine-grained and automated evaluation of document parsing, as presented in the paper DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM.
DOCR-Inspector is a VLM-based evaluation framework designed to automatically assess document parsing results without requiring ground-truth annotations. This repository includes DOCR-Inspector-7B, a document parsing evaluation model fine-tuned from Qwen2.5-VL-7B-Instruct, along with demo code for inference and evaluation.
For more information, code, and datasets, visit the official GitHub repository: DOCR-Inspector GitHub. The associated dataset can be found here: DOCRcase-Datasets
Introduction
DOCR-Inspector is a Vision-Language Model (VLM) designed for quality inspection of document parsing elements. It takes document element images and their corresponding parsing results as input, detects errors in the parsed content, categorizes them into 28 fine-grained error types, and delivers detailed quality assessment feedback. This approach formalizes document parsing assessment as fine-grained error detection and analysis, leveraging a VLM-as-a-Judge paradigm.
Key Features
- No Ground Truth Needed: Evaluates parsing results directly, enabling scalable real-world document quality assessment.
- 28 Fine-grained Error Types: Covers text, tables, and formulas with multi-level error granularity.
- Reliable Quality Judgement: Equipped with Chain-of-Checklist (CoCL) reasoning for robust error discovery and explainable evaluation reports.
Examples
The GitHub repository provides detailed examples for Text, Table, and Equation elements, showcasing the model's fine-grained error detection and quality assessment. You can find full examples at: DOCR-Inspector Examples
Full definitions of the error types are available at: assets/error_type_definition.json
DOCRcase-200K & DOCRcaseBench
DOCRcase-200K
DOCRcase-200K is a large-scale dataset designed for fine-grained error detection and analysis. It contains 212K element-level parsing cases spanning 28 error types across text, table and equation elements; each error is paired with detailed reasoning annotations.
DOCRcaseBench
DOCRcaseBench is a high-quality benchmark dataset tailored for evaluating document quality assessment models. It comprises real parsed outputs from several state-of-the-art models, including MinerU2.0-pipeline, PP-StructureV3, GPT-4o, Qwen2.5-VL-7B-Instruct, MonkeyOCR-1.2B-Pro, and MinerU2.0-VLM. These models were selected as they represent strong, yet imperfect, performance across various benchmarks. To ensure a balanced distribution of error types for robust evaluation, we meticulously supplemented the dataset with additional, hand-crafted examples. Every parsing result is annotated with human-verified error types.
The overall composition of DOCRcaseBench by model source is detailed in the table below.
| Model | Count | Percentage |
|---|---|---|
| MonkeyOCR-1.2B-Pro | 142 | 16.1% |
| PP-StructureV3 | 99 | 11.2% |
| MinerU2.0-pipeline | 149 | 16.9% |
| MinerU2.0-VLM | 22 | 2.5% |
| GPT-4o | 181 | 20.5% |
| Qwen2.5-VL-7B-Instruct | 105 | 11.9% |
| Experts (Hand-crafted/Supplemented) | 64 | 7.3% |
| Total | 882 | 100.0% |
The distribution of document elements (cases) in DOCRcaseBench is summarized below:
| | Text | Table | Equation | Total |
|---|---|---|---|---|
| Good Case | 39 | 46 | 62 | 147 |
| Bad Case with Single Error | 339 | 141 | 81 | 561 |
| Bad Case with Multi Error | 70 | 55 | 49 | 174 |
| Total | 448 | 242 | 192 | 882 |
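The class balance implied by the table above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the counts are copied from the table, and the variable names are made up for this example.

```python
# Counts copied from the element distribution table above.
counts = {
    "Good Case": {"Text": 39, "Table": 46, "Equation": 62},
    "Bad Case with Single Error": {"Text": 339, "Table": 141, "Equation": 81},
    "Bad Case with Multi Error": {"Text": 70, "Table": 55, "Equation": 49},
}

# Row totals (per case category) and the grand total.
row_totals = {name: sum(row.values()) for name, row in counts.items()}
total = sum(row_totals.values())

# Column totals (per element type).
col_totals = {
    element: sum(row[element] for row in counts.values())
    for element in ("Text", "Table", "Equation")
}

# Share of each case category, in percent of all cases.
shares = {name: round(100 * n / total, 1) for name, n in row_totals.items()}

for name, n in row_totals.items():
    print(f"{name}: {n} cases ({shares[name]}%)")
```

Good cases make up about one sixth of the benchmark, which is worth keeping in mind when reading the case-level F1 scores.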
Performance
We report the evaluation results of various models on DOCRcaseBench, using the following metrics:
- F1 of Case: Measures the model's accuracy in the binary classification of output quality (Good/Bad).
- Recall, Precision, and F1 of Error Type: Quantify the model's performance in detecting and correctly classifying the specific error types within the document parsing results.
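The metrics above can be sketched as follows. This is a minimal illustration, not the code in the repository's metrics.ipynb. It assumes each case carries a set of gold error-type labels and a set of predicted labels, treats an empty set as a Good case, takes Bad as the positive class for the case-level F1, and micro-averages error-type counts over all cases. The averaging scheme used in the paper may differ, and the label names in the usage note are placeholders.

```python
from typing import List, Set, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); a zero denominator yields 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def case_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Binary Good/Bad F1. Assumption: a case is Bad when its error set is non-empty."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gold_bad, pred_bad = bool(g), bool(p)
        if gold_bad and pred_bad:
            tp += 1
        elif pred_bad:
            fp += 1  # predicted Bad, actually Good
        elif gold_bad:
            fn += 1  # predicted Good, actually Bad
    return prf(tp, fp, fn)[2]


def error_type_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-case error-type sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # error types correctly detected
        fp += len(p - g)  # predicted types that are not in the gold set
        fn += len(g - p)  # gold types that were missed
    return prf(tp, fp, fn)
```

For example, with `gold = [{"A"}, set(), {"A", "B"}]` and `pred = [{"A"}, {"C"}, {"B"}]`, `case_f1` counts two true positives and one false positive, while `error_type_prf` counts two matched labels, one spurious label, and one missed label.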
DOCR-Inspector-7B achieves state-of-the-art results across all element types.
| Model | Text: Case F1 | Text: Error Type Recall | Text: Error Type F1 | Text: Error Type Precision | Table: Case F1 | Table: Error Type Recall | Table: Error Type F1 | Table: Error Type Precision | Equation: Case F1 | Equation: Error Type Recall | Equation: Error Type F1 | Equation: Error Type Precision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Non-Reasoning Models | ||||||||||||
| GPT-4o w/o CoT | 72.05 | 31.66 | 28.8 | 28.04 | 73.69 | 29.89 | 26.36 | 25.03 | 79.2 | 49.31 | 47.2 | 46.31 |
| GPT-4o w/ CoT | 77.69 | 30.54 | 27.25 | 26.35 | 81.23 | 34.23 | 29.64 | 28.17 | 79.38 | 46.44 | 45.45 | 45.4 |
| Gemini 2.5 Flash w/o CoT | 84.89 | 43.29 | 29.88 | 25.43 | 82.21 | 41.94 | 25.97 | 21.29 | 80.46 | 53.73 | 48.17 | 45.96 |
| Gemini 2.5 Flash w/ CoT | 84.75 | 42.24 | 29.74 | 25.69 | 81.16 | 42.36 | 24.1 | 19.25 | 80.94 | 50.61 | 46.17 | 44.63 |
| Open-source Non-Reasoning Models | ||||||||||||
| Qwen2.5-VL-7B-Instruct w/o CoT | 46.15 | 12.28 | 11.98 | 11.83 | 48.8 | 19.42 | 19.42 | 19.42 | 55.8 | 32.81 | 32.81 | 32.81 |
| Qwen2.5-VL-7B-Instruct w/ CoT | 38.17 | 12.05 | 11.64 | 11.5 | 43.48 | 21.56 | 21.72 | 22.11 | 68.1 | 32.29 | 32.12 | 32.03 |
| Qwen2.5-VL-72B-Instruct w/o CoT | 82.68 | 28.49 | 24.74 | 23.43 | 83.51 | 40.91 | 33.94 | 31.03 | 78.51 | 39.93 | 37.19 | 35.76 |
| Qwen2.5-VL-72B-Instruct w/ CoT | 74.55 | 30.97 | 26.23 | 24.56 | 76.82 | 40.7 | 31.77 | 28.43 | 79.14 | 44.53 | 41.23 | 39.79 |
| Reasoning Models | ||||||||||||
| Qwen3-VL-235B-A22B-Thinking | 83.9 | 42.02 | 31.19 | 27.46 | 83.13 | 39.12 | 28.57 | 25.49 | 78.56 | 40.8 | 38.45 | 37.76 |
| Gemini 2.5 Pro Thinking | 88.46 | 47.17 | 32.9 | 28.16 | 82.01 | 43.60 | 32.93 | 29.63 | 77.19 | 53.04 | 48.58 | 47.27 |
| Ours: | ||||||||||||
| DOCR-Inspector-7B | 96.43 | 81.06 | 80.21 | 81.03 | 86.41 | 63.09 | 62.11 | 62.95 | 85.42 | 74.39 | 73.81 | 74.48 |
Usage
For more details on installation and usage, please visit the DOCR-Inspector GitHub repository.
Installation
DOCR-Inspector-7B is fine-tuned from Qwen2.5-VL-7B-Instruct, so you can follow the Qwen2.5-VL-7B-Instruct installation guide.
We highly recommend installing vLLM >= 0.7.2 to improve inference speed.
Inference with vLLM
Prepare your element-cropped image and the corresponding parsing results. The required data format should conform to the structure found in ./DOCR-Inspector/demo_data on the GitHub repository.
Then, run the following command to perform inference:
python run_case_inf_vllm.py --model_path ZQTTTT/DOCR-Inspector-7B --image_path /path/to/image --ocr_path /path/to/parsing_result
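To run the command above over many element crops, one option is to build one command per image. The sketch below is an assumption-heavy illustration rather than part of the repository. It reuses only the script name and the three flags shown above. The directory layout, the pairing of an image with a parsing result by file stem, the `.png` and `.md` extensions, and the `build_commands` helper itself are hypothetical, so check ./DOCR-Inspector/demo_data for the actual input format.

```python
from pathlib import Path
from typing import List


def build_commands(image_dir: str, ocr_dir: str,
                   model_path: str = "ZQTTTT/DOCR-Inspector-7B",
                   ocr_suffix: str = ".md") -> List[List[str]]:
    """Build one run_case_inf_vllm.py command per image that has a parsing result.

    Assumption: an image `foo.png` is paired with `foo<ocr_suffix>` in `ocr_dir`.
    """
    commands = []
    for image in sorted(Path(image_dir).glob("*.png")):
        ocr = Path(ocr_dir) / (image.stem + ocr_suffix)
        if not ocr.exists():
            continue  # skip images without a matching parsing result
        commands.append([
            "python", "run_case_inf_vllm.py",
            "--model_path", model_path,
            "--image_path", str(image),
            "--ocr_path", str(ocr),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Note that launching the script once per case reloads the model every time, so this is only practical for small batches.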
Evaluation
Download the benchmark dataset from DOCRcaseBench. We provide a complete evaluation pipeline that supports inference with DOCR-Inspector, API models, and other VLMs served locally through vLLM.
| Component | Description | Path |
|---|---|---|
| vLLM Inference Scripts | Run DOCR-Inspector locally | bench_inf_DOCR-Inspector.py |
| vLLM Inference Scripts | Run other VLM locally | bench_inf_qwenvl_vllm.py |
| API Evaluation Scripts | Evaluate GPT/Gemini etc. | bench_inf_api.py |
| Pre-computed Paper Results | Results used in the main paper | evaluation/results |
| Metric Computation Notebook | Compute F1/Precision/Recall | metrics.ipynb |
Acknowledgements
Citation
If you find our work helpful or inspiring, please feel free to cite it:
@misc{zhou2024docrinspector,
title={DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM},
author={Yifei Zhou and Qianlan Yang and Kaixiang Lin and Min Bai and Xiong Zhou and Yu-Xiong Wang and Sergey Levine and Erran Li},
year={2025},
eprint={2512.10619},
archivePrefix={arXiv},
primaryClass={cs.CL}
}