DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM

This repository hosts DOCR-Inspector-7B, a Vision-Language Model (VLM) for fine-grained and automated evaluation of document parsing, as presented in the paper DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM.

DOCR-Inspector is a VLM-based evaluation framework that automatically assesses document parsing results without requiring ground-truth annotations. This repository includes DOCR-Inspector-7B, a document parsing evaluation model fine-tuned from Qwen2.5-VL-7B-Instruct, along with demo code for inference and evaluation.

For more information, code, and datasets, visit the official GitHub repository: DOCR-Inspector GitHub. The associated dataset can be found here: DOCRcase-Datasets


🔍 Introduction

DOCR-Inspector is a Vision-Language Model (VLM) designed for quality inspection of document parsing elements. It takes document element images and their corresponding parsing results as input, detects errors in the parsed content, categorizes them into 28 fine-grained error types, and delivers detailed quality assessment feedback. This approach formalizes document parsing assessment as fine-grained error detection and analysis, leveraging a VLM-as-a-Judge paradigm.

🌟 Key Features

  • No Ground Truth Needed: Evaluates parsing results directly, enabling scalable quality assessment of real-world documents.
  • 28 Fine-grained Error Types: Covers text, tables, and formulas with multi-level error granularity.
  • Reliable Quality Judgement: Uses Chain-of-Checklist (CoCL) reasoning for robust error discovery and explainable evaluation reports.

🧩 Examples

The GitHub repository provides detailed examples for Text, Table, and Equation elements, showcasing the model's fine-grained error detection and quality assessment. You can find full examples at: DOCR-Inspector Examples

Full definitions of the error types are available at: assets/error_type_definition.json

📊 DOCRcase-200K & DOCRcaseBench

DOCRcase-200K

DOCRcase-200K is a large-scale dataset designed for fine-grained error detection and analysis. It contains 212K element-level parsing cases spanning 28 error types across text, table and equation elements; each error is paired with detailed reasoning annotations.


DOCRcaseBench

DOCRcaseBench is a high-quality benchmark dataset tailored for evaluating document quality assessment models. It comprises real parsed outputs from several state-of-the-art models, including MinerU2.0-pipeline, PP-StructureV3, GPT-4o, Qwen2.5-VL-7B-Instruct, MonkeyOCR-1.2B-Pro, and MinerU2.0-VLM. These models were selected as they represent strong, yet imperfect, performance across various benchmarks. To ensure a balanced distribution of error types for robust evaluation, we meticulously supplemented the dataset with additional, hand-crafted examples. Every parsing result is annotated with human-verified error types.

The overall composition of DOCRcaseBench by model source is detailed in the table below.

| Model | Count | Percentage |
|---|---|---|
| MonkeyOCR-1.2B-Pro | 142 | 16.1% |
| PP-StructureV3 | 99 | 11.2% |
| MinerU2.0-pipeline | 149 | 16.9% |
| MinerU2.0-VLM | 22 | 2.5% |
| GPT-4o | 181 | 20.5% |
| Qwen2.5-VL-7B-Instruct | 105 | 11.9% |
| Experts (Hand-crafted/Supplemented) | 64 | 7.3% |
| Total | 882 | 100.0% |
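
As a quick arithmetic check on the composition table above, the following sketch recomputes each percentage against the stated total of 882. The counts are copied from the table; the variable and function names are illustrative only. Note that the listed rows sum to 762, so the per-row percentages add up to 86.4% rather than 100%; the table does not break down the remaining 120 cases.

```python
# Counts copied from the composition table; 882 is the total stated in the table.
counts = {
    "MonkeyOCR-1.2B-Pro": 142,
    "PP-StructureV3": 99,
    "MinerU2.0-pipeline": 149,
    "MinerU2.0-VLM": 22,
    "GPT-4o": 181,
    "Qwen2.5-VL-7B-Instruct": 105,
    "Experts (Hand-crafted/Supplemented)": 64,
}
STATED_TOTAL = 882


def shares(counts, total):
    """Return each source's share of `total` as a percentage rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


percentages = shares(counts, STATED_TOTAL)
listed = sum(counts.values())

for name, pct in percentages.items():
    print(f"{name:38s} {counts[name]:4d} {pct:5.1f}%")
print(f"listed rows: {listed} of {STATED_TOTAL} ({100 * listed / STATED_TOTAL:.1f}%)")
```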

The distribution of document elements (cases) in DOCRcaseBench is summarized below:

| Case Type | Text | Table | Equation | Total |
|---|---|---|---|---|
| Good Case | 39 | 46 | 62 | 147 |
| Bad Case with Single Error | 339 | 141 | 81 | 561 |
| Bad Case with Multi Error | 70 | 55 | 49 | 174 |
| Total | 448 | 242 | 192 | 882 |

🔥 Performance

We report evaluation results for a range of models on DOCRcaseBench, using the following metrics:

  • F1 of Case: Measures the model's accuracy in the binary classification of output quality (Good/Bad).
  • Recall, Precision, and F1 of Error Type: Quantify the model's performance in detecting and correctly classifying the specific error types within the document parsing results.
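
To make the two metric families concrete, here is a minimal sketch of how they could be computed. It is not the authors' implementation (their computation lives in the metrics notebook in the repository). It assumes that a case counts as "Bad" when it carries at least one error type, that "Bad" is the positive class, and that error-type scores are micro-averaged over per-case label sets. The function names and the error-type labels in the toy data are placeholders, not entries from the 28-type taxonomy.

```python
from typing import List, Set, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def case_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Binary Good/Bad F1. Assumption: any error type means 'Bad' (positive class)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    return prf(tp, fp, fn)[2]


def error_type_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision/recall/F1 over error-type labels (assumed scheme)."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return prf(tp, fp, fn)


# Toy data: one good case, one bad case predicted exactly, one partially matched.
gold = [set(), {"placeholder_a"}, {"placeholder_a", "placeholder_b"}]
pred = [set(), {"placeholder_a"}, {"placeholder_b", "placeholder_c"}]

print("case F1:", case_f1(gold, pred))
print("error-type P/R/F1:", error_type_prf(gold, pred))
```

Under a macro-averaged or per-case-averaged scheme the reported F1 would not have to equal the harmonic mean of the reported precision and recall, so the averaging choice matters when comparing against the table.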

DOCR-Inspector-7B achieves state-of-the-art results, scoring highest on every reported metric for all three element types.

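| Model | Text: Case F1 | Text: Error Type Recall | Text: Error Type F1 | Text: Error Type Precision | Table: Case F1 | Table: Error Type Recall | Table: Error Type F1 | Table: Error Type Precision | Equation: Case F1 | Equation: Error Type Recall | Equation: Error Type F1 | Equation: Error Type Precision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Non-Reasoning Models | | | | | | | | | | | | |
| GPT-4o w/o CoT | 72.05 | 31.66 | 28.8 | 28.04 | 73.69 | 29.89 | 26.36 | 25.03 | 79.2 | 49.31 | 47.2 | 46.31 |
| GPT-4o w/ CoT | 77.69 | 30.54 | 27.25 | 26.35 | 81.23 | 34.23 | 29.64 | 28.17 | 79.38 | 46.44 | 45.45 | 45.4 |
| Gemini 2.5 Flash w/o CoT | 84.89 | 43.29 | 29.88 | 25.43 | 82.21 | 41.94 | 25.97 | 21.29 | 80.46 | 53.73 | 48.17 | 45.96 |
| Gemini 2.5 Flash w/ CoT | 84.75 | 42.24 | 29.74 | 25.69 | 81.16 | 42.36 | 24.1 | 19.25 | 80.94 | 50.61 | 46.17 | 44.63 |
| Open-source Non-Reasoning Models | | | | | | | | | | | | |
| Qwen2.5-VL-7B-Instruct w/o CoT | 46.15 | 12.28 | 11.98 | 11.83 | 48.8 | 19.42 | 19.42 | 19.42 | 55.8 | 32.81 | 32.81 | 32.81 |
| Qwen2.5-VL-7B-Instruct w/ CoT | 38.17 | 12.05 | 11.64 | 11.5 | 43.48 | 21.56 | 21.72 | 22.11 | 68.1 | 32.29 | 32.12 | 32.03 |
| Qwen2.5-VL-72B-Instruct w/o CoT | 82.68 | 28.49 | 24.74 | 23.43 | 83.51 | 40.91 | 33.94 | 31.03 | 78.51 | 39.93 | 37.19 | 35.76 |
| Qwen2.5-VL-72B-Instruct w/ CoT | 74.55 | 30.97 | 26.23 | 24.56 | 76.82 | 40.7 | 31.77 | 28.43 | 79.14 | 44.53 | 41.23 | 39.79 |
| Reasoning Models | | | | | | | | | | | | |
| Qwen3-VL-235B-A22B-Thinking | 83.9 | 42.02 | 31.19 | 27.46 | 83.13 | 39.12 | 28.57 | 25.49 | 78.56 | 40.8 | 38.45 | 37.76 |
| Gemini 2.5 Pro Thinking | 88.46 | 47.17 | 32.9 | 28.16 | 82.01 | 43.60 | 32.93 | 29.63 | 77.19 | 53.04 | 48.58 | 47.27 |
| Ours | | | | | | | | | | | | |
| DOCR-Inspector-7B | 96.43 | 81.06 | 80.21 | 81.03 | 86.41 | 63.09 | 62.11 | 62.95 | 85.42 | 74.39 | 73.81 | 74.48 |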

🛠️ Usage

For more details on installation and usage, please visit the DOCR-Inspector GitHub repository.

Installation

DOCR-Inspector-7B is fine-tuned from Qwen2.5-VL-7B-Instruct, so you can follow the Qwen2.5-VL-7B-Instruct installation guide.

We highly recommend installing vLLM >= 0.7.2 to improve inference speed.
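
If you want to confirm that an existing environment meets the recommended minimum, a small check along these lines may help. This is only a sketch: the helper names are made up for illustration, and the version parsing is deliberately simple (it keeps the leading integers of each dotted component, so a pre-release such as `0.7.2rc1` is treated as `0.7.2`).

```python
from importlib import metadata

MIN_VLLM = (0, 7, 2)  # minimum version recommended above


def parse_version(text):
    """Turn '0.7.2' or '0.7.3.post1' into a tuple of leading integers, e.g. (0, 7, 3)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # stop at the first non-numeric component such as 'post1'
        parts.append(int(digits))
    return tuple(parts)


def vllm_is_recent_enough(minimum=MIN_VLLM):
    """Return True or False if vLLM is installed, or None if it is not installed."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= minimum


print("vLLM meets minimum:", vllm_is_recent_enough())
```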

Inference with vLLM

Prepare your element-cropped image and the corresponding parsing results. The required data format should conform to the structure found in ./DOCR-Inspector/demo_data on the GitHub repository.

Then, run the following command to perform inference:

python run_case_inf_vllm.py --model_path ZQTTTT/DOCR-Inspector-7B --image_path /path/to/image --ocr_path /path/to/parsing_result
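
When the script needs to be called from other Python code, for example to process several elements in a loop, the command above can be assembled programmatically. The sketch below uses only the script name and the three flags shown in the command; the wrapper function itself is hypothetical, and whether each path should point to a single file or a directory depends on the demo data format in the repository.

```python
import shlex
import sys


def build_inference_command(image_path, ocr_path,
                            model_path="ZQTTTT/DOCR-Inspector-7B",
                            script="run_case_inf_vllm.py"):
    """Assemble the argument list for the inference script shown above."""
    return [
        sys.executable, script,
        "--model_path", model_path,
        "--image_path", str(image_path),
        "--ocr_path", str(ocr_path),
    ]


cmd = build_inference_command("/path/to/image", "/path/to/parsing_result")
print(shlex.join(cmd))

# To actually launch it from a checkout of the repository:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.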

Evaluation

Download the benchmark dataset from DOCRcaseBench. We provide a complete evaluation pipeline that supports running DOCR-Inspector locally, running other VLMs locally through vLLM, and evaluating API models.

| Component | Description | Path |
|---|---|---|
| vLLM Inference Scripts | Run DOCR-Inspector locally | bench_inf_DOCR-Inspector.py |
| vLLM Inference Scripts | Run other VLM locally | bench_inf_qwenvl_vllm.py |
| API Evaluation Scripts | Evaluate GPT/Gemini etc. | bench_inf_api.py |
| Pre-computed Paper Results | Results used in the main paper | evaluation/results |
| Metric Computation Notebook | Compute F1/Precision/Recall | metrics.ipynb |

Acknowledgements

Citation

If you find our work helpful or inspiring, please feel free to cite it:

@misc{zhou2024docrinspector,
      title={DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM},
      author={Yifei Zhou and Qianlan Yang and Kaixiang Lin and Min Bai and Xiong Zhou and Yu-Xiong Wang and Sergey Levine and Erran Li},
      year={2024},
      eprint={2512.10619},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}