---
license: cc-by-nc-sa-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
This repository hosts **DOCR-Inspector-7B**, a Vision-Language Model (VLM) for fine-grained and automated evaluation of document parsing, as presented in the paper [DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM](https://huggingface.co/papers/2512.10619).
**DOCR-Inspector** is a VLM-based evaluation framework designed to automatically assess document parsing results **without requiring ground-truth annotations**. This repository includes **DOCR-Inspector-7B**, a document parsing evaluation model fine-tuned from *Qwen2.5-VL-7B-Instruct*, along with demo code for inference and evaluation.
For more information, code, and datasets, visit the official GitHub repository: [DOCR-Inspector GitHub](https://github.com/ZZZZZQT/DOCR-Inspector).
The associated dataset can be found here: [DOCRcase-Datasets](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets).
<p align="center">
<img src="https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/intro.png" width="100%"/> <br>
</p>
## 🔍 Introduction
DOCR-Inspector is a Vision-Language Model (VLM) designed for quality inspection of document parsing elements. It takes document element images and their corresponding parsing results as input, detects errors in the parsed content, categorizes them into 28 fine-grained error types, and delivers detailed quality assessment feedback. This approach formalizes document parsing assessment as fine-grained error detection and analysis, leveraging a VLM-as-a-Judge paradigm.
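To make the input/output contract concrete, here is a purely illustrative sketch in Python. Every field name below is hypothetical; the actual prompt and output format are defined by the scripts and `demo_data` examples in the GitHub repository.

```python
# Purely illustrative sketch of the DOCR-Inspector task signature.
# Field names are hypothetical; see the repo's demo_data for the real format.
case = {
    "element_image": "crops/table_017.png",   # cropped document element
    "element_type": "table",                  # text / table / equation
    "parsing_result": "<table><tr><td>...</td></tr></table>",  # output of a parsing model
}

expected_judgement = {
    "quality": "bad",                          # good / bad overall verdict
    "error_types": ["cell_content_error"],     # one or more of the 28 fine-grained types (name is hypothetical)
    "analysis": "Row 2, column 3 drops the unit 'kg' that appears in the image.",
}
```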
## 🌟 Key Features
- **No Ground Truth Needed** — Evaluates parsing results directly, enabling scalable real-world document quality assessment.
- **28 Fine-grained Error Types** — Covers text, tables, and formulas with multi-level error granularity.
- **Reliable Quality Judgement** — Equipped with *Chain-of-Checklist (CoCL)* reasoning, ensuring robust error discovery & explainable evaluation reports.
## 🧩 Examples
The GitHub repository provides detailed examples for Text, Table, and Equation elements, showcasing the model's fine-grained error detection and quality assessment.
You can find full examples at: [DOCR-Inspector Examples](https://github.com/ZZZZZQT/DOCR-Inspector#%EF%B8%8F-examples)
📁 Full definition of error types available at: [assets/error_type_definition.json](https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/error_type_definition.json)
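If you want to inspect the error taxonomy programmatically, a minimal sketch such as the following fetches the definition file; the internal structure of the JSON is not documented here, so the snippet only prints the top-level keys.

```python
# Minimal sketch: download the error-type definition file and list its top-level keys.
# The JSON's internal structure is an assumption; adapt once you have inspected it.
import json
from urllib.request import urlopen

URL = "https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/error_type_definition.json"

with urlopen(URL) as resp:
    definitions = json.load(resp)

print(type(definitions))
if isinstance(definitions, dict):
    for key in definitions:
        print(key)
```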
# 📊 DOCRcase-200K & DOCRcaseBench
## DOCRcase-200K
DOCRcase-200K is a large-scale dataset designed for fine-grained error detection and analysis.
It contains 212K element-level parsing cases spanning 28 error types across text, table and equation elements; each error is paired with detailed reasoning annotations.
<p align="center">
<img src="https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/DOCRcase-200k.png" width="100%"/> <br>
</p>
## [DOCRcaseBench](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets)
DOCRcaseBench is a high-quality benchmark dataset tailored for evaluating document quality assessment models.
It comprises real parsed outputs from several state-of-the-art models, including MinerU2.0-pipeline, PP-StructureV3, GPT-4o, Qwen2.5-VL-7B-Instruct, MonkeyOCR-1.2B-Pro, and MinerU2.0-VLM. These models were selected as they represent strong, yet imperfect, performance across various benchmarks.
To ensure a balanced distribution of error types for robust evaluation, **we meticulously supplemented the dataset with additional, hand-crafted examples**.
**Every parsing result is annotated with human-verified error types.**
The overall composition of DOCRcaseBench by model source is detailed in the table below.
| Model | Count | Percentage |
| :--- | :---: | :---: |
| MonkeyOCR-1.2B-Pro | 142 | 16.1% |
| PP-StructureV3 | 99 | 11.2% |
| MinerU2.0-pipeline | 149 | 16.9% |
| MinerU2.0-VLM | 22 | 2.5% |
| GPT-4o | 181 | 20.5% |
| Qwen2.5-VL-7B-Instruct | 105 | 11.9% |
| Experts (Hand-crafted/Supplemented) | 64 | 7.3% |
| **Total** | **882** | **100.0%** |
The distribution of document elements (cases) in DOCRcaseBench is summarized below:
| | Text | Table | Equation | **Total** |
| :--- | :---: | :---: | :---: | :---: |
| Good Case | 39 | 46 | 62 | 147 |
| Bad Case with Single Error | 339 | 141 | 81 | 561 |
| Bad Case with Multi Error | 70 | 55 | 49 | 174 |
| **Total** | **448** | **242** | **192** | **882** |
# 🔥 Performance
We present the evaluation results of various models on DOCRcaseBench.
* **F1 of Case:** Measures the model's accuracy in the binary classification of output quality (Good/Bad).
* **Recall, Precision, and F1 of Error Type:** Quantify the model's performance in detecting and correctly classifying the specific error types within the document parsing results.
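As a rough illustration (not the exact protocol of the paper; use the repository's `metrics.ipynb` for the canonical computation), error-type precision/recall/F1 can be computed by comparing the predicted and annotated error-type sets per case:

```python
# Illustrative set-based precision/recall/F1 over error types, aggregated across cases.
# Simplified sketch; the official metric code is in evaluation/metrics/metrics.ipynb.
def error_type_prf(gold_sets, pred_sets):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # error types both predicted and annotated
        fp += len(pred - gold)   # predicted but not annotated
        fn += len(gold - pred)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example with two cases (error-type names are made up for illustration):
p, r, f1 = error_type_prf(
    gold_sets=[{"text_missing", "text_repetition"}, {"formula_syntax_error"}],
    pred_sets=[{"text_missing"}, {"formula_syntax_error"}],
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```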
**DOCR-Inspector-7B achieves state-of-the-art results across all element types.**
<table>
<thead>
<tr>
<th rowspan="3">Model</th>
<th colspan="4">Text</th>
<th colspan="4">Table</th>
<th colspan="4">Equation</th>
</tr>
<tr>
<th>Case</th>
<th colspan="3">Error Type</th>
<th>Case</th>
<th colspan="3">Error Type</th>
<th>Case</th>
<th colspan="3">Error Type</th>
</tr>
<tr>
<th>F1</th>
<th>Recall</th>
<th>F1</th>
<th>Precision</th>
<th>F1</th>
<th>Recall</th>
<th>F1</th>
<th>Precision</th>
<th>F1</th>
<th>Recall</th>
<th>F1</th>
<th>Precision</th>
</tr>
</thead>
<tbody>
<tr>
<th colspan="13">Proprietary Non-Reasoning Models</th>
</tr>
<tr>
<td>GPT-4o w/o CoT</td>
<td>72.05</td><td>31.66</td><td>28.8</td><td>28.04</td>
<td>73.69</td><td>29.89</td><td>26.36</td><td>25.03</td>
<td>79.2</td><td>49.31</td><td>47.2</td><td>46.31</td>
</tr>
<tr>
<td>GPT-4o w/ CoT</td>
<td>77.69</td><td>30.54</td><td>27.25</td><td>26.35</td>
<td>81.23</td><td>34.23</td><td>29.64</td><td>28.17</td>
<td>79.38</td><td>46.44</td><td>45.45</td><td>45.4</td>
</tr>
<tr>
<td>Gemini 2.5 Flash w/o CoT</td>
<td>84.89</td><td>43.29</td><td>29.88</td><td>25.43</td>
<td>82.21</td><td>41.94</td><td>25.97</td><td>21.29</td>
<td>80.46</td><td>53.73</td><td>48.17</td><td>45.96</td>
</tr>
<tr>
<td>Gemini 2.5 Flash w/ CoT</td>
<td>84.75</td><td>42.24</td><td>29.74</td><td>25.69</td>
<td>81.16</td><td>42.36</td><td>24.1</td><td>19.25</td>
<td><ins>80.94</ins></td><td>50.61</td><td>46.17</td><td>44.63</td>
</tr>
<tr>
<th colspan="13">Open-source Non-Reasoning Models</th>
</tr>
<tr>
<td>Qwen2.5-VL-7B-Instruct w/o CoT</td>
<td>46.15</td><td>12.28</td><td>11.98</td><td>11.83</td>
<td>48.8</td><td>19.42</td><td>19.42</td><td>19.42</td>
<td>55.8</td><td>32.81</td><td>32.81</td><td>32.81</td>
</tr>
<tr>
<td>Qwen2.5-VL-7B-Instruct w/ CoT</td>
<td>38.17</td><td>12.05</td><td>11.64</td><td>11.5</td>
<td>43.48</td><td>21.56</td><td>21.72</td><td>22.11</td>
<td>68.1</td><td>32.29</td><td>32.12</td><td>32.03</td>
</tr>
<tr>
<td>Qwen2.5-VL-72B-Instruct w/o CoT</td>
<td>82.68</td><td>28.49</td><td>24.74</td><td>23.43</td>
<td>83.51</td><td>40.91</td><td>33.94</td><td>31.03</td>
<td>78.51</td><td>39.93</td><td>37.19</td><td>35.76</td>
</tr>
<tr>
<td>Qwen2.5-VL-72B-Instruct w/ CoT</td>
<td>74.55</td><td>30.97</td><td>26.23</td><td>24.56</td>
<td>76.82</td><td>40.7</td><td>31.77</td><td>28.43</td>
<td>79.14</td><td>44.53</td><td>41.23</td><td>39.79</td>
</tr>
<tr>
<th colspan="13">Reasoning Models</th>
</tr>
<tr>
<td>Qwen3-VL-235B-A22B-Thinking</td>
<td>83.9</td><td>42.02</td><td>31.19</td><td>27.46</td>
<td>83.13</td><td>39.12</td><td>28.57</td><td>25.49</td>
<td>78.56</td><td>40.8</td><td>38.45</td><td>37.76</td>
</tr>
<tr>
<td>Gemini 2.5 Pro Thinking</td>
<td>88.46</td><td>47.17</td><td>32.9</td><td>28.16</td>
<td>82.01</td><td><ins>43.60</ins></td><td>32.93</td><td>29.63</td>
<td>77.19</td><td>53.04</td><td>48.58</td><td><ins>47.27</ins></td>
</tr>
<tr>
<th colspan="13">Ours:</th>
</tr>
<tr class="ours-row">
<td><strong>DOCR-Inspector-7B</strong></td>
<td><strong>96.43</strong></td><td><strong>81.06</strong></td><td><strong>80.21</strong></td><td><strong>81.03</strong></td>
<td><strong>86.41</strong></td><td><strong>63.09</strong></td><td><strong>62.11</strong></td><td><strong>62.95</strong></td>
<td><strong>85.42</strong></td><td><strong>74.39</strong></td><td><strong>73.81</strong></td><td><strong>74.48</strong></td>
</tr>
</tbody>
</table>
# 🛠️ Usage
For more details on installation and usage, please visit the [DOCR-Inspector GitHub repository](https://github.com/ZZZZZQT/DOCR-Inspector).
## Installation
DOCR-Inspector-7B is fine-tuned from Qwen2.5-VL-7B-Instruct, so you can follow the [Qwen2.5-VL installation guide](https://github.com/QwenLM/Qwen3-VL?tab=readme-ov-file#quickstart) to set up the environment.
We highly recommend installing [`vLLM >= 0.7.2`](https://github.com/vllm-project/vllm) to improve inference speed.
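If you prefer plain `transformers` over vLLM, a minimal loading sketch following the standard Qwen2.5-VL recipe should work. Note that the prompt text below is only a placeholder; the actual evaluation prompt template is defined by the scripts in the GitHub repository.

```python
# Minimal transformers sketch, assuming the checkpoint follows the standard Qwen2.5-VL layout.
# The evaluation prompt below is a placeholder; the real prompt template lives in the repo scripts.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "ZQTTTT/DOCR-Inspector-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "/path/to/element_crop.png"},
        {"type": "text", "text": "Parsing result:\n...\n\nInspect the parsing result for errors."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```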
## Inference with vLLM
Prepare your element-cropped image and the corresponding parsing results. The required data format should conform to the structure found in `./DOCR-Inspector/demo_data` on the GitHub repository.
Then, run the following command to perform inference:
```bash
python run_case_inf_vllm.py --model_path ZQTTTT/DOCR-Inspector-7B --image_path /path/to/image --ocr_path /path/to/parsing_result
```
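If you would rather call the model from your own Python code instead of the provided script, an offline-inference sketch in the style of vLLM's Qwen2.5-VL examples might look like the following. The chat-template tokens follow the standard Qwen2.5-VL format, and the instruction text is again a placeholder for the prompt template used by `run_case_inf_vllm.py`.

```python
# Offline vLLM sketch (not the repo script). The chat-template tokens follow the
# standard Qwen2.5-VL format; the evaluation instructions themselves are a placeholder.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="ZQTTTT/DOCR-Inspector-7B", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.0, max_tokens=1024)

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Parsing result:\n...\n\nInspect the parsing result for errors.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image = Image.open("/path/to/element_crop.png")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```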
## Evaluation
Download the DOCRcaseBench dataset from [DOCRcase-Datasets](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets).
We provide a complete evaluation pipeline that supports inference using **DOCR-Inspector**, **API models**, and **vLLM**.
| Component | Description | Path |
|---|---|---|
| vLLM Inference Scripts | Run DOCR-Inspector locally | [`bench_inf_DOCR-Inspector.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_DOCR-Inspector.py) |
| vLLM Inference Scripts | Run other VLMs locally | [`bench_inf_qwenvl_vllm.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_qwenvl_vllm.py) |
| API Evaluation Scripts | Evaluate GPT/Gemini etc. | [`bench_inf_api.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_api.py) |
| Pre-computed Paper Results | Results used in the main paper | [`evaluation/results`](https://github.com/ZZZZZQT/DOCR-Inspector/tree/main/evaluation/results/) |
| Metric Computation Notebook | Compute F1/Precision/Recall | [`metrics.ipynb`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/metrics/metrics.ipynb) |
## Acknowledgements
- [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl)
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
# Citation
If you find our work helpful or inspiring, please feel free to cite it:
```bibtex
@misc{zhou2025docrinspector,
title={DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM},
author={Yifei Zhou and Qianlan Yang and Kaixiang Lin and Min Bai and Xiong Zhou and Yu-Xiong Wang and Sergey Levine and Erran Li},
year={2025},
eprint={2512.10619},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```