---
license: cc-by-nc-sa-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM

This repository hosts **DOCR-Inspector-7B**, a Vision-Language Model (VLM) for fine-grained and automated evaluation of document parsing, as presented in the paper [DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM](https://huggingface.co/papers/2512.10619).

**DOCR-Inspector** is a VLM-based evaluation framework designed to automatically assess document parsing results **without requiring ground-truth annotations**. This repository includes **DOCR-Inspector-7B**, a document parsing evaluation model fine-tuned from *Qwen2.5-VL-7B-Instruct*, along with demo code for inference and evaluation.

For more information, code, and datasets, visit the official GitHub repository: [DOCR-Inspector GitHub](https://github.com/ZZZZZQT/DOCR-Inspector).
The associated dataset can be found here: [DOCRcase-Datasets](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets).

<p align="center">
 <img src="https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/intro.png" width="100%"/> <br>
</p>

## πŸ” Introduction
DOCR-Inspector is a Vision-Language Model (VLM) designed for quality inspection of document parsing elements. It takes document element images and their corresponding parsing results as input, detects errors in the parsed content, categorizes them into 28 fine-grained error types, and delivers detailed quality assessment feedback. This approach formalizes document parsing assessment as fine-grained error detection and analysis, leveraging a VLM-as-a-Judge paradigm.
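To make the task concrete, the model can be thought of as mapping an (element image, parsing result) pair to a structured verdict. The record below is purely illustrative; every field name and error label in it is hypothetical, not the model's actual output schema:

```python
# Illustrative shape of a fine-grained assessment report. All field names
# and error-type labels are hypothetical, not DOCR-Inspector's real schema.
verdict = {
    "quality": "bad",                   # binary Good/Bad case judgement
    "error_types": [                    # fine-grained error categories
        {
            "type": "table_cell_missing",  # hypothetical label
            "evidence": "cell (2,3) in the image has no counterpart in the parse",
        },
    ],
    "explanation": "The parsed table drops one body cell; text is otherwise faithful.",
}

print(verdict["quality"], len(verdict["error_types"]))  # → bad 1
```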

## 🌟 Key Features

- **No Ground Truth Needed** β€” Evaluates parsing results directly, enabling scalable real-world document quality assessment.
- **28 Fine-grained Error Types** β€” Covers text, tables, and formulas with multi-level error granularity.
- **Reliable Quality Judgement** β€” Equipped with *Chain-of-Checklist (CoCL)* reasoning for robust error discovery and explainable evaluation reports.

## 🧩 Examples

The GitHub repository provides detailed examples for Text, Table, and Equation elements, showcasing the model's fine-grained error detection and quality assessment.
You can find full examples at: [DOCR-Inspector Examples](https://github.com/ZZZZZQT/DOCR-Inspector#%EF%B8%8F-examples)

πŸ“ Full definition of error types available at: [assets/error_type_definition.json](https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/error_type_definition.json)

# πŸ“Š DOCRcase-200K & DOCRcaseBench

## DOCRcase-200K
DOCRcase-200K is a large-scale dataset designed for fine-grained error detection and analysis.
It contains 212K element-level parsing cases spanning 28 error types across text, table and equation elements; each error is paired with detailed reasoning annotations.
<p align="center">
 <img src="https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/DOCRcase-200k.png" width="100%"/> <br>
</p>

## [DOCRcaseBench](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets)
DOCRcaseBench is a high-quality benchmark dataset tailored for evaluating document quality assessment models.
It comprises real parsed outputs from several state-of-the-art models, including MinerU2.0-pipeline, PP-StructureV3, GPT-4o, Qwen2.5-VL-7B-Instruct, MonkeyOCR-1.2B-Pro, and MinerU2.0-VLM. These models were selected as they represent strong, yet imperfect, performance across various benchmarks.
To ensure a balanced distribution of error types for robust evaluation, **we meticulously supplemented the dataset with additional, hand-crafted examples**.
**Every parsing result is annotated with human-verified error types.**

The overall composition of DOCRcaseBench by model source is detailed in the table below.

| Model | Count | Percentage |
| :--- | :---: | :---: |
| MonkeyOCR-1.2B-Pro | 142 | 16.1% |
| PP-StructureV3 | 99 | 11.2% |
| MinerU2.0-pipeline | 149 | 16.9% |
| MinerU2.0-VLM | 22 | 2.5% |
| GPT-4o | 181 | 20.5% |
| Qwen2.5-VL-7B-Instruct | 105 | 11.9% |
| Experts (Hand-crafted/Supplemented) | 64 | 7.3% |
| **Total** | **882** | **100.0%** |

The distribution of document elements (cases) in DOCRcaseBench is summarized below:

| | Text | Table | Equation | **Total** |
| :--- | :---: | :---: | :---: | :---: |
| Good Case | 39 | 46 | 62 | 147 |
| Bad Case with Single Error | 339 | 141 | 81 | 561 |
| Bad Case with Multi Error | 70 | 55 | 49 | 174 |
| **Total** | **448** | **242** | **192** | **882** |


# πŸ”₯ Performance
We report evaluation results for a range of models on DOCRcaseBench.

*   **F1 of Case:** Measures the model's accuracy in the binary classification of output quality (Good/Bad).
*   **Recall, Precision, and F1 of Error Type:** Quantify the model's performance in detecting and correctly classifying the specific error types within the document parsing results.
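
For a single case, the error-type scores reduce to standard set-overlap precision/recall/F1 between predicted and gold error-type labels. A minimal sketch (the paper's exact aggregation across cases, e.g. micro vs. macro averaging, may differ; the error-type names used here are placeholders):

```python
def prf1(predicted, gold):
    """Set-overlap precision/recall/F1 for one case's predicted vs. gold
    error-type labels (labels here are hypothetical placeholders)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One hit ("text_missing"), one false alarm, one miss:
print(prf1({"text_missing", "text_repeated"},
           {"text_missing", "text_order"}))  # → (0.5, 0.5, 0.5)
```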

**DOCR-Inspector-7B achieves state-of-the-art results across all element types.**

<table>
    <thead>
        <tr>
            <th rowspan="3">Model</th>
            <th colspan="4">Text</th>
            <th colspan="4">Table</th>
            <th colspan="4">Equation</th>
        </tr>
        <tr>
            <th colspan="2">Case</th>
            <th colspan="2">Error Type</th>
            <th colspan="2">Case</th>
            <th colspan="2">Error Type</th>
            <th colspan="2">Case</th>
            <th colspan="2">Error Type</th>
        </tr>
        <tr>
            <th>F1</th>
            <th>Recall</th>
            <th>F1</th>
            <th>Precision</th>
            <th>F1</th>
            <th>Recall</th>
            <th>F1</th>
            <th>Precision</th>
            <th>F1</th>
            <th>Recall</th>
            <th>F1</th>
            <th>Precision</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th colspan="13">Proprietary Non-Reasoning Models</th>
        </tr>
        <tr>
            <td>GPT-4o w/o CoT</td>
            <td>72.05</td><td>31.66</td><td>28.8</td><td>28.04</td>
            <td>73.69</td><td>29.89</td><td>26.36</td><td>25.03</td>
            <td>79.2</td><td>49.31</td><td>47.2</td><td>46.31</td>
        </tr>
        <tr>
            <td>GPT-4o w/ CoT</td>
            <td>77.69</td><td>30.54</td><td>27.25</td><td>26.35</td>
            <td>81.23</td><td>34.23</td><td>29.64</td><td>28.17</td>
            <td>79.38</td><td>46.44</td><td>45.45</td><td>45.4</td>
        </tr>
        <tr>
            <td>Gemini 2.5 Flash w/o CoT</td>
            <td>84.89</td><td>43.29</td><td>29.88</td><td>25.43</td>
            <td>82.21</td><td>41.94</td><td>25.97</td><td>21.29</td>
            <td>80.46</td><td>53.73</td><td>48.17</td><td>45.96</td>
        </tr>
        <tr>
            <td>Gemini 2.5 Flash w/ CoT</td>
            <td>84.75</td><td>42.24</td><td>29.74</td><td>25.69</td>
            <td>81.16</td><td>42.36</td><td>24.1</td><td>19.25</td>
            <td><ins>80.94</ins></td><td>50.61</td><td>46.17</td><td>44.63</td>
        </tr>
        <tr>
            <th colspan="13">Open-source Non-Reasoning Models</th>
        </tr>
        <tr>
            <td>Qwen2.5-VL-7B-Instruct w/o CoT</td>
            <td>46.15</td><td>12.28</td><td>11.98</td><td>11.83</td>
            <td>48.8</td><td>19.42</td><td>19.42</td><td>19.42</td>
            <td>55.8</td><td>32.81</td><td>32.81</td><td>32.81</td>
        </tr>
        <tr>
            <td>Qwen2.5-VL-7B-Instruct w/ CoT</td>
            <td>38.17</td><td>12.05</td><td>11.64</td><td>11.5</td>
            <td>43.48</td><td>21.56</td><td>21.72</td><td>22.11</td>
            <td>68.1</td><td>32.29</td><td>32.12</td><td>32.03</td>
        </tr>
        <tr>
            <td>Qwen2.5-VL-72B-Instruct w/o CoT</td>
            <td>82.68</td><td>28.49</td><td>24.74</td><td>23.43</td>
            <td>83.51</td><td>40.91</td><td>33.94</td><td>31.03</td>
            <td>78.51</td><td>39.93</td><td>37.19</td><td>35.76</td>
        </tr>
        <tr>
            <td>Qwen2.5-VL-72B-Instruct w/ CoT</td>
            <td>74.55</td><td>30.97</td><td>26.23</td><td>24.56</td>
            <td>76.82</td><td>40.7</td><td>31.77</td><td>28.43</td>
            <td>79.14</td><td>44.53</td><td>41.23</td><td>39.79</td>
        </tr>
        <tr>
            <th colspan="13">Reasoning Models</th>
        </tr>
        <tr>
            <td>Qwen3-VL-235B-A22B-Thinking</td>
            <td>83.9</td><td>42.02</td><td>31.19</td><td>27.46</td>
            <td>83.13</td><td>39.12</td><td>28.57</td><td>25.49</td>
            <td>78.56</td><td>40.8</td><td>38.45</td><td>37.76</td>
        </tr>
        <tr>
            <td>Gemini 2.5 Pro Thinking</td>
            <td>88.46</td><td>47.17</td><td>32.9</td><td>28.16</td>
            <td>82.01</td><td><ins>43.60</ins></td><td>32.93</td><td>29.63</td>
            <td>77.19</td><td>53.04</td><td>48.58</td><td><ins>47.27</ins></td>
        </tr>
        <tr>
            <th colspan="13">Ours:</th>
        </tr>
        <tr class="ours-row">
            <td><strong>DOCR-Inspector-7B</strong></td>
            <td><strong>96.43</strong></td><td><strong>81.06</strong></td><td><strong>80.21</strong></td><td><strong>81.03</strong></td>
            <td><strong>86.41</strong></td><td><strong>63.09</strong></td><td><strong>62.11</strong></td><td><strong>62.95</strong></td>
            <td><strong>85.42</strong></td><td><strong>74.39</strong></td><td><strong>73.81</strong></td><td><strong>74.48</strong></td>
        </tr>
    </tbody>
</table>

# πŸ› οΈ Usage

For more details on installation and usage, please visit the [DOCR-Inspector GitHub repository](https://github.com/ZZZZZQT/DOCR-Inspector).

## Installation

DOCR-Inspector-7B is fine-tuned from Qwen2.5-VL-7B-Instruct, so you can follow the [Qwen2.5-VL installation guide](https://github.com/QwenLM/Qwen3-VL?tab=readme-ov-file#quickstart).

We highly recommend installing [`vLLM >= 0.7.2`](https://github.com/vllm-project/vllm) to improve inference speed.

## Inference with vLLM

Prepare your element-cropped image and its corresponding parsing result. The input must follow the data format of the examples in the repository's `demo_data` directory.

Then, run the following command to perform inference:
```bash
python run_case_inf_vllm.py --model_path ZQTTTT/DOCR-Inspector-7B --image_path /path/to/image --ocr_path /path/to/parsing_result
```
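
Since DOCR-Inspector-7B is a Qwen2.5-VL fine-tune, its input follows the standard Qwen2.5-VL chat format: a single user turn pairing the element image with a text prompt that carries the parsing result. A minimal sketch of that payload (the actual evaluation prompt is defined in the repository's inference scripts; the instruction text below is a placeholder):

```python
import json

def build_messages(image_path: str, parsing_result: str) -> list:
    """Build a Qwen2.5-VL-style chat payload pairing an element image with
    its parsing result. The instruction text is a placeholder, not the
    official DOCR-Inspector prompt."""
    prompt = (
        "Inspect the parsing result of this document element and report "
        "any errors.\n\nParsing result:\n" + parsing_result
    )
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

messages = build_messages("element.png", "| a | b |\n| --- | --- |")
print(json.dumps(messages, indent=2)[:60])
```

This payload is what `AutoProcessor.apply_chat_template` consumes in the usual Qwen2.5-VL inference flow; the vLLM script above handles that step for you.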

## Evaluation

Download the DOCRcaseBench dataset from [DOCRcase-Datasets](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets).
We provide a complete evaluation pipeline that supports inference using **DOCR-Inspector**, **API models**, and **vLLM**.

| Component | Description | Path |
|---|---|---|
| vLLM Inference Scripts | Run DOCR-Inspector locally | [`bench_inf_DOCR-Inspector.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_DOCR-Inspector.py) |
| vLLM Inference Scripts | Run other VLMs locally | [`bench_inf_qwenvl_vllm.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_qwenvl_vllm.py) |
| API Evaluation Scripts | Evaluate GPT/Gemini etc. | [`bench_inf_api.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_api.py) |
| Pre-computed Paper Results | Results used in the main paper | [`evaluation/results`](https://github.com/ZZZZZQT/DOCR-Inspector/tree/main/evaluation/results/) |
| Metric Computation Notebook | Compute F1/Precision/Recall | [`metrics.ipynb`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/metrics/metrics.ipynb) |

## Acknowledgements
- [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl)
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)

# Citation
If you find our work helpful or inspiring, please feel free to cite it:
```bibtex
@misc{zhou2024docrinspector,
      title={DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM},
      author={Yifei Zhou and Qianlan Yang and Kaixiang Lin and Min Bai and Xiong Zhou and Yu-Xiong Wang and Sergey Levine and Erran Li},
      year={2025},
      eprint={2512.10619},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```