ZQTTTT nielsr HF Staff committed on
Commit
c6e75bd
·
verified ·
1 Parent(s): bc4b326

Improve model card: add metadata, paper/GitHub links, performance, and usage (#1)


- Improve model card: add metadata, paper/GitHub links, performance, and usage (bcc8697d19444d01b723e812ad4074bd9068e959)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +239 -3
README.md CHANGED
@@ -1,6 +1,242 @@
1
  ---
2
- license: apache-2.0
 
 
3
  ---
4
- DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
5
 
6
- Visit our GitHub repository at [DOCR-Inspector](...) for more information.
1
  ---
2
+ license: cc-by-nc-sa-4.0
3
+ pipeline_tag: image-text-to-text
4
+ library_name: transformers
5
  ---
 
6
 
7
+ # DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
8
+
9
+ This repository hosts **DOCR-Inspector-7B**, a Vision-Language Model (VLM) for fine-grained and automated evaluation of document parsing, as presented in the paper [DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM](https://huggingface.co/papers/2512.10619).
10
+
11
+ **DOCR-Inspector** is a VLM-based evaluation framework designed to automatically assess document parsing results **without requiring ground-truth annotations**. This repository includes **DOCR-Inspector-7B**, a document parsing evaluation model fine-tuned from *Qwen2.5-VL-7B-Instruct*, along with demo code for inference and evaluation.
12
+
13
+ For more information, code, and datasets, visit the official GitHub repository: [DOCR-Inspector GitHub](https://github.com/ZZZZZQT/DOCR-Inspector).
14
+ The associated dataset is available at [DOCRcase-Datasets](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets).
15
+
16
+ <p align="center">
17
+ <img src="https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/intro.png" width="100%"/> <br>
18
+ </p>
19
+
20
+ ## 🔍 Introduction
21
+ DOCR-Inspector is a Vision-Language Model (VLM) designed for quality inspection of document parsing elements. It takes document element images and their corresponding parsing results as input, detects errors in the parsed content, categorizes them into 28 fine-grained error types, and delivers detailed quality assessment feedback. This approach formalizes document parsing assessment as fine-grained error detection and analysis, leveraging a VLM-as-a-Judge paradigm.
22
+
23
+ ## 🌟 Key Features
24
+
25
+ - **No Ground Truth Needed**: Evaluates parsing results directly, enabling scalable quality assessment of real-world documents.
26
+ - **28 Fine-grained Error Types**: Covers text, tables, and formulas at multiple levels of error granularity.
27
+ - **Reliable Quality Judgement**: Uses *Chain-of-Checklist (CoCL)* reasoning to support robust error discovery and explainable evaluation reports.
28
+
29
+ ## 🧩 Examples
30
+
31
+ The GitHub repository provides detailed examples for Text, Table, and Equation elements, showcasing the model's fine-grained error detection and quality assessment.
32
+ You can find full examples at: [DOCR-Inspector Examples](https://github.com/ZZZZZQT/DOCR-Inspector#%EF%B8%8F-examples)
33
+
34
+ 📁 The full definitions of the error types are available at: [assets/error_type_definition.json](https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/error_type_definition.json)
35
+
36
+ # 📊 DOCRcase-200K & DOCRcaseBench
37
+
38
+ ## DOCRcase-200K
39
+ DOCRcase-200K is a large-scale dataset designed for fine-grained error detection and analysis.
40
+ It contains 212K element-level parsing cases spanning 28 error types across text, table and equation elements; each error is paired with detailed reasoning annotations.
41
+ <p align="center">
42
+ <img src="https://github.com/ZZZZZQT/DOCR-Inspector/raw/main/assets/DOCRcase-200k.png" width="100%"/> <br>
43
+ </p>
44
+
45
+ ## [DOCRcaseBench](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets)
46
+ DOCRcaseBench is a high-quality benchmark dataset tailored for evaluating document quality assessment models.
47
+ It comprises real parsed outputs from several state-of-the-art models, including MinerU2.0-pipeline, PP-StructureV3, GPT-4o, Qwen2.5-VL-7B-Instruct, MonkeyOCR-1.2B-Pro, and MinerU2.0-VLM. These models were selected as they represent strong, yet imperfect, performance across various benchmarks.
48
+ To ensure a balanced distribution of error types for robust evaluation, **we meticulously supplemented the dataset with additional, hand-crafted examples**.
49
+ **Every parsing result is annotated with human-verified error types.**
50
+
51
+ The overall composition of DOCRcaseBench by model source is detailed in the table below.
52
+
53
+ | Model | Count | Percentage |
54
+ | :--- | :---: | :---: |
55
+ | MonkeyOCR-1.2B-Pro | 142 | 16.1% |
56
+ | PP-StructureV3 | 99 | 11.2% |
57
+ | MinerU2.0-pipeline | 149 | 16.9% |
58
+ | MinerU2.0-VLM | 22 | 2.5% |
59
+ | GPT-4o | 181 | 20.5% |
60
+ | Qwen2.5-VL-7B-Instruct | 105 | 11.9% |
61
+ | Experts (Hand-crafted/Supplemented) | 64 | 7.3% |
62
+ | **Total** | **882** | **100.0%** |
63
+
64
+ The distribution of document elements (cases) in DOCRcaseBench is summarized below:
65
+
66
+ | | Text | Table | Equation | **Total** |
67
+ | :--- | :---: | :---: | :---: | :---: |
68
+ | Good Case | 39 | 46 | 62 | 147 |
69
+ | Bad Case with Single Error | 339 | 141 | 81 | 561 |
70
+ | Bad Case with Multi Error | 70 | 55 | 49 | 174 |
71
+ | **Total** | **448** | **242** | **192** | **882** |
72
+
73
+
74
+ # 🔥 Performance
75
+ We report the evaluation results of various models on DOCRcaseBench.
76
+
77
+ * **F1 of Case:** Measures the model's accuracy in the binary classification of output quality (Good/Bad).
78
+ * **Recall, Precision, and F1 of Error Type:** Quantify the model's performance in detecting and correctly classifying the specific error types within the document parsing results.
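As a rough illustration of how these metrics fit together, the sketch below computes a case-level F1 and error-type precision, recall, and F1 from per-case labels. It is not the repository's `metrics.ipynb`: treating "bad" as the positive class, micro-averaging over error types, and the error-type names used in the example are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def case_f1(gold_bad: List[bool], pred_bad: List[bool]) -> float:
    """Binary Good/Bad F1. Assumption: 'bad' (True) is the positive class."""
    pairs = list(zip(gold_bad, pred_bad))
    tp = sum(g and p for g, p in pairs)
    fp = sum((not g) and p for g, p in pairs)
    fn = sum(g and (not p) for g, p in pairs)
    return precision_recall_f1(tp, fp, fn)[2]


def error_type_scores(
    gold: List[Set[str]], pred: List[Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged (precision, recall, F1) over per-case error-type sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return precision_recall_f1(tp, fp, fn)


# Three toy cases; the error-type names are hypothetical placeholders.
gold_types = [{"missing_text"}, set(), {"wrong_symbol", "row_shift"}]
pred_types = [{"missing_text"}, {"extra_text"}, {"wrong_symbol"}]

# A case counts as "bad" when it has at least one error type.
gold_bad = [bool(types) for types in gold_types]
pred_bad = [bool(types) for types in pred_types]

case_score = case_f1(gold_bad, pred_bad)
type_precision, type_recall, type_f1 = error_type_scores(gold_types, pred_types)
print(f"case F1={case_score:.2f}")
print(f"error type P={type_precision:.2f} R={type_recall:.2f} F1={type_f1:.2f}")
```

A macro average (scoring each error type or each case separately and then averaging) would give different numbers, so the averaging scheme should be checked against the evaluation notebook before comparing with the table below.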
79
+
80
+ **DOCR-Inspector-7B achieves the highest score on every reported metric for all three element types (text, table, and equation).**
81
+
82
+ <table>
83
+ <thead>
84
+ <tr>
85
+ <th rowspan="3">Model</th>
86
+ <th colspan="4">Text</th>
87
+ <th colspan="4">Table</th>
88
+ <th colspan="4">Equation</th>
89
+ </tr>
90
+ <tr>
91
+ <th>Case</th>
92
+ <th colspan="3">Error Type</th>
93
+ <th>Case</th>
94
+ <th colspan="3">Error Type</th>
95
+ <th>Case</th>
96
+ <th colspan="3">Error Type</th>
97
+ </tr>
98
+ <tr>
99
+ <th>F1</th>
100
+ <th>Recall</th>
101
+ <th>F1</th>
102
+ <th>Precision</th>
103
+ <th>F1</th>
104
+ <th>Recall</th>
105
+ <th>F1</th>
106
+ <th>Precision</th>
107
+ <th>F1</th>
108
+ <th>Recall</th>
109
+ <th>F1</th>
110
+ <th>Precision</th>
111
+ </tr>
112
+ </thead>
113
+ <tbody>
114
+ <tr>
115
+ <th colspan="13">Proprietary Non-Reasoning Models</th>
116
+ </tr>
117
+ <tr>
118
+ <td>GPT-4o w/o CoT</td>
119
+ <td>72.05</td><td>31.66</td><td>28.80</td><td>28.04</td>
120
+ <td>73.69</td><td>29.89</td><td>26.36</td><td>25.03</td>
121
+ <td>79.20</td><td>49.31</td><td>47.20</td><td>46.31</td>
122
+ </tr>
123
+ <tr>
124
+ <td>GPT-4o w/ CoT</td>
125
+ <td>77.69</td><td>30.54</td><td>27.25</td><td>26.35</td>
126
+ <td>81.23</td><td>34.23</td><td>29.64</td><td>28.17</td>
127
+ <td>79.38</td><td>46.44</td><td>45.45</td><td>45.40</td>
128
+ </tr>
129
+ <tr>
130
+ <td>Gemini 2.5 Flash w/o CoT</td>
131
+ <td>84.89</td><td>43.29</td><td>29.88</td><td>25.43</td>
132
+ <td>82.21</td><td>41.94</td><td>25.97</td><td>21.29</td>
133
+ <td>80.46</td><td>53.73</td><td>48.17</td><td>45.96</td>
134
+ </tr>
135
+ <tr>
136
+ <td>Gemini 2.5 Flash w/ CoT</td>
137
+ <td>84.75</td><td>42.24</td><td>29.74</td><td>25.69</td>
138
+ <td>81.16</td><td>42.36</td><td>24.10</td><td>19.25</td>
139
+ <td><ins>80.94</ins></td><td>50.61</td><td>46.17</td><td>44.63</td>
140
+ </tr>
141
+ <tr>
142
+ <th colspan="13">Open-source Non-Reasoning Models</th>
143
+ </tr>
144
+ <tr>
145
+ <td>Qwen2.5-VL-7B-Instruct w/o CoT</td>
146
+ <td>46.15</td><td>12.28</td><td>11.98</td><td>11.83</td>
147
+ <td>48.80</td><td>19.42</td><td>19.42</td><td>19.42</td>
148
+ <td>55.80</td><td>32.81</td><td>32.81</td><td>32.81</td>
149
+ </tr>
150
+ <tr>
151
+ <td>Qwen2.5-VL-7B-Instruct w/ CoT</td>
152
+ <td>38.17</td><td>12.05</td><td>11.64</td><td>11.50</td>
153
+ <td>43.48</td><td>21.56</td><td>21.72</td><td>22.11</td>
154
+ <td>68.10</td><td>32.29</td><td>32.12</td><td>32.03</td>
155
+ </tr>
156
+ <tr>
157
+ <td>Qwen2.5-VL-72B-Instruct w/o CoT</td>
158
+ <td>82.68</td><td>28.49</td><td>24.74</td><td>23.43</td>
159
+ <td>83.51</td><td>40.91</td><td>33.94</td><td>31.03</td>
160
+ <td>78.51</td><td>39.93</td><td>37.19</td><td>35.76</td>
161
+ </tr>
162
+ <tr>
163
+ <td>Qwen2.5-VL-72B-Instruct w/ CoT</td>
164
+ <td>74.55</td><td>30.97</td><td>26.23</td><td>24.56</td>
165
+ <td>76.82</td><td>40.70</td><td>31.77</td><td>28.43</td>
166
+ <td>79.14</td><td>44.53</td><td>41.23</td><td>39.79</td>
167
+ </tr>
168
+ <tr>
169
+ <th colspan="13">Reasoning Models</th>
170
+ </tr>
171
+ <tr>
172
+ <td>Qwen3-VL-235B-A22B-Thinking</td>
173
+ <td>83.90</td><td>42.02</td><td>31.19</td><td>27.46</td>
174
+ <td>83.13</td><td>39.12</td><td>28.57</td><td>25.49</td>
175
+ <td>78.56</td><td>40.80</td><td>38.45</td><td>37.76</td>
176
+ </tr>
177
+ <tr>
178
+ <td>Gemini 2.5 Pro Thinking</td>
179
+ <td>88.46</td><td>47.17</td><td>32.90</td><td>28.16</td>
180
+ <td>82.01</td><td><ins>43.60</ins></td><td>32.93</td><td>29.63</td>
181
+ <td>77.19</td><td>53.04</td><td>48.58</td><td><ins>47.27</ins></td>
182
+ </tr>
183
+ <tr>
184
+ <th colspan="13">Ours:</th>
185
+ </tr>
186
+ <tr class="ours-row">
187
+ <td><strong>DOCR-Inspector-7B</strong></td>
188
+ <td><strong>96.43</strong></td><td><strong>81.06</strong></td><td><strong>80.21</strong></td><td><strong>81.03</strong></td>
189
+ <td><strong>86.41</strong></td><td><strong>63.09</strong></td><td><strong>62.11</strong></td><td><strong>62.95</strong></td>
190
+ <td><strong>85.42</strong></td><td><strong>74.39</strong></td><td><strong>73.81</strong></td><td><strong>74.48</strong></td>
191
+ </tr>
192
+ </tbody>
193
+ </table>
194
+
195
+ # 🛠️ Usage
196
+
197
+ For more details on installation and usage, please visit the [DOCR-Inspector GitHub repository](https://github.com/ZZZZZQT/DOCR-Inspector).
198
+
199
+ ## Installation
200
+
201
+ DOCR-Inspector-7B is fine-tuned from Qwen2.5-VL-7B-Instruct, so you can follow the [Qwen2.5-VL-7B-Instruct installation guide](https://github.com/QwenLM/Qwen3-VL?tab=readme-ov-file#quickstart).
202
+
203
+ We highly recommend installing [`vLLM >= 0.7.2`](https://github.com/vllm-project/vllm) to improve inference speed.
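To confirm that an environment meets the recommended vLLM version, a small check along the following lines may help. This is only a sketch: the helper names `version_tuple`, `meets_minimum`, and `installed_vllm_ok` are made up for this example, and the comparison looks only at the leading `major.minor.patch` numbers, so pre-release and post-release suffixes are ignored.

```python
import re
from importlib import metadata
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading major.minor[.patch] numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(group) if group else 0 for group in match.groups())


def meets_minimum(version: str, minimum: str = "0.7.2") -> bool:
    """True if `version` is at least `minimum` (suffixes such as rc1 are ignored)."""
    return version_tuple(version) >= version_tuple(minimum)


def installed_vllm_ok(minimum: str = "0.7.2") -> bool:
    """Check the installed vLLM distribution; False if vLLM is not installed."""
    try:
        return meets_minimum(metadata.version("vllm"), minimum)
    except metadata.PackageNotFoundError:
        return False


print("vLLM meets the recommended version:", installed_vllm_ok())
```

For strict handling of pre-releases, the `packaging.version.Version` class is a more robust choice than this simplified tuple comparison.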
204
+
205
+ ## Inference with vLLM
206
+
207
+ Prepare a cropped image of each element and its corresponding parsing result. The data must follow the format used in `./DOCR-Inspector/demo_data` in the GitHub repository.
208
+
209
+ Then, run the following command to perform inference:
210
+ ```bash
211
+ python run_case_inf_vllm.py --model_path ZQTTTT/DOCR-Inspector-7B --image_path /path/to/image --ocr_path /path/to/parsing_result
212
+ ```
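When the command needs to be issued from a script, it can be assembled as an argument list. The sketch below only builds and prints the command line. The flag names and the default model ID are taken from the command above, while the helper name `build_inference_command` and the example paths are hypothetical.

```python
import shlex
from typing import List


def build_inference_command(
    image_path: str,
    ocr_path: str,
    model_path: str = "ZQTTTT/DOCR-Inspector-7B",
) -> List[str]:
    """Build the argument list for run_case_inf_vllm.py (flags as shown above)."""
    return [
        "python", "run_case_inf_vllm.py",
        "--model_path", model_path,
        "--image_path", image_path,
        "--ocr_path", ocr_path,
    ]


# Placeholder paths; replace them with a real element image and parsing result.
command = build_inference_command("/path/to/image", "/path/to/parsing_result")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory that contains `run_case_inf_vllm.py`.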
213
+
214
+ ## Evaluation
215
+
216
+ Download the benchmark data from [DOCRcaseBench](https://huggingface.co/datasets/ZQTTTT/DOCRcase-Datasets).
217
+ We provide a complete evaluation pipeline that supports inference using **DOCR-Inspector**, **API models**, and **vLLM**.
218
+
219
+ | Component | Description | Path |
220
+ |---|---|---|
221
+ | vLLM Inference Scripts | Run DOCR-Inspector locally | [`bench_inf_DOCR-Inspector.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_DOCR-Inspector.py) |
222
+ | vLLM Inference Scripts | Run other VLMs locally | [`bench_inf_qwenvl_vllm.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_qwenvl_vllm.py) |
223
+ | API Evaluation Scripts | Evaluate GPT/Gemini etc. | [`bench_inf_api.py`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/inf/bench_inf_api.py) |
224
+ | Pre-computed Paper Results | Results used in the main paper | [`evaluation/results`](https://github.com/ZZZZZQT/DOCR-Inspector/tree/main/evaluation/results/) |
225
+ | Metric Computation Notebook | Compute F1/Precision/Recall | [`metrics.ipynb`](https://github.com/ZZZZZQT/DOCR-Inspector/blob/main/evaluation/metrics/metrics.ipynb) |
226
+
227
+ ## Acknowledgements
228
+ - [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl)
229
+ - [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
230
+
231
+ # Citation
232
+ If you find our work helpful or inspiring, please feel free to cite it:
233
+ ```bibtex
234
+ @misc{zhou2024docrinspector,
235
+ title={DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM},
236
+ author={Yifei Zhou and Qianlan Yang and Kaixiang Lin and Min Bai and Xiong Zhou and Yu-Xiong Wang and Sergey Levine and Erran Li},
237
+ year={2025},
238
+ eprint={2512.10619},
239
+ archivePrefix={arXiv},
240
+ primaryClass={cs.CL}
241
+ }
242
+ ```