AlessioChenn committed
Commit 0a9db5e (verified) · 1 parent: cedb5e5

Update README.md

Files changed (1)
  1. README.md +6 -16
README.md CHANGED
@@ -24,7 +24,7 @@ license: apache-2.0

 ## Model description

-DocExplainer is a an approach to Visual Document Question Answering with bounding box localization.
+DocExplainer is a an approach to Visual Document Question Answering (Document VQA) with bounding box localization.
 Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
 It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.

@@ -162,22 +162,12 @@ Example Output:


 ## Performance
-Evaluated on [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs) dataset:
-### Full DocExplainer Pipeline
-
-| VLM Model | ANLS ↑| IoU ↑ |
-| --------------- | ----- | ----- |
-| SmolVLM2-2.2b | 0.572 | 0.175 |
-| qwen2.5-vl-7b | 0.689 | 0.188 |
-
-
-### VLM-only Baseline (for comparison)
-| VLM Model | ANLS ↑| IoU ↑ |
-| --------------- | ----- | ----- |
-| SmolVLM2-2.2b | 0.561 | 0.011 |
-| qwen2.5-vl-7b | 0.720 | 0.038 |
-| Claude Sonnet 4 | 0.737 | 0.031 |

+![image/png](https://cdn-uploads.huggingface.co/production/uploads/66436107bfa436269e92b10a/53foSyYSG55GgP1yGCyvn.png)
+
+Document VQA performance of different models and prompting strategies on the BoundingDocs v2.0 dataset. <br>
+Smol stands for SmolVLM-2.2B, Qwen stands for Qwen2-VL-7B, and D.E. stands for DocExplainerV0. <br>
+The best value is shown in **bold**, the second-best value is <ins>underlined</ins>.

 ## Limitations
 - **Prototype only**: Intended as a first approach, not a production-ready solution.