AlessioChenn committed (verified) · Commit cedb5e5 · Parent: 73c54dc

Update README.md

Files changed (1): README.md (+19 -5)
README.md CHANGED
@@ -24,14 +24,10 @@ license: apache-2.0
 
 ## Model description
 
-DocExplainer is a **first-step approach** to Visual Document Question Answering with bounding box localization.
+DocExplainer is an approach to Visual Document Question Answering with bounding box localization.
 Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
 It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.
 
-⚠️ **Important Note**:
-This model is **not intended as a final solution**, but rather as a **baseline framework and proof of concept**. Its purpose is to highlight the gap between textual accuracy and spatial grounding in current VLMs, and to serve as a foundation for future research on interpretable document understanding.
-
-
 - **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
 - **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
 - **License:** CC-BY-4.0
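The model card frames DocExplainer as a two-stage, plug-and-play design: an off-the-shelf VLM produces the textual answer, and a separate grounding module localizes it on the page. Below is a minimal sketch of that decoupling; the `vlm`/`grounder` objects and their `generate_answer`/`localize_answer` methods are hypothetical stand-ins, not the repository's actual API.

```python
# Illustrative sketch of the decoupled pipeline described in the model card.
# All names here are hypothetical stand-ins, not the repository's real API.
from dataclasses import dataclass

@dataclass
class GroundedAnswer:
    text: str   # textual answer produced by the VLM
    box: tuple  # (x0, y0, x1, y1), normalized page coordinates

def answer_with_grounding(vlm, grounder, image, question):
    # Stage 1: any off-the-shelf VLM answers the question (text only).
    text = vlm.generate_answer(image, question)            # hypothetical call
    # Stage 2: a separate grounding module maps that answer back onto
    # the page, so localization is decoupled from answer generation.
    box = grounder.localize_answer(image, question, text)  # hypothetical call
    return GroundedAnswer(text=text, box=box)
```

The point of the split is that the same grounding module can be reused across different VLMs, which is exactly the comparison the Performance tables added below make.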
@@ -165,6 +161,24 @@ Example Output:
 </table>
 
 
+## Performance
+Evaluated on the [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs) dataset:
+### Full DocExplainer Pipeline
+
+| VLM Model       | ANLS ↑ | IoU ↑ |
+| --------------- | ------ | ----- |
+| SmolVLM2-2.2b   | 0.572  | 0.175 |
+| qwen2.5-vl-7b   | 0.689  | 0.188 |
+
+
+### VLM-only Baseline (for comparison)
+| VLM Model       | ANLS ↑ | IoU ↑ |
+| --------------- | ------ | ----- |
+| SmolVLM2-2.2b   | 0.561  | 0.011 |
+| qwen2.5-vl-7b   | 0.720  | 0.038 |
+| Claude Sonnet 4 | 0.737  | 0.031 |
+
+
 ## Limitations
 - **Prototype only**: Intended as a first approach, not a production-ready solution.
 - **Dataset constraints**: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers requiring reasoning over multiple regions or not fully captured by OCR cannot be properly evaluated.
 
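For reference, the two columns in the added tables are standard document-VQA metrics: ANLS scores the textual answer via thresholded normalized Levenshtein similarity, and IoU scores the predicted bounding box against the reference region. Here is a minimal sketch of both, assuming boxes as (x0, y0, x1, y1) tuples and the usual ANLS threshold of 0.5; it is an illustration, not the evaluation code behind these numbers.

```python
# Minimal sketch of the two metrics reported in the tables above
# (illustrative only, not the actual evaluation script).

def iou(a, b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def anls(pred, gold, tau=0.5):
    """Normalized Levenshtein similarity for one answer pair.
    Similarities below tau are zeroed, as in the common DocVQA metric."""
    def lev(s, t):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            cur = [i]
            for j, ct in enumerate(t, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
            prev = cur
        return prev[-1]
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    sim = 1.0 - lev(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= tau else 0.0

# Example: a near-miss answer scores high on ANLS, while a slightly
# shifted box still earns partial IoU credit.
print(round(anls("invoice 1234", "invoice 123"), 3))                     # ~0.917
print(round(iou((0.1, 0.1, 0.4, 0.2), (0.12, 0.1, 0.45, 0.22)), 3))      # ~0.673
```

The large gap between the pipeline's IoU and the VLM-only baselines' IoU in the tables is the card's central claim: the VLMs answer well but localize poorly, and the grounding module recovers much of that localization.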