AlessioChenn committed (verified) · Commit cedb5e5 · Parent: 73c54dc

Update README.md

Files changed (1): README.md (+19 -5)
README.md CHANGED
@@ -24,14 +24,10 @@ license: apache-2.0
 
 ## Model description
 
-DocExplainer is a **first-step approach** to Visual Document Question Answering with bounding box localization.
+DocExplainer is an approach to Visual Document Question Answering with bounding box localization.
 Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
 It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.
 
-⚠️ **Important Note**:
-This model is **not intended as a final solution**, but rather as a **baseline framework and proof of concept**. Its purpose is to highlight the gap between textual accuracy and spatial grounding in current VLMs, and to serve as a foundation for future research on interpretable document understanding.
-
-
 - **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
 - **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
 - **License:** CC-BY-4.0
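The model card frames DocExplainer as a two-stage, plug-and-play design: an off-the-shelf VLM produces the textual answer, and a separate grounding module localizes it on the page. Below is a minimal sketch of that decoupling; the `vlm`/`grounder` objects and their `generate_answer`/`localize_answer` methods are hypothetical stand-ins, not the repository's actual API.

```python
# Illustrative sketch of the decoupled pipeline described in the model card.
# All names here are hypothetical stand-ins, not the repository's real API.
from dataclasses import dataclass

@dataclass
class GroundedAnswer:
    text: str   # textual answer produced by the VLM
    box: tuple  # (x0, y0, x1, y1), normalized page coordinates

def answer_with_grounding(vlm, grounder, image, question):
    # Stage 1: any off-the-shelf VLM answers the question (text only).
    text = vlm.generate_answer(image, question)            # hypothetical call
    # Stage 2: a separate grounding module maps that answer back onto
    # the page, so localization is decoupled from answer generation.
    box = grounder.localize_answer(image, question, text)  # hypothetical call
    return GroundedAnswer(text=text, box=box)
```

The point of the split is that the same grounding module can be reused across different VLMs, which is exactly the comparison the Performance tables added below make.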
@@ -165,6 +161,24 @@ Example Output:
 </table>
 
 
+## Performance
+Evaluated on the [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs) dataset:
+### Full DocExplainer Pipeline
+
+| VLM Model       | ANLS ↑ | IoU ↑ |
+| --------------- | ------ | ----- |
+| SmolVLM2-2.2b   | 0.572  | 0.175 |
+| qwen2.5-vl-7b   | 0.689  | 0.188 |
+
+
+### VLM-only Baseline (for comparison)
+| VLM Model       | ANLS ↑ | IoU ↑ |
+| --------------- | ------ | ----- |
+| SmolVLM2-2.2b   | 0.561  | 0.011 |
+| qwen2.5-vl-7b   | 0.720  | 0.038 |
+| Claude Sonnet 4 | 0.737  | 0.031 |
+
+
 ## Limitations
 - **Prototype only**: Intended as a first approach, not a production-ready solution.
 - **Dataset constraints**: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers requiring reasoning over multiple regions or not fully captured by OCR cannot be properly evaluated.
 
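For reference, the two columns in the added tables are standard document-VQA metrics: ANLS scores the textual answer via thresholded normalized Levenshtein similarity, and IoU scores the predicted bounding box against the reference region. Here is a minimal sketch of both, assuming boxes as (x0, y0, x1, y1) tuples and the usual ANLS threshold of 0.5; it is an illustration, not the evaluation code behind these numbers.

```python
# Minimal sketch of the two metrics reported in the tables above
# (illustrative only, not the actual evaluation script).

def iou(a, b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def anls(pred, gold, tau=0.5):
    """Normalized Levenshtein similarity for one answer pair.
    Similarities below tau are zeroed, as in the common DocVQA metric."""
    def lev(s, t):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            cur = [i]
            for j, ct in enumerate(t, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
            prev = cur
        return prev[-1]
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    sim = 1.0 - lev(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= tau else 0.0

# Example: a near-miss answer scores high on ANLS, while a slightly
# shifted box still earns partial IoU credit.
print(round(anls("invoice 1234", "invoice 123"), 3))                     # ~0.917
print(round(iou((0.1, 0.1, 0.4, 0.2), (0.12, 0.1, 0.45, 0.22)), 3))      # ~0.673
```

The large gap between the pipeline's IoU and the VLM-only baselines' IoU in the tables is the card's central claim: the VLMs answer well but localize poorly, and the grounding module recovers much of that localization.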