## Model description

DocExplainer is an approach to Visual Document Question Answering with bounding box localization.
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding; a schematic sketch of this two-stage pattern is given below.

⚠️ **Important Note**: This model is **not intended as a final solution**, but rather as a **baseline framework and proof of concept**. Its purpose is to highlight the gap between textual accuracy and spatial grounding in current VLMs, and to serve as a foundation for future research on interpretable document understanding.

- **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
- **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
- **License:** CC-BY-4.0
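The plug-and-play design can be summarized as a two-stage pipeline: a VLM first produces a textual answer, and a separate grounding module then localizes that answer on the page. The sketch below illustrates this pattern only; the names (`answer_question`, `localize_answer`, `GroundedAnswer`) are illustrative placeholders, not this repository's actual API.

```python
# Schematic sketch only: all names below are illustrative placeholders,
# not the actual API of this repository.
from dataclasses import dataclass
from typing import Tuple

from PIL import Image


@dataclass
class GroundedAnswer:
    text: str                               # textual answer from the VLM
    box: Tuple[float, float, float, float]  # evidence region as (x0, y0, x1, y1)


def answer_question(image: Image.Image, question: str) -> str:
    """Stage 1: any off-the-shelf VLM (e.g. SmolVLM2, Qwen2.5-VL) generates the answer text."""
    raise NotImplementedError("plug in your VLM inference call here")


def localize_answer(image: Image.Image, question: str, answer: str) -> Tuple[float, float, float, float]:
    """Stage 2: a DocExplainer-style grounding module maps (image, question, answer) to a box."""
    raise NotImplementedError("plug in the grounding module here")


def grounded_qa(image: Image.Image, question: str) -> GroundedAnswer:
    # Answer generation and spatial grounding stay decoupled: the evidence box
    # is predicted after, and conditioned on, the textual answer.
    answer = answer_question(image, question)
    box = localize_answer(image, question, answer)
    return GroundedAnswer(text=answer, box=box)
```

Decoupling the two stages is what allows the same grounding module to be combined with different off-the-shelf VLMs, as in the comparison reported in the Performance section.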
</table>

## Performance

Evaluated on the [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs) dataset:

### Full DocExplainer Pipeline

| VLM Model       | ANLS ↑ | IoU ↑ |
| --------------- | ------ | ----- |
| SmolVLM2-2.2b   | 0.572  | 0.175 |
| qwen2.5-vl-7b   | 0.689  | 0.188 |

### VLM-only Baseline (for comparison)

| VLM Model       | ANLS ↑ | IoU ↑ |
| --------------- | ------ | ----- |
| SmolVLM2-2.2b   | 0.561  | 0.011 |
| qwen2.5-vl-7b   | 0.720  | 0.038 |
| Claude Sonnet 4 | 0.737  | 0.031 |
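For reference, the two reported metrics can be computed as sketched below using their standard definitions (ANLS with the usual 0.5 threshold; IoU over axis-aligned boxes). This is not the evaluation script used to produce the numbers above, and the `(x0, y0, x1, y1)` box format is an assumption.

```python
# Minimal sketch of the two reported metrics under their standard definitions.
# Assumes boxes are (x0, y0, x1, y1) in the same coordinate system.

def iou(box_a, box_b):
    """Intersection-over-Union between two axis-aligned boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def anls(prediction, target, tau=0.5):
    """Normalized Levenshtein similarity for one answer pair (0 if below the tau threshold)."""
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    a, b = prediction.lower().strip(), target.lower().strip()
    if not a and not b:
        return 1.0
    nl = levenshtein(a, b) / max(len(a), len(b))  # normalized edit distance in [0, 1]
    return 1.0 - nl if nl < tau else 0.0


print(iou((0.10, 0.20, 0.30, 0.40), (0.15, 0.25, 0.35, 0.45)))  # partial overlap
print(anls("invoice 42", "Invoice 42"))                          # case-insensitive match -> 1.0
```

The contrast between the relatively high ANLS and low IoU values in both tables reflects the gap between textual accuracy and spatial grounding that this model card highlights.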

## Limitations

- **Prototype only**: Intended as a first approach, not a production-ready solution.
- **Dataset constraints**: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers requiring reasoning over multiple regions or not fully captured by OCR cannot be properly evaluated.