letxbe/DocExplainer

AlessioChenn committed on Aug 29, 2025

Commit 0a9db5e (verified) · 1 parent: cedb5e5

Update README.md

Files changed (1)

README.md +6 -16

@@ -24,7 +24,7 @@ license: apache-2.0
 ## Model description
-DocExplainer is a an approach to Visual Document Question Answering with bounding box localization.
+DocExplainer is a an approach to Visual Document Question Answering (Document VQA) with bounding box localization.
 Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
 It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.
@@ -162,22 +162,12 @@ Example Output:
 ## Performance
-Evaluated on [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs) dataset:
-### Full DocExplainer Pipeline
-| VLM Model       | ANLS ↑| IoU ↑ |
-| --------------- | ----- | ----- |
-| SmolVLM2-2.2b   | 0.572 | 0.175 |
-| qwen2.5-vl-7b   | 0.689 | 0.188 |
-### VLM-only Baseline (for comparison)
-| VLM Model       | ANLS ↑| IoU ↑ |
-| --------------- | ----- | ----- |
-| SmolVLM2-2.2b   | 0.561 | 0.011 |
-| qwen2.5-vl-7b   | 0.720 | 0.038 |
-| Claude Sonnet 4 | 0.737 | 0.031 |
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/66436107bfa436269e92b10a/53foSyYSG55GgP1yGCyvn.png)
+Document VQA performance of different models and prompting strategies on the BoundingDocs v2.0 dataset. <br>
+Smol stands for SmolVLM-2.2B, Qwen stands for Qwen2-VL-7B, and D.E. stands for DocExplainerV0.  <br>
+The best value is shown in **bold**, the second-best value is <ins>underlined</ins>.
 ## Limitations
 - **Prototype only**: Intended as a first approach, not a production-ready solution.
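
Note on the IoU column in the removed tables: IoU (intersection over union) scores how well a predicted bounding box overlaps a reference box, from 0 (no overlap) to 1 (identical boxes). The sketch below is a minimal illustration of that metric for two axis-aligned boxes. It is not taken from the DocExplainer code: the function name `iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for the example, and the box format and aggregation actually used in the BoundingDocs evaluation are not stated in this diff.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection over union of two axis-aligned boxes.

    Assumed (hypothetical) format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at 0 when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap; return 0.
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes offset by 1 in each direction: overlap area 1, union area 7.
score = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Under these assumptions, a dataset-level score would typically be the mean of per-answer IoU values, though the averaging used for the reported numbers is not specified here.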