Update README.md
README.md
CHANGED
|
@@ -24,7 +24,7 @@ license: apache-2.0
|
|
| 24 |
|
| 25 |
## Model description
|
| 26 |
|
| 27 |
-
DocExplainer is an approach to Visual Document Question Answering with bounding box localization.
|
| 28 |
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
|
| 29 |
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.
|
| 30 |
|
|
@@ -162,22 +162,12 @@ Example Output:
|
|
| 162 |
|
| 163 |
|
| 164 |
## Performance
|
| 165 |
-
Evaluated on [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs) dataset:
|
| 166 |
-
### Full DocExplainer Pipeline
|
| 167 |
-
|
| 168 |
-
| VLM Model | ANLS ↑| IoU ↑ |
|
| 169 |
-
| --------------- | ----- | ----- |
|
| 170 |
-
| SmolVLM2-2.2b | 0.572 | 0.175 |
|
| 171 |
-
| qwen2.5-vl-7b | 0.689 | 0.188 |
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
### VLM-only Baseline (for comparison)
|
| 175 |
-
| VLM Model | ANLS ↑| IoU ↑ |
|
| 176 |
-
| --------------- | ----- | ----- |
|
| 177 |
-
| SmolVLM2-2.2b | 0.561 | 0.011 |
|
| 178 |
-
| qwen2.5-vl-7b | 0.720 | 0.038 |
|
| 179 |
-
| Claude Sonnet 4 | 0.737 | 0.031 |
|
| 180 |
|
| 181 |
|
| 182 |
## Limitations
|
| 183 |
- **Prototype only**: Intended as a first approach, not a production-ready solution.
|
|
|
|
| 24 |
|
| 25 |
## Model description
|
| 26 |
|
| 27 |
+
DocExplainer is an approach to Visual Document Question Answering (Document VQA) with bounding box localization.
|
| 28 |
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
|
| 29 |
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.
|
| 30 |
|
|
|
|
| 162 |
|
| 163 |
|
| 164 |
## Performance
|
| 165 |
|
| 166 |
+

|
| 167 |
+
|
| 168 |
+
Document VQA performance of different models and prompting strategies on the BoundingDocs v2.0 dataset. <br>
|
| 169 |
+
Smol stands for SmolVLM2-2.2B, Qwen stands for Qwen2.5-VL-7B, and D.E. stands for DocExplainerV0. <br>
|
| 170 |
+
The best value is shown in **bold**, the second-best value is <ins>underlined</ins>.
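
The tables removed in this change report two metrics, ANLS (answer text quality) and IoU (bounding box overlap). The sketch below shows how such scores are commonly computed for a single prediction. It is an illustration only, not the evaluation code used for BoundingDocs: the function names, the `(x_min, y_min, x_max, y_max)` box format, the lowercasing step, and the 0.5 ANLS threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def anls_score(pred, gt, threshold=0.5):
    """Per-sample ANLS: 1 - normalized edit distance, or 0 if the distance is too large."""
    # Assumed normalization: trim whitespace and ignore case.
    pred, gt = pred.strip().lower(), gt.strip().lower()
    if not pred and not gt:
        return 1.0
    nl = levenshtein(pred, gt) / max(len(pred), len(gt))
    return 1.0 - nl if nl < threshold else 0.0
```

A dataset-level score would then be the mean of the per-sample values, taking the maximum over ground-truth answers when a question has more than one.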
|
| 171 |
|
| 172 |
## Limitations
|
| 173 |
- **Prototype only**: Intended as a first approach, not a production-ready solution.
|