---
datasets:
  - letxbe/BoundingDocs
language:
  - en
pipeline_tag: visual-question-answering
tags:
  - Visual-Question-Answering
  - Question-Answering
  - Document
license: apache-2.0
---

# DocExplainerV0: Visual Document QA with Bounding Box Localization


## Model description

DocExplainerV0 is a first-step approach to Visual Document Question Answering with bounding box localization.
Unlike standard VLMs that only provide text-based answers, DocExplainerV0 adds visual evidence through bounding boxes, making model predictions more interpretable.
It is designed as a plug-and-play module to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.

> ⚠️ **Important Note:** This model is not intended as a final solution, but rather as a baseline framework and proof of concept. Its purpose is to highlight the gap between textual accuracy and spatial grounding in current VLMs, and to serve as a foundation for future research on interpretable document understanding.

---

## Model Details

DocExplainerV0 is a fine-tuned SigLIP2-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process:

  1. Question Answering: Any VLM processes the document image and question to generate a textual answer.
  2. Bounding Box Explanation: DocExplainerV0 takes the image, question, and generated answer to predict the coordinates of the supporting evidence.
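
Because the two stages are decoupled, any off-the-shelf VLM can fill stage 1. Below is a minimal sketch of the pipeline, where `generate_answer` is a hypothetical placeholder for whichever VLM you choose (the `explainer.predict` call matches the Quick Start below):

```python
from PIL import Image
from transformers import AutoModel

def generate_answer(image: Image.Image, question: str) -> str:
    """Stage 1 (hypothetical placeholder): any VLM that returns a textual answer."""
    raise NotImplementedError("Plug in your preferred VLM here.")

# Stage 2: DocExplainerV0 grounds the generated answer in the page.
explainer = AutoModel.from_pretrained("letxbe/DocExplainerv0", trust_remote_code=True)

def answer_with_evidence(image: Image.Image, question: str):
    answer = generate_answer(image, question)  # stage 1: textual answer
    bbox = explainer.predict(image, answer)    # stage 2: normalized [x1, y1, x2, y2]
    return answer, bbox
```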

### Model Architecture

DocExplainerV0 builds on SigLIP2 visual and text embeddings.

![DocExplainerV0 architecture diagram](https://i.postimg.cc/BQR21cxp/Blank-diagram-5.png)

### Training Procedure

- Visual and textual embeddings from SigLIP2 are projected into a shared latent space and fused via fully connected layers.
- A regression head outputs normalized coordinates `[x1, y1, x2, y2]`.
- Backbone: SigLIP2 (frozen).
- Loss function: Smooth L1 (Huber loss) applied to normalized coordinates in `[0, 1]`.
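
As a rough illustration of the fusion-and-regression head described above, here is a minimal PyTorch sketch (layer sizes and module names are assumptions for the sketch, not the published architecture):

```python
import torch
import torch.nn as nn

class BBoxRegressor(nn.Module):
    """Illustrative sketch: fuse frozen SigLIP2 embeddings, regress one box."""

    def __init__(self, vis_dim: int = 768, txt_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)  # project image embedding
        self.txt_proj = nn.Linear(txt_dim, hidden)  # project question/answer embedding
        self.fuse = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 4)  # regression head -> [x1, y1, x2, y2]

    def forward(self, vis_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([self.vis_proj(vis_emb), self.txt_proj(txt_emb)], dim=-1))
        return torch.sigmoid(self.head(fused))  # keep coordinates in [0, 1]

criterion = nn.SmoothL1Loss()  # Smooth L1 (Huber loss) on normalized coordinates
```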

### Training Setup

- Dataset: BoundingDocs v2.0
- Epochs: 20
- Optimizer: AdamW (default settings)
- Hardware: 1 × NVIDIA L40S-1-48G GPU
- Model selection: best checkpoint chosen by highest mean IoU on the validation split.
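
Since checkpoints are ranked by mean IoU, here is the standard IoU computation for the `[x1, y1, x2, y2]` box format used throughout this card (a plain-Python reference, not the project's evaluation code):

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes (normalized coords)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```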

## Quick Start

Here is a simple example of how to use DocExplainerV0 to localize a VLM-generated answer in a document image.

```python
from PIL import Image
import requests
from transformers import AutoModel

# Load an example document image from BoundingDocs
url = "https://datasets-server.huggingface.co/cached-assets/letxbe/BoundingDocs/--/47db6d2b6af0aadfd082591a8445d0f47c3b8d61/--/default/test/7/doc_images/image-1d100e9.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "What is the invoice number?"
answer = "3Y8M2d-846"  # generate it with any VLM

# Load DocExplainerV0 and predict the box that supports the answer
explainer = AutoModel.from_pretrained("letxbe/DocExplainerv0", trust_remote_code=True)
bbox = explainer.predict(image, answer)
print(f"Bounding box: {bbox}")  # normalized [x1, y1, x2, y2]
```

Example output:

```
Question: What is the invoice number?
Answer: 3Y8M2d-846

Predicted BBox: [0.6353235244750977, 0.03685223311185837, 0.8617828488349915, 0.058749228715896606]
```

*Visualized answer location: the invoice page with the predicted bounding box overlaid.*
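
To reproduce such a visualization, the normalized coordinates can be scaled to pixel space and drawn with PIL (a minimal sketch, assuming `image` and `bbox` from the Quick Start above):

```python
from PIL import ImageDraw

w, h = image.size
x1, y1, x2, y2 = bbox
draw = ImageDraw.Draw(image)
draw.rectangle([x1 * w, y1 * h, x2 * w, y2 * h], outline="red", width=3)
image.save("answer_location.png")
```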

## Performance

Evaluated on the BoundingDocs v2.0 dataset:

### Full DocExplainer Pipeline

| VLM Model | ANLS ↑ | IoU ↑ |
|---|---|---|
| SmolVLM2-2.2b | 0.572 | 0.175 |
| qwen2.5-vl-7b | 0.689 | 0.188 |

### VLM-only Baseline (for comparison)

| VLM Model | ANLS ↑ | IoU ↑ |
|---|---|---|
| SmolVLM2-2.2b | 0.561 | 0.011 |
| qwen2.5-vl-7b | 0.720 | 0.038 |
| Claude Sonnet 4 | 0.737 | 0.031 |

## Limitations

- **Prototype only:** intended as a first approach, not a production-ready solution.
- **Dataset constraints:** the current evaluation is limited to cases where an answer fits in a single bounding box; answers requiring reasoning over multiple regions, or not fully captured by OCR, cannot be properly evaluated.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{docexplainer2025,
  title={Towards Reliable and Interpretable Document Question Answering via VLMs},
  author={[Your Name]},
  year={2025},
  url={}
}
```