CoExVQA
Document Visual Question Answering (DocVQA) requires vision–language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Despite strong predictive performance, existing DocVQA systems entangle these two aspects and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework that enforces a grounded reasoning process through a chain-of-explanation design. The model first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. By making both evidence selection and spatial grounding intrinsic to prediction, CoExVQA enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence yields competitive performance while providing transparent and verifiable predictions.
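The three-stage chain-of-explanation flow described above (evidence selection → spatial grounding → region-restricted decoding) can be sketched at a high level. Everything below is an illustrative stand-in for the model's learned components, not the repository's actual API: the keyword match plays the role of the relevance module, and the row-expansion heuristic plays the role of the grounding head.

```python
def chain_of_explanation(tokens, boxes, question):
    """Illustrative three-stage chain: evidence -> region -> answer.

    NOTE: this is a hypothetical sketch, not CoExVQA's real API.
    tokens: OCR words; boxes: matching (x0, y0, x1, y1) word boxes.
    """
    # Stage 1: mark question-relevant evidence (a naive keyword match
    # standing in for the learned relevance module)
    evidence = [tok.strip(":?").lower() in question.lower() for tok in tokens]

    # Stage 2: localize the answer region; here we expand the evidence
    # boxes to the full text line they sit on
    kept = [b for b, e in zip(boxes, evidence) if e]
    y0 = min(b[1] for b in kept)
    y1 = max(b[3] for b in kept)
    row = [b for b in boxes if b[1] < y1 and b[3] > y0]  # same text line
    region = (min(b[0] for b in row), y0, max(b[2] for b in row), y1)

    # Stage 3: decode the answer from tokens inside the region ONLY
    answer = " ".join(
        tok for tok, b in zip(tokens, boxes)
        if b[0] >= region[0] and b[1] >= region[1]
        and b[2] <= region[2] and b[3] <= region[3]
    )
    return answer, region, evidence


# Toy document: word tokens with their bounding boxes
tokens = ["Invoice", "Date:", "2021-03-01", "Total:", "$42.00"]
boxes = [(0, 0, 10, 2), (0, 2, 10, 4), (10, 2, 20, 4),
         (0, 4, 10, 6), (10, 4, 20, 6)]
answer, region, evidence = chain_of_explanation(
    tokens, boxes, "What is the total?"
)
```

Because the answer is decoded exclusively from the grounded region, the returned `region` and `evidence` mask can be inspected directly to verify what the prediction depended on.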
To use this work, first download the repository from GitHub.
Usage
- Clone the repo

```shell
git clone git@github.com:KjetilIN/CoExVQA.git
```
- Download the pretrained model

```python
import torch

from src.model.model import CoExVQA

# Select a device and load the pretrained weights from the Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CoExVQA.from_hf(cache_dir=args.cache_dir).to(device)
model.eval()
```
- Use it to predict and generate labels

```python
# Get images and their queries
images = batch["images"]
questions = batch["questions"]

# Forward pass
outputs = model(
    images=images,
    questions=questions,
)

# Get the predicted box and mask
box_pred = outputs["box"]
q_mask = outputs.get("q_mask", None)

# Or just generate the text answers
preds = model.generate_preds(
    images=images,
    questions=questions,
    gen_kwargs=gen_kwargs,
    gt_boxes=None,
)
```
Citation
If you use this code or dataset in your research, please cite the following:
```bibtex
@misc{indrehus2026selfexplainabledocumentvisualquestion,
      title={Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions},
      author={Kjetil Indrehus and Adrian Duric and Changkyu Choi and Ali Ramezani-Kebrya},
      year={2026},
      eprint={2605.06058},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.06058},
}
```
This work uses the DocVQA dataset:
```bibtex
@misc{mathew2021docvqadatasetvqadocument,
      title={DocVQA: A Dataset for VQA on Document Images},
      author={Minesh Mathew and Dimosthenis Karatzas and C. V. Jawahar},
      year={2021},
      eprint={2007.00398},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2007.00398},
}
```
Base model: google/pix2struct-docvqa-base