CoExVQA
Document Visual Question Answering (DocVQA) requires vision–language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Despite strong predictive performance, existing DocVQA systems entangle these two aspects and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework that enforces a grounded reasoning process through a chain-of-explanation design. The model first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. By making both evidence selection and spatial grounding intrinsic to prediction, CoExVQA enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence yields competitive performance while providing transparent and verifiable predictions.
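The three-stage chain-of-explanation flow described above (evidence selection → spatial grounding → region-restricted decoding) can be sketched at a high level. Everything below is an illustrative stand-in for the model's learned components, not the repository's actual API: the keyword match plays the role of the relevance module, and the row-expansion heuristic plays the role of the grounding head.

```python
def chain_of_explanation(tokens, boxes, question):
    """Illustrative three-stage chain: evidence -> region -> answer.

    NOTE: this is a hypothetical sketch, not CoExVQA's real API.
    tokens: OCR words; boxes: matching (x0, y0, x1, y1) word boxes.
    """
    # Stage 1: mark question-relevant evidence (a naive keyword match
    # standing in for the learned relevance module)
    evidence = [tok.strip(":?").lower() in question.lower() for tok in tokens]

    # Stage 2: localize the answer region; here we expand the evidence
    # boxes to the full text line they sit on
    kept = [b for b, e in zip(boxes, evidence) if e]
    y0 = min(b[1] for b in kept)
    y1 = max(b[3] for b in kept)
    row = [b for b in boxes if b[1] < y1 and b[3] > y0]  # same text line
    region = (min(b[0] for b in row), y0, max(b[2] for b in row), y1)

    # Stage 3: decode the answer from tokens inside the region ONLY
    answer = " ".join(
        tok for tok, b in zip(tokens, boxes)
        if b[0] >= region[0] and b[1] >= region[1]
        and b[2] <= region[2] and b[3] <= region[3]
    )
    return answer, region, evidence


# Toy document: word tokens with their bounding boxes
tokens = ["Invoice", "Date:", "2021-03-01", "Total:", "$42.00"]
boxes = [(0, 0, 10, 2), (0, 2, 10, 4), (10, 2, 20, 4),
         (0, 4, 10, 6), (10, 4, 20, 6)]
answer, region, evidence = chain_of_explanation(
    tokens, boxes, "What is the total?"
)
```

Because the answer is decoded exclusively from the grounded region, the returned `region` and `evidence` mask can be inspected directly to verify what the prediction depended on.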
To use this work, first download the repository from GitHub.
Usage
- Clone the repo

```shell
git clone git@github.com:KjetilIN/CoExVQA.git
```
- Download the pretrained model

```python
import torch

from src.model.model import CoExVQA

# Select a device and load the pretrained weights from the Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CoExVQA.from_hf(cache_dir=args.cache_dir).to(device)
model.eval()
```
- Use it to predict and generate labels

```python
# Get images and their queries
images = batch["images"]
questions = batch["questions"]

# Forward pass
outputs = model(
    images=images,
    questions=questions,
)

# Get the predicted box and mask
box_pred = outputs["box"]
q_mask = outputs.get("q_mask", None)

# Or just generate the text answers
preds = model.generate_preds(
    images=images,
    questions=questions,
    gen_kwargs=gen_kwargs,
    gt_boxes=None,
)
```
Citation
If you use this code or dataset in your research, please cite the following:
```bibtex
@misc{indrehus2026selfexplainabledocumentvisualquestion,
      title={Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions},
      author={Kjetil Indrehus and Adrian Duric and Changkyu Choi and Ali Ramezani-Kebrya},
      year={2026},
      eprint={2605.06058},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.06058},
}
```
This work uses the DocVQA dataset:
```bibtex
@misc{mathew2021docvqadatasetvqadocument,
      title={DocVQA: A Dataset for VQA on Document Images},
      author={Minesh Mathew and Dimosthenis Karatzas and C. V. Jawahar},
      year={2021},
      eprint={2007.00398},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2007.00398},
}
```
Base model: google/pix2struct-docvqa-base