CoExVQA


Document Visual Question Answering (DocVQA) requires vision–language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Despite strong predictive performance, existing DocVQA systems entangle these two aspects and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework that enforces a grounded reasoning process through a chain-of-explanation design. The model first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. By making both evidence selection and spatial grounding intrinsic to prediction, CoExVQA enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence yields competitive performance while providing transparent and verifiable predictions.
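The chain-of-explanation pipeline above can be sketched as three composable stages. The sketch below is purely illustrative: toy heuristics stand in for the learned neural modules, and all function names are hypothetical (they do not appear in the repository).

```python
def select_evidence(tokens, scores, threshold=0.5):
    """Stage 1 (toy): keep the tokens the model scores as question-relevant."""
    return [t for t, s in zip(tokens, scores) if s >= threshold]

def localize(evidence):
    """Stage 2 (toy): the tightest box (x1, y1, x2, y2) covering the evidence."""
    xs1, ys1, xs2, ys2 = zip(*(t["box"] for t in evidence))
    return (min(xs1), min(ys1), max(xs2), max(ys2))

def decode(tokens, box):
    """Stage 3 (toy): read the answer exclusively from tokens inside the box."""
    x1, y1, x2, y2 = box
    inside = [
        t for t in tokens
        if x1 <= t["box"][0] and t["box"][2] <= x2
        and y1 <= t["box"][1] and t["box"][3] <= y2
    ]
    return " ".join(t["text"] for t in inside)
```

The key property is that stage 3 only sees content inside the predicted box, so the grounding is intrinsic to the prediction rather than a post-hoc explanation.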

To use this model, first download the repository from GitHub.

Usage

  1. Clone the repo:

     ```shell
     git clone git@github.com:KjetilIN/CoExVQA.git
     ```

  2. Download the pretrained model:

     ```python
     from src.model.model import CoExVQA

     model = CoExVQA.from_hf(cache_dir=args.cache_dir).to(device)
     model.eval()
     ```

  3. Use it to predict and generate labels:

     ```python
     # Get images and their queries
     images = batch["images"]
     questions = batch["questions"]

     # Forward pass
     outputs = model(
         images=images,
         questions=questions,
     )

     # Get the predicted box and mask
     box_pred = outputs["box"]
     q_mask = outputs.get("q_mask", None)

     # Or generate the text answers directly
     preds = model.generate_preds(
         images=images,
         questions=questions,
         gen_kwargs=gen_kwargs,
         gt_boxes=None,
     )
     ```
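The coordinate convention of `outputs["box"]` is not documented here; assuming the common case of normalized `(x1, y1, x2, y2)` coordinates in `[0, 1]` (check the repository to confirm), a small helper can map a predicted box back onto the page image for inspection:

```python
def box_to_pixels(box, width, height):
    """Map a normalized (x1, y1, x2, y2) box to integer pixel coordinates.

    Assumes coordinates in [0, 1]; verify this convention against the repo
    before relying on it for visualization.
    """
    x1, y1, x2, y2 = box
    # Clamp to the valid range in case the model predicts slightly outside it.
    clamp = lambda v: min(max(v, 0.0), 1.0)
    return (
        round(clamp(x1) * width),
        round(clamp(y1) * height),
        round(clamp(x2) * width),
        round(clamp(y2) * height),
    )
```

The pixel box can then be drawn on the page image (e.g. with PIL's `ImageDraw.rectangle`) to visually verify that the answer was decoded from the intended region.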

Citation

If you use this code or dataset in your research, please cite:

@misc{indrehus2026selfexplainabledocumentvisualquestion,
    title={Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions},
    author={Kjetil Indrehus and Adrian Duric and Changkyu Choi and Ali Ramezani-Kebrya},
    year={2026},
    eprint={2605.06058},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2605.06058},
}

This work uses the DocVQA dataset:

@misc{mathew2021docvqadatasetvqadocument,
    title={DocVQA: A Dataset for VQA on Document Images},
    author={Minesh Mathew and Dimosthenis Karatzas and C. V. Jawahar},
    year={2021},
    eprint={2007.00398},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2007.00398},
}