arxiv:2605.06058

Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions

Published on May 7

Abstract

A self-explainable DocVQA framework named CoExVQA is proposed that uses a chain-of-explanation approach to identify relevant evidence, localize answers, and decode responses from grounded regions, achieving state-of-the-art performance with transparent and verifiable predictions.

AI-generated summary

Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence with answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework that grounds its reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Because every prediction passes through this chain of explanation, the reasoning process can be directly inspected and verified across modalities. Empirical results show that restricting decoding to grounded evidence achieves state-of-the-art explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over current explainable baselines while providing transparent and verifiable predictions.
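
To make the three-stage pipeline concrete, the following is a minimal Python sketch of a CoExVQA-style inference loop. The Region type and the model methods find_evidence, localize_answer, and decode_answer are illustrative assumptions, not the authors' published interface; the page image is assumed to be a PIL Image.

from dataclasses import dataclass

@dataclass
class Region:
    # Axis-aligned bounding box on the page, in pixel coordinates.
    x0: int
    y0: int
    x1: int
    y1: int

def chain_of_explanation(model, page_image, question: str) -> dict:
    # Stage 1: identify the evidence on the page that is relevant
    # to the question (hypothetical model method).
    evidence: list[Region] = model.find_evidence(page_image, question)

    # Stage 2: explicitly localize the answer region within that evidence
    # (hypothetical model method).
    answer_region: Region = model.localize_answer(page_image, question, evidence)

    # Stage 3: decode the answer exclusively from the grounded region,
    # so the prediction cannot draw on unverified parts of the page.
    crop = page_image.crop((answer_region.x0, answer_region.y0,
                            answer_region.x1, answer_region.y1))
    answer: str = model.decode_answer(crop, question)

    # Return the whole chain so each intermediate step can be
    # inspected and verified against the document.
    return {"evidence": evidence, "answer_region": answer_region, "answer": answer}

Returning the intermediate evidence and region, rather than only the final answer string, is what makes the prediction verifiable: a reader can check that the cited region actually supports the decoded answer.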



