ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

Hao Yang¹, Yifan Ji¹, Zhipeng Xu¹, Zhenghao Liu¹, Yukun Yan², Zulong Chen³, Shuo Wang², Yu Gu¹, Ge Yu¹

¹Northeastern University, ²Tsinghua University, ³Alibaba Group

Overview

Reasoning-Guided Alignment (ReAlign) is a method that enhances visual document retrieval by leveraging the reasoning capability of Vision-Language Models (VLMs) to provide fine-grained visual document descriptions as supervision signals for training. By identifying query-related regions on a page and generating query-aware descriptions, ReAlign helps the retriever focus on critical visual cues within complex layouts.

This repository contains the visual document retriever based on Qwen2.5-VL-7B-Instruct.
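As a rough illustration of how a dense retriever like this is typically used at query time, the sketch below ranks document pages against a query by embedding similarity. It is not the ReAlign API: the vectors are hypothetical toy values standing in for model outputs, the `cosine` and `rank_pages` helpers are illustrative names, and cosine similarity as the scoring function is an assumption, since this README does not specify one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pages(query_vec, page_vecs, top_k=3):
    """Return up to top_k (page_id, score) pairs, highest similarity first."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in page_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings: in practice these would come from the retriever,
# one vector for the text query and one per document page image.
query_vec = [1.0, 0.0, 1.0]
page_vecs = {
    "page_a": [1.0, 0.0, 1.0],
    "page_b": [0.0, 1.0, 0.0],
    "page_c": [1.0, 1.0, 0.0],
}

ranking = rank_pages(query_vec, page_vecs, top_k=2)
for page_id, score in ranking:
    print(f"{page_id}: {score:.3f}")
```

In a real pipeline the page embeddings would usually be computed once and stored in an index, so only the query needs to be encoded at search time.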

The paper is available at ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment.

Our work has been accepted to SIGIR 2026 🎉🎉🎉!

Collections

We have made the following resources available in the 🤗ReAlign collection.

Resource	Description	Link
ReAlign-Phi3v	The visual document retriever based on Phi-3-vision-128k-instruct	🤗ReAlign-Phi3v
ReAlign-Qwen	The visual document retriever based on Qwen2.5-VL-7B-Instruct	🤗ReAlign-Qwen
Training Data	The data used to train the ReAlign retriever	🤗ReAlign-Trainset

Setup

For detailed training instructions and data preparation, please refer to the official GitHub repository: ReAlign.

Citation

@article{yang2026realign,
      title={ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment},
      author={Yang, Hao and Ji, Yifan and Xu, Zhipeng and Liu, Zhenghao and Yan, Yukun and Chen, Zulong and Wang, Shuo and Gu, Yu and Yu, Ge},
      year={2026},
      url={https://arxiv.org/abs/2604.07419},
}