---
pipeline_tag: image-text-to-text
---

# V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

[V-Retrver](https://huggingface.co/papers/2602.06034) is an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection.

## About V-Retrver

V-Retrver enables a Multimodal Large Language Model (MLLM) to selectively acquire visual evidence during reasoning via external visual tools. It performs a **multimodal interleaved reasoning** process that alternates between hypothesis generation and targeted visual verification.
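
The sketch below illustrates what such an interleaved loop could look like: the model alternates between free-form reasoning and calls to external visual tools until it commits to a relevance decision. This is a minimal, hypothetical sketch; the `mllm.generate` interface, the tool registry, and the `relevance_score` field are illustrative assumptions, not the API released in the repository.

```python
def agentic_retrieval_step(mllm, query, candidate_image, visual_tools, max_turns=4):
    """Score one (query, candidate) pair via interleaved hypothesis/verification turns.

    Hypothetical interface: `mllm.generate` returns either a tool request or a
    final relevance decision; `visual_tools` maps tool names (e.g. a crop-and-zoom
    tool) to callables that return image evidence.
    """
    context = [{"role": "user", "content": [query, candidate_image]}]
    for _ in range(max_turns):
        step = mllm.generate(context)              # hypothesis or tool request
        if step.tool_call is None:                 # model committed to an answer
            return step.relevance_score
        tool = visual_tools[step.tool_call.name]   # look up the requested tool
        evidence = tool(candidate_image, **step.tool_call.args)
        context.append({"role": "tool", "content": [evidence]})  # feed evidence back
    return mllm.generate(context).relevance_score  # out of turns: force a decision
```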

To train this agent, the authors adopted a curriculum-based learning strategy combining:

- **Supervised Reasoning Activation:** Initial activation of retrieval-specific reasoning.
- **Rejection-Based Refinement:** Improving reasoning reliability via rejection sampling (sketched after this list).
- **Reinforcement Learning:** Fine-tuning with an evidence-aligned objective.
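
As a rough illustration of the rejection-based stage, the sketch below samples several reasoning trajectories per training query and keeps only those whose final retrieval decision matches the ground truth, so the surviving traces can be reused as fine-tuning data. All names here (`sample_trajectory`, `predicted_index`) are hypothetical, not the repository's actual interface.

```python
def build_rejection_dataset(mllm, training_examples, num_samples=8):
    """Collect verified reasoning traces via rejection sampling (hypothetical API)."""
    kept = []
    for query, candidates, gold_index in training_examples:
        for _ in range(num_samples):
            trajectory = mllm.sample_trajectory(query, candidates)  # stochastic decode
            if trajectory.predicted_index == gold_index:            # reject wrong traces
                kept.append((query, candidates, trajectory))
    return kept  # fine-tune on these retrieval-verified trajectories
```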

Experiments across multiple benchmarks demonstrate significant improvements in retrieval accuracy (23.0% on average), along with gains in perception-driven reasoning reliability and generalization.

## Resources

- **Paper:** [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://huggingface.co/papers/2602.06034)
- **GitHub Repository:** [https://github.com/chendy25/V-Retrver](https://github.com/chendy25/V-Retrver)

## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{chen2026vretrver,
  title={V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval},
  author={Chen, Dongyang and Wang, Chaoyang and Su, Dezhao and Xiao, Xi and Zhang, Zeyu and Xiong, Jing and Li, Qing and Shang, Yuzhang and Ka, Shichao},
  journal={arXiv preprint arXiv:2602.06034},
  year={2026}
}
```