---
pipeline_tag: image-text-to-text
---

# V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

[V-Retrver](https://huggingface.co/papers/2602.06034) is an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection.

## About V-Retrver

V-Retrver enables a Multimodal Large Language Model (MLLM) to selectively acquire visual evidence during reasoning via external visual tools. It performs a **multimodal interleaved reasoning** process that alternates between hypothesis generation and targeted visual verification. 
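The interleaved loop described above can be sketched in a few lines of Python. Note that `propose`, `inspect`, and the candidate format below are illustrative stand-ins for the MLLM and the external visual tools, not the released implementation:

```python
# Toy sketch of evidence-driven interleaved reasoning (hypothetical stubs).

def propose(query, candidates):
    """Stub MLLM step: hypothesize the best match by caption-word overlap."""
    words = set(query.lower().split())
    return max(candidates, key=lambda c: len(words & set(c["caption"].lower().split())))

def inspect(candidate, attribute):
    """Stub visual tool: "look" at the image for a fine-grained attribute."""
    return attribute in candidate["details"]

def retrieve(query, candidates, required):
    """Alternate hypothesis generation with targeted visual verification."""
    pool = list(candidates)
    while pool:
        hypothesis = propose(query, pool)
        if all(inspect(hypothesis, a) for a in required):
            return hypothesis          # every required detail was verified
        pool.remove(hypothesis)        # evidence contradicts it; revise
    return None                        # no candidate survived verification

CANDIDATES = [
    {"id": "img_a", "caption": "a car", "details": ["blue"]},
    {"id": "img_b", "caption": "a car parked", "details": ["red"]},
]
print(retrieve("red car", CANDIDATES, required=["red"])["id"])  # img_b
```

The first hypothesis (`img_a`) looks plausible from its caption alone, but targeted inspection rejects it, and the loop revises toward the candidate whose visual evidence checks out.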

To train this agent, the authors adopted a curriculum-based learning strategy combining:
- **Supervised Reasoning Activation:** Initial activation of retrieval-specific reasoning.
- **Rejection-Based Refinement:** Improving reasoning reliability via rejection sampling.
- **Reinforcement Learning:** Fine-tuning with an evidence-aligned objective.

Experiments across multiple benchmarks demonstrate significant gains in retrieval accuracy (a 23.0% average improvement), perception-driven reasoning reliability, and generalization.

## Resources

- **Paper:** [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://huggingface.co/papers/2602.06034)
- **GitHub Repository:** [https://github.com/chendy25/V-Retrver](https://github.com/chendy25/V-Retrver)

## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{chen2026vretrver,
  title={V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval},
  author={Chen, Dongyang and Wang, Chaoyang and Su, Dezhao and Xiao, Xi and Zhang, Zeyu and Xiong, Jing and Li, Qing and Shang, Yuzhang and Ka, Shichao},
  journal={arXiv preprint arXiv:2602.06034},
  year={2026}
}
```