---
pipeline_tag: image-text-to-text
---

# V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

V-Retrver is an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection.

## About V-Retrver

V-Retrver enables a Multimodal Large Language Model (MLLM) to selectively acquire visual evidence during reasoning via external visual tools. It performs a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification.
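
The rough shape of this loop can be sketched as follows. This is a minimal illustration of the idea, not the released implementation: the `Step` structure and the function names (`mllm_step`, `zoom_in`, `retrieve`) are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str             # hypothesis produced by the MLLM
    tool_call: dict | None   # e.g. {"tool": "zoom_in", "box": [x0, y0, x1, y1]}
    answer: str | None       # final candidate, once evidence suffices

def mllm_step(query, candidates, evidence):
    """Stub for one MLLM decoding step: emit either a tool call or an answer."""
    if evidence:  # enough visual evidence gathered: commit to a candidate
        return Step("evidence supports the top candidate", None, candidates[0])
    return Step("the logo region is ambiguous, inspect it",
                {"tool": "zoom_in", "box": [10, 10, 120, 120]}, None)

def zoom_in(image, box):
    """Stub visual tool: return a cropped view as a new piece of evidence."""
    return {"crop_of": image, "box": box}

def retrieve(query, candidates, image, max_steps=4):
    """Alternate hypothesis generation and targeted visual verification."""
    evidence = []
    for _ in range(max_steps):
        step = mllm_step(query, candidates, evidence)
        if step.answer is not None:      # verification converged
            return step.answer
        evidence.append(zoom_in(image, step.tool_call["box"]))
    return candidates[0]                 # budget exhausted: keep top hypothesis

print(retrieve("product with the red logo", ["img_042", "img_007"], "query.jpg"))
```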

To train this agent, the authors adopt a curriculum-based learning strategy that combines:

- **Supervised Reasoning Activation**: supervised fine-tuning that activates retrieval-specific reasoning.
- **Rejection-Based Refinement**: improving reasoning reliability via rejection sampling (see the sketch after this list).
- **Reinforcement Learning**: fine-tuning with an evidence-aligned objective.
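
A hedged sketch of the rejection-sampling stage, assuming that a trajectory is kept only when its final prediction hits the gold target; `sample_trajectory`, `rejection_sample`, and the choice of `k=8` rollouts are stand-ins for illustration, not the V-Retrver API:

```python
import random

def sample_trajectory(query, gold):
    """Stub rollout: returns (reasoning_trace, predicted_target)."""
    pred = gold if random.random() < 0.5 else "wrong_candidate"
    return f"interleaved reasoning for {query!r}", pred

def rejection_sample(dataset, k=8):
    """Keep only trajectories whose final retrieval is correct."""
    kept = []
    for query, gold in dataset:
        for _ in range(k):                        # k rollouts per query
            trace, pred = sample_trajectory(query, gold)
            if pred == gold:                      # reject unreliable traces
                kept.append({"query": query, "trace": trace, "target": pred})
    return kept                                   # corpus for the refinement stage

corpus = rejection_sample([("product with the red logo", "img_042")])
print(len(corpus), "accepted trajectories")
```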

Experiments across multiple benchmarks demonstrate substantial gains in retrieval accuracy (a 23.0% average improvement), along with stronger perception-driven reasoning reliability and better generalization.

## Resources

- Paper: [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://arxiv.org/abs/2602.06034)

## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{chen2026vretrver,
  title={V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval},
  author={Chen, Dongyang and Wang, Chaoyang and Su, Dezhao and Xiao, Xi and Zhang, Zeyu and Xiong, Jing and Li, Qing and Shang, Yuzhang and Ka, Shichao},
  journal={arXiv preprint arXiv:2602.06034},
  year={2026}
}
```