---
pipeline_tag: image-text-to-text
---

# V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

[V-Retrver](https://huggingface.co/papers/2602.06034) is an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection.

## About V-Retrver

V-Retrver enables a Multimodal Large Language Model (MLLM) to selectively acquire visual evidence during reasoning via external visual tools. It performs a **multimodal interleaved reasoning** process that alternates between hypothesis generation and targeted visual verification.
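
The sketch below illustrates what such an interleaved loop could look like: the model alternates between free-form reasoning and calls to external visual tools until it commits to a relevance decision. This is a minimal, hypothetical sketch; the `mllm.generate` interface, the tool registry, and the `relevance_score` field are illustrative assumptions, not the API released in the repository.

```python
def agentic_retrieval_step(mllm, query, candidate_image, visual_tools, max_turns=4):
    """Score one (query, candidate) pair via interleaved hypothesis/verification turns.

    Hypothetical interface: `mllm.generate` returns either a tool request or a
    final relevance decision; `visual_tools` maps tool names (e.g. a crop-and-zoom
    tool) to callables that return image evidence.
    """
    context = [{"role": "user", "content": [query, candidate_image]}]
    for _ in range(max_turns):
        step = mllm.generate(context)              # hypothesis or tool request
        if step.tool_call is None:                 # model committed to an answer
            return step.relevance_score
        tool = visual_tools[step.tool_call.name]   # look up the requested tool
        evidence = tool(candidate_image, **step.tool_call.args)
        context.append({"role": "tool", "content": [evidence]})  # feed evidence back
    return mllm.generate(context).relevance_score  # out of turns: force a decision
```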

To train this agent, the authors adopted a curriculum-based learning strategy combining:

- **Supervised Reasoning Activation:** Initial activation of retrieval-specific reasoning.
- **Rejection-Based Refinement:** Improving reasoning reliability via rejection sampling (sketched after this list).
- **Reinforcement Learning:** Fine-tuning with an evidence-aligned objective.
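
As a rough illustration of the rejection-based stage, the sketch below samples several reasoning trajectories per training query and keeps only those whose final retrieval decision matches the ground truth, so the surviving traces can be reused as fine-tuning data. All names here (`sample_trajectory`, `predicted_index`) are hypothetical, not the repository's actual interface.

```python
def build_rejection_dataset(mllm, training_examples, num_samples=8):
    """Collect verified reasoning traces via rejection sampling (hypothetical API)."""
    kept = []
    for query, candidates, gold_index in training_examples:
        for _ in range(num_samples):
            trajectory = mllm.sample_trajectory(query, candidates)  # stochastic decode
            if trajectory.predicted_index == gold_index:            # reject wrong traces
                kept.append((query, candidates, trajectory))
    return kept  # fine-tune on these retrieval-verified trajectories
```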

Experiments across multiple benchmarks demonstrate significant improvements in retrieval accuracy (23.0% on average), along with gains in perception-driven reasoning reliability and generalization.

## Resources

- **Paper:** [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://huggingface.co/papers/2602.06034)
- **GitHub Repository:** [https://github.com/chendy25/V-Retrver](https://github.com/chendy25/V-Retrver)

## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{chen2026vretrver,
  title={V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval},
  author={Chen, Dongyang and Wang, Chaoyang and Su, Dezhao and Xiao, Xi and Zhang, Zeyu and Xiong, Jing and Li, Qing and Shang, Yuzhang and Ka, Shichao},
  journal={arXiv preprint arXiv:2602.06034},
  year={2026}
}
```