Add model card and metadata
This PR adds a model card for V-Retrver. It includes:
- Metadata for the `image-text-to-text` pipeline tag.
- Links to the research paper and the official GitHub repository.
- A summary of the model's "Evidence-Driven Agentic Reasoning" framework and curriculum-based training strategy.
- BibTeX citation information.
README.md
ADDED
@@ -0,0 +1,36 @@
---
pipeline_tag: image-text-to-text
---

# V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

[V-Retrver](https://huggingface.co/papers/2602.06034) is an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection.

## About V-Retrver

V-Retrver enables a Multimodal Large Language Model (MLLM) to selectively acquire visual evidence during reasoning via external visual tools. It performs a **multimodal interleaved reasoning** process that alternates between hypothesis generation and targeted visual verification.
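
As a rough illustration of that loop (this is not the released implementation; `generate_step`, the tool interface, and the `ranking`/`score_candidates` attributes are hypothetical placeholders), a minimal Python-style sketch might look like:

```python
# Illustrative sketch only: all names below are placeholders, not the V-Retrver API.

def retrieve(mllm, query, candidates, visual_tools, max_steps=6):
    """Evidence-driven interleaved reasoning: alternate between hypothesis
    generation and targeted visual verification until a ranking is produced."""
    context = [("query", query)]
    for _ in range(max_steps):
        # Hypothesis generation: the MLLM either proposes a final ranking
        # or asks for more visual evidence via an external tool.
        step = mllm.generate_step(context, candidates)
        if step.kind == "tool_call":
            # Targeted visual verification: e.g. zoom into / crop a candidate
            # image, then feed the gathered evidence back into the context.
            evidence = visual_tools[step.tool](step.arguments)
            context.append(("evidence", evidence))
        else:
            # Final answer grounded in the collected visual evidence.
            return step.ranking
    # Fallback: rank candidates with whatever evidence was collected.
    return mllm.score_candidates(context, candidates)
```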

To train this agent, the authors adopted a curriculum-based learning strategy that combines the following stages (a rough sketch follows the list):
- **Supervised Reasoning Activation:** Initial activation of retrieval-specific reasoning.
- **Rejection-Based Refinement:** Improving reasoning reliability via rejection sampling.
- **Reinforcement Learning:** Fine-tuning with an evidence-aligned objective.
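
As a rough sketch of how the last two stages could fit together (the trace fields, the correctness check, and the weighting `alpha` are assumptions for illustration, not the paper's exact formulation):

```python
# Illustrative sketch only: trace fields and reward weighting are assumed.

def rejection_filter(traces):
    """Rejection-Based Refinement: keep only sampled reasoning traces whose
    final retrieval decision is correct, then reuse them as training targets."""
    return [t for t in traces if t.predicted_target == t.gold_target]

def evidence_aligned_reward(trace, alpha=0.5):
    """Reinforcement Learning: reward correct retrieval, with a bonus when the
    cited visual evidence actually supports the decision."""
    correctness = 1.0 if trace.predicted_target == trace.gold_target else 0.0
    evidence_support = trace.grounded_claim_fraction  # assumed to lie in [0, 1]
    return correctness + alpha * evidence_support
```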

Experiments across multiple benchmarks demonstrate significant improvements in retrieval accuracy (averaging 23.0%), perception-driven reasoning reliability, and generalization.

## Resources

- **Paper:** [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://huggingface.co/papers/2602.06034)
- **GitHub Repository:** [https://github.com/chendy25/V-Retrver](https://github.com/chendy25/V-Retrver)

## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{chen2026vretrver,
  title={V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval},
  author={Chen, Dongyang and Wang, Chaoyang and Su, Dezhao and Xiao, Xi and Zhang, Zeyu and Xiong, Jing and Li, Qing and Shang, Yuzhang and Ka, Shichao},
  journal={arXiv preprint arXiv:2602.06034},
  year={2026}
}
```