Add model card and metadata for Vision-DeepResearch-8B

#1 by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +53 -3
README.md CHANGED
@@ -1,3 +1,53 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ base_model: Qwen/Qwen3-VL-8B-Instruct
+ tags:
+ - vision
+ - deep-research
+ - vdr-bench
+ - mllm
+ ---
+
+ # Vision-DeepResearch-8B (SFT-only)
+
+ Vision-DeepResearch-8B is a multimodal large language model (MLLM) optimized for deep-research tasks that require complex visual and textual search. It is trained with supervised fine-tuning (SFT) from [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
+
+ It was developed to address the limitations of existing VQA benchmarks by focusing on realistic visual retrieval and fact-finding scenarios.
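+
+ ## Usage
+
+ Below is a minimal inference sketch with 🤗 Transformers. It assumes the checkpoint follows the standard Qwen3-VL chat interface (`AutoProcessor` / `AutoModelForImageTextToText`) and a recent Transformers version; the repo id and image URL are placeholders, and the full deep-research (search-tool) pipeline lives in the GitHub repository linked below.
+
+ ```python
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+
+ # Placeholder repo id -- replace with the actual Hub id of this checkpoint.
+ model_id = "Osilly/Vision-DeepResearch-8B"
+
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # A single user turn pairing an image with a fact-finding question.
+ # Loading images from a "url" key requires a recent Transformers release.
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "url": "https://example.com/query_image.jpg"},
+             {"type": "text", "text": "Where and when was this photo taken?"},
+         ],
+     }
+ ]
+
+ # Tokenize the chat and preprocess the image in one call.
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ output_ids = model.generate(**inputs, max_new_tokens=512)
+ # Decode only the newly generated tokens, skipping the prompt.
+ answer = processor.batch_decode(
+     output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
+ )[0]
+ print(answer)
+ ```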
+
+ ## Project Resources
+
+ - **Project Page:** [https://osilly.github.io/Vision-DeepResearch/](https://osilly.github.io/Vision-DeepResearch/)
+ - **GitHub Repository:** [https://github.com/Osilly/Vision-DeepResearch](https://github.com/Osilly/Vision-DeepResearch)
+ - **Paper (VDR-Bench):** [Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models](https://huggingface.co/papers/2602.02185)
+ - **Paper (Vision-DeepResearch):** [Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models](https://arxiv.org/abs/2601.22060)
+
+ ## Performance
+
+ Vision-DeepResearch-8B demonstrates significant improvements over agentic and RAG-based workflows across several benchmarks:
+
+ | Model | VDR | MMSearch | LiveVQA | Avg. |
+ | :--- | :--- | :--- | :--- | :--- |
+ | Qwen3-VL-8B-Instruct (Agentic) | 17.0 | 52.0 | 63.0 | 40.1 |
+ | **Vision-DeepResearch-8B (Ours)** | **29.2** | **69.6** | **76.7** | **50.5** |
+
+ ## Citation
+
+ If you find this model or benchmark useful for your research, please cite:
+
+ ```bibtex
+ @article{zeng2026visiondeepresearch,
+   title={Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models},
+   author={Yu Zeng and Wenxuan Huang and Zhen Fang and Shuang Chen and Yufan Shen and Yishuo Cai and Xiaoman Wang and Zhenfei Yin and Lin Chen and Zehui Chen and Shiting Huang and Yiming Zhao and Yao Hu and Philip Torr and Wanli Ouyang and Shaosheng Cao},
+   journal={arXiv preprint arXiv:2601.22060},
+   year={2026}
+ }
+
+ @article{zeng2026vdrbench,
+   title={Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models},
+   author={Yu Zeng and Wenxuan Huang and Zhen Fang and Shuang Chen and Yufan Shen and Yishuo Cai and Xiaoman Wang and Zhenfei Yin and Lin Chen and Zehui Chen and Shiting Huang and Yiming Zhao and Yao Hu and Philip Torr and Wanli Ouyang and Shaosheng Cao},
+   journal={arXiv preprint arXiv:2602.02185},
+   year={2026}
+ }
+ ```