Add metadata, paper and code links to model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +31 -39
README.md CHANGED
@@ -1,3 +1,14 @@
 # WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

 <p align="center">
@@ -5,88 +16,69 @@
 </p>

 ![version](https://img.shields.io/badge/version-1.0.0-blue)

 ## πŸ₯‡ Introduction
 In this paper, we introduce **WebWatcher**, a multimodal agent for deep research that possesses enhanced visual-language reasoning capabilities. Our work presents a unified framework that combines complex vision-language reasoning with multi-tool interaction.

 <p align="center">
 <img src="./assets/webwatcher_main.png" alt="logo" width="80%"/>
 </p>

-
 Key features of our approach include:

 <p align="center">
 <img src="./assets/distribution_level.png" alt="logo" width="80%"/>
 </p>

- - BrowseComp-VL Benchmark: We propose a new benchmark, BrowseComp-VL, to evaluate the capabilities of multimodal agents. This challenging dataset is designed for in-depth multimodal reasoning and strategic planning, mirroring the complexity of BrowseComp but extending it into the visual domain. It emphasizes tasks that require both visual perception and advanced information-gathering abilities.

 <p align="center">
 <img src="./assets/data_pipelines.png" alt="logo" width="80%"/>
 </p>

- - Automated Trajectory Generation: To provide robust tool-use capabilities, we developed an automated pipeline to generate high-quality, multi-step reasoning trajectories. These trajectories, which are grounded in actual tool-use behavior and reflect procedural decision-making, are used for efficient cold-start training and further optimization via reinforcement learning. The agent is equipped with several tools, including Web Image Search, Web Text Search, Webpage Visit, Code Interpreter, and an internal OCR tool.

- - Superior Performance: WebWatcher significantly outperforms proprietary baselines, RAG workflows, and other open-source agents across four challenging VQA benchmarks: Humanity's Last Exam (HLE)-VL, BrowseComp-VL, LiveVQA, and MMSearch. The WebWatcher-32B model, in particular, achieves an average score of 18.2% on HLE, surpassing the GPT-4o-based OmniSearch baseline. It also achieves top-tier performance on LiveVQA (58.7%) and MMSearch (55.3%), demonstrating stable and superior results on demanding, real-world visual search benchmarks.

 ## πŸš€ Performance Highlights
 <p align="center">
 <img src="./assets/webwatcher_performance_general.png" alt="logo" width="80%"/>
 </p>

- 1. Complex Reasoning (HLE-VL): On the Human Life Exam (HLE-VL), a benchmark for multi-step complex reasoning, WebWatcher achieved a commanding lead with a Pass@1 score of 13.6%, substantially outperforming representative models including GPT-4o (9.8%), Gemini2.5-flash (9.2%), and Qwen2.5-VL-72B (8.6%).

- 2. Information Retrieval (MMSearch): In the MMSearch evaluation, WebWatcher demonstrated exceptional retrieval accuracy with a Pass@1 score of 55.3%, significantly surpassing Gemini2.5-flash (43.9%) and GPT-4o (24.1%), showcasing superior precision in retrieval tasks and robust information aggregation capabilities in complex scenarios.

- 3. Knowledge-Retrieval Integration (LiveVQA): On the LiveVQA benchmark, WebWatcher achieved a Pass@1 score of 58.7%, outperforming Gemini2.5-flash (41.3%), Qwen2.5-VL-72B (35.7%), and GPT-4o (34.0%).
-
- 4. Information Optimization and Aggregation (BrowseComp-VL): On BrowseComp-VL, the most comprehensively challenging benchmark, WebWatcher dominated with an average score of 27.0%, more than doubling the performance of mainstream models including GPT-4o (13.4%), Gemini2.5-flash (13.0%), and Claude-3.7 (11.2%).

 ## πŸ”§ Quick Start

- ### Step 1: Download the WebWatcher model
-
- You can download WebWatcher via Hugging Face [πŸ€— HuggingFace](https://huggingface.co/Alibaba-NLP/).
-
- ### Step 2: Data Preparation
-
- Before running inference, test set images need to be downloaded to the `infer/scripts_eval/images` folder. This can be accomplished by running `infer/scripts_eval/download_image.py`. If you encounter issues downloading images from our provided OSS URLs, please obtain the images from the original dataset source and place them in the corresponding `infer/scripts_eval/images` folder.
-
- ### Step 3: Inference
-
- Run `infer/scripts_eval/scripts/eval.sh` with the following required parameters:
-
- - **benchmark**: Name of the dataset to test. Available options: `'hle'`, `'gaia'`, `'livevqa'`, `'mmsearch'`, `'simplevqa'`, `'bc_vl_v1'`, `'bc_vl_v2'`. These test sets should be pre-stored in `infer/vl_search_r1/eval_data` with naming convention like `hle.jsonl`. We have provided format examples for some datasets in `infer/vl_search_r1/eval_data`. If extending to new datasets, please ensure consistent formatting.
- - **EXPERIMENT_NAME**: Name for this experiment (user-defined)
- - **MODEL_PATH**: Path to the trained model
- - **DASHSCOPE_API_KEY**: GPT API key
- - **IMG_SEARCH_KEY**: Google SerpApi key for image search
- - **JINA_API_KEY**: Jina API key
- - **SCRAPERAPI_KEY**: Scraper API key
- - **QWEN_SEARCH_KEY**: Google SerpApi key for text search
-
- **Note**: For image search tools, if you need to upload searched images to OSS, the following are required:
- - **ALIBABA_CLOUD_ACCESS_KEY_ID**: Alibaba Cloud OSS access key ID
- - **ALIBABA_CLOUD_ACCESS_KEY_SECRET**: Alibaba Cloud OSS access key secret
-
- ### Step 4: Evaluation

- Run `infer/vl_search_r1/pass3.sh` to use LLM-as-judge for evaluating Pass@3 and Pass@1 metrics. Parameters:

- - **DIRECTORY**: Path to the folder containing JSONL files generated from inference
- - **DASHSCOPE_API_KEY**: GPT API key

 ## πŸ“‘ Citation

 If this work is helpful, please kindly cite as:

- ```bigquery
 @article{geng2025webwatcher,
 title={WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent},
 author={Geng, Xinyu and Xia, Peng and Zhang, Zhen and Wang, Xinyu and Wang, Qiuchen and Ding, Ruixue and Wang, Chenxi and Wu, Jialong and Zhao, Yida and Li, Kuan and others},
 journal={arXiv preprint arXiv:2508.05748},
 year={2025}
 }
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - vision-language
+ - agent
+ - deep-research
+ ---
+
 # WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

 <p align="center">
 </p>

 ![version](https://img.shields.io/badge/version-1.0.0-blue)
+ [![Paper](https://img.shields.io/badge/Paper-2508.05748-red)](https://huggingface.co/papers/2508.05748)
+ [![GitHub](https://img.shields.io/badge/GitHub-Repo-green)](https://github.com/Alibaba-NLP/WebAgent)
+ [![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/)

 ## πŸ₯‡ Introduction
 In this paper, we introduce **WebWatcher**, a multimodal agent for deep research that possesses enhanced visual-language reasoning capabilities. Our work presents a unified framework that combines complex vision-language reasoning with multi-tool interaction.

+ This model was presented in the paper [WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent](https://huggingface.co/papers/2508.05748).
+
 <p align="center">
 <img src="./assets/webwatcher_main.png" alt="logo" width="80%"/>
 </p>

 Key features of our approach include:

 <p align="center">
 <img src="./assets/distribution_level.png" alt="logo" width="80%"/>
 </p>

+ - **BrowseComp-VL Benchmark**: We propose a new benchmark, BrowseComp-VL, to evaluate the capabilities of multimodal agents. This challenging dataset is designed for in-depth multimodal reasoning and strategic planning, mirroring the complexity of BrowseComp but extending it into the visual domain. It emphasizes tasks that require both visual perception and advanced information-gathering abilities.

 <p align="center">
 <img src="./assets/data_pipelines.png" alt="logo" width="80%"/>
 </p>

+ - **Automated Trajectory Generation**: To provide robust tool-use capabilities, we developed an automated pipeline to generate high-quality, multi-step reasoning trajectories. These trajectories, which are grounded in actual tool-use behavior and reflect procedural decision-making, are used for efficient cold-start training and further optimization via reinforcement learning. The agent is equipped with several tools, including Web Image Search, Web Text Search, Webpage Visit, Code Interpreter, and an internal OCR tool.

+ - **Superior Performance**: WebWatcher significantly outperforms proprietary baselines, RAG workflows, and other open-source agents across four challenging VQA benchmarks: Humanity's Last Exam (HLE)-VL, BrowseComp-VL, LiveVQA, and MMSearch. The WebWatcher-32B model achieves top-tier performance, demonstrating stable and superior results on demanding, real-world visual search benchmarks.

 ## πŸš€ Performance Highlights
 <p align="center">
 <img src="./assets/webwatcher_performance_general.png" alt="logo" width="80%"/>
 </p>

+ 1. **Complex Reasoning (HLE-VL)**: On Humanity's Last Exam (HLE-VL), WebWatcher achieved a Pass@1 score of 13.6%, substantially outperforming representative models including GPT-4o (9.8%) and Qwen2.5-VL-72B (8.6%).

+ 2. **Information Retrieval (MMSearch)**: WebWatcher demonstrated exceptional retrieval accuracy with a Pass@1 score of 55.3%, significantly surpassing Gemini2.5-flash (43.9%) and GPT-4o (24.1%).

+ 3. **Knowledge-Retrieval Integration (LiveVQA)**: On the LiveVQA benchmark, WebWatcher achieved a Pass@1 score of 58.7%, outperforming Gemini2.5-flash (41.3%) and GPT-4o (34.0%).

+ 4. **Information Optimization and Aggregation (BrowseComp-VL)**: On BrowseComp-VL, WebWatcher dominated with an average score of 27.0%, more than doubling the performance of mainstream models including GPT-4o (13.4%).

 ## πŸ”§ Quick Start

+ The model can be used with the `transformers` library. For detailed instructions on running the agent with its associated tools (web search, OCR, etc.), please refer to the [official GitHub repository](https://github.com/Alibaba-NLP/WebAgent).
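As a rough illustration, a request to an image-text-to-text model in `transformers` is usually assembled as a multimodal chat message before tokenization. This is a minimal sketch, not the official WebWatcher interface: the repo id below is an assumption, and the commented generation calls assume the checkpoint follows the standard `AutoProcessor`/`AutoModelForImageTextToText` pattern; check the Alibaba-NLP organization on the Hub for the released checkpoint name.

```python
# Sketch of preparing a multimodal chat request for an image-text-to-text model.
# NOTE: the repo id is a placeholder -- verify the actual checkpoint name under
# https://huggingface.co/Alibaba-NLP/ before use.
MODEL_ID = "Alibaba-NLP/WebWatcher-32B"  # assumed name

def build_messages(image_url: str, question: str) -> list:
    """Build a single-turn user message mixing an image and a text query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages(
    "https://example.com/chart.png",
    "What trend does this chart show?",
)

# With the checkpoint downloaded, generation would follow the usual pattern:
#   from transformers import AutoProcessor, AutoModelForImageTextToText
#   processor = AutoProcessor.from_pretrained(MODEL_ID)
#   model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, device_map="auto")
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True,
#       return_dict=True, return_tensors="pt",
#   )
#   output_ids = model.generate(**inputs, max_new_tokens=512)
print(messages[0]["role"])
```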

+ ### Inference

+ Running inference typically involves using the provided scripts in the repository:

+ 1. **Download the WebWatcher model** via Hugging Face.
+ 2. **Data Preparation**: Download test set images to the `infer/scripts_eval/images` folder.
+ 3. **Run Inference**: Use `infer/scripts_eval/scripts/eval.sh` with the required parameters (MODEL_PATH, API keys for tools, etc.).

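The steps above can be sketched as a shell session. The variable names follow the parameters documented for `eval.sh` and `pass3.sh`, but whether they are passed as environment variables or script arguments is an assumption here; check the scripts in the repository for the exact interface.

```shell
# Hypothetical setup for the evaluation scripts; all values are placeholders.
export EXPERIMENT_NAME="webwatcher_eval_run1"   # user-defined experiment name
export MODEL_PATH="/path/to/WebWatcher-32B"     # path to the trained model
export DASHSCOPE_API_KEY="sk-..."               # GPT API key (also used by the judge)
export IMG_SEARCH_KEY="serpapi-..."             # Google SerpApi key for image search
export QWEN_SEARCH_KEY="serpapi-..."            # Google SerpApi key for text search
export JINA_API_KEY="jina-..."
export SCRAPERAPI_KEY="scraper-..."

# 1) fetch test-set images, 2) run inference on one benchmark,
# 3) score Pass@1/Pass@3 with the LLM-as-judge script (uncomment to run):
# python infer/scripts_eval/download_image.py
# bash infer/scripts_eval/scripts/eval.sh        # benchmark: e.g. 'livevqa'
# bash infer/vl_search_r1/pass3.sh               # DIRECTORY=<inference output dir>
```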
 ## πŸ“‘ Citation

 If this work is helpful, please kindly cite as:

+ ```bibtex
 @article{geng2025webwatcher,
 title={WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent},
 author={Geng, Xinyu and Xia, Peng and Zhang, Zhen and Wang, Xinyu and Wang, Qiuchen and Ding, Ruixue and Wang, Chenxi and Wu, Jialong and Zhao, Yida and Li, Kuan and others},
 journal={arXiv preprint arXiv:2508.05748},
 year={2025}
 }
+ ```