---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language
- agent
- deep-research
---
# WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
## Introduction
WebWatcher is a multimodal agent for deep research with enhanced vision-language reasoning capabilities. Our work presents a unified framework that combines complex vision-language reasoning with multi-tool interaction.
This model was presented in the paper [WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent](https://arxiv.org/abs/2508.05748).
Key features of our approach include:
- **BrowseComp-VL Benchmark**: We propose a new benchmark, BrowseComp-VL, to evaluate the capabilities of multimodal agents. This challenging dataset is designed for in-depth multimodal reasoning and strategic planning, mirroring the complexity of BrowseComp while extending it into the visual domain. It emphasizes tasks that require both visual perception and advanced information-gathering abilities.
- **Automated Trajectory Generation**: To provide robust tool-use capabilities, we developed an automated pipeline that generates high-quality, multi-step reasoning trajectories. These trajectories, which are grounded in actual tool-use behavior and reflect procedural decision-making, are used for efficient cold-start training and further optimization via reinforcement learning. The agent is equipped with several tools, including Web Image Search, Web Text Search, Webpage Visit, Code Interpreter, and an internal OCR tool.
- **Superior Performance**: WebWatcher significantly outperforms proprietary baselines, RAG workflows, and other open-source agents across four challenging VQA benchmarks: Humanity's Last Exam (HLE)-VL, BrowseComp-VL, LiveVQA, and MMSearch. The WebWatcher-32B model achieves top-tier performance, demonstrating stable and superior results on demanding, real-world visual search benchmarks.
## Performance Highlights
- **Complex Reasoning (HLE-VL)**: On Humanity's Last Exam (HLE-VL), WebWatcher achieved a Pass@1 score of 13.6%, substantially outperforming representative models including GPT-4o (9.8%) and Qwen2.5-VL-72B (8.6%).
- **Information Retrieval (MMSearch)**: WebWatcher demonstrated exceptional retrieval accuracy with a Pass@1 score of 55.3%, significantly surpassing Gemini2.5-flash (43.9%) and GPT-4o (24.1%).
- **Knowledge-Retrieval Integration (LiveVQA)**: On the LiveVQA benchmark, WebWatcher achieved a Pass@1 score of 58.7%, outperforming Gemini2.5-flash (41.3%) and GPT-4o (34.0%).
- **Information Optimization and Aggregation (BrowseComp-VL)**: On BrowseComp-VL, WebWatcher led all compared models with an average score of 27.0%, more than doubling the performance of mainstream models such as GPT-4o (13.4%).
## Quick Start
The model can be used with the transformers library. For detailed instructions on running the agent with its associated tools (web search, OCR, etc.), please refer to the official GitHub repository.
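As a minimal loading sketch, assuming the checkpoint exposes the standard transformers image-text-to-text interface (the repository ID and image URL below are placeholders, not confirmed by this card):

```python
# Minimal loading sketch. Assumptions: the checkpoint follows the standard
# transformers image-text-to-text interface; the repo ID and image URL are placeholders.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="Alibaba-NLP/WebWatcher-32B",  # placeholder repository ID
    torch_dtype="auto",
    device_map="auto",
)

# Chat-style input mixing an image with a question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/question_image.png"},  # placeholder image
            {"type": "text", "text": "Describe this image and answer: what event does it depict?"},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=256)
print(outputs[0]["generated_text"])
```

Note that this standalone call only exercises the base vision-language model; the full deep-research behavior (web search, webpage visits, OCR, code execution) requires the tool setup described in the GitHub repository.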
### Inference
Running inference typically involves using the scripts provided in the repository (see the sketch after this list):
- **Download Model**: Download the WebWatcher model via Hugging Face.
- **Data Preparation**: Download the test set images to the `infer/scripts_eval/images` folder.
- **Run Inference**: Use `infer/scripts_eval/scripts/eval.sh` with the required parameters (MODEL_PATH, API keys for tools, etc.).
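A hypothetical sketch of driving that evaluation script from Python is shown below; the environment-variable names (`MODEL_PATH`, the search-tool API key) and the way `eval.sh` consumes them are assumptions, so consult the GitHub repository for the actual interface.

```python
# Hypothetical driver for the evaluation script. The environment-variable names
# below (MODEL_PATH, SEARCH_API_KEY) are assumptions; check the repository's
# eval.sh for the parameters it actually expects.
import os
import subprocess

env = os.environ.copy()
env["MODEL_PATH"] = "/path/to/WebWatcher-32B"      # local checkpoint path (placeholder)
env["SEARCH_API_KEY"] = "your-search-api-key"      # placeholder key for the web search tools

subprocess.run(
    ["bash", "infer/scripts_eval/scripts/eval.sh"],
    env=env,
    check=True,
)
```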
## Citation
If you find this work helpful, please cite:
```bibtex
@article{geng2025webwatcher,
  title={WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent},
  author={Geng, Xinyu and Xia, Peng and Zhang, Zhen and Wang, Xinyu and Wang, Qiuchen and Ding, Ruixue and Wang, Chenxi and Wu, Jialong and Zhao, Yida and Li, Kuan and others},
  journal={arXiv preprint arXiv:2508.05748},
  year={2025}
}
```