arxiv:2605.16079

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Published on May 15

Abstract

VideoSeeker introduces a novel paradigm for instance-level video understanding by integrating agentic reasoning with visual prompts, achieving superior performance through automated data synthesis and reinforcement learning.

AI-generated summary

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

Community

gaotiexinqu (paper submitter)

To address the ambiguity of purely text-based references in existing agentic RL approaches for video, we propose a multi-turn framework that integrates visual tools with video understanding tasks, enabling active perception through visual prompts and on-demand retrieval. We further develop a four-stage data pipeline to construct large-scale instance-level video understanding data, which lets the model internalize tool usage and active perception capabilities. Our method achieves an average performance improvement of 13.7% over baselines, surpassing advanced closed-source models such as GPT-4o and Gemini-2.5-Pro, while also generalizing well to general video understanding benchmarks.
Project Page: https://gaotiexinqu.github.io/VideoSeeker/
Code: https://github.com/gaotiexinqu/VideoSeeker
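
The multi-turn, on-demand retrieval idea described above can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not the VideoSeeker implementation: the tool name `retrieve_segment`, the `ToolCall`/`Answer` types, the `run_agent` loop, and the hand-written `toy_policy` are all hypothetical stand-ins, and a real system would use a trained LVLM as the policy and actual video frames instead of labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Frame:
    t: float    # timestamp in seconds
    label: str  # stand-in for the frame's visual content


@dataclass
class ToolCall:
    name: str
    args: Dict[str, float]


@dataclass
class Answer:
    text: str


Action = Union[ToolCall, Answer]
Context = List[Tuple[str, object]]


def retrieve_segment(video: List[Frame], start: float, end: float) -> List[Frame]:
    """Hypothetical tool: return the frames with start <= t <= end."""
    return [f for f in video if start <= f.t <= end]


# Registry of tools the policy may invoke (assumed, for illustration only).
TOOLS: Dict[str, Callable[..., List[Frame]]] = {"retrieve_segment": retrieve_segment}


def run_agent(policy: Callable[[Context], Action], video: List[Frame],
              question: str, max_turns: int = 4) -> Tuple[str, Context]:
    """Alternate between policy decisions and tool observations."""
    context: Context = [("question", question)]
    for _ in range(max_turns):
        action = policy(context)
        if isinstance(action, Answer):
            return action.text, context
        # Execute the requested tool and feed the result back to the policy.
        observation = TOOLS[action.name](video, **action.args)
        context.append(("tool_call", action))
        context.append(("observation", observation))
    return "no answer within turn budget", context


def toy_policy(context: Context) -> Action:
    """Hand-written stand-in for a model: fetch one segment, then answer."""
    observations = [item for kind, item in context if kind == "observation"]
    if not observations:
        return ToolCall("retrieve_segment", {"start": 2.0, "end": 3.0})
    labels = [f.label for f in observations[-1]]
    return Answer("seen: " + ", ".join(labels))


video = [Frame(float(t), f"frame{t}") for t in range(6)]
answer, trace = run_agent(toy_policy, video, "What happens around t=2s?")
print(answer)
```

The loop keeps the policy and the tools separate, so the policy only sees evidence it explicitly asked for; that separation is the point of the sketch, not a claim about how the paper structures its training or inference.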
