arxiv:2606.19341

Native Active Perception as Reasoning for Omni-Modal Understanding

Published on Jun 17

Submitted by Zhenghao Xing on Jun 18

Qwen



Abstract

OmniAgent is an omni-modal agent for long video understanding that runs an iterative observation-thought-action cycle with active perception. By processing video selectively, it outperforms larger models; for example, the 7B agent beats Qwen2.5-VL-72B on LVBench.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).


Community

harryhsing (paper submitter), about 2 hours ago, edited about 2 hours ago

TL;DR: OmniAgent is, to our knowledge, the first native omni-modal agent that turns video understanding into active perception. Instead of watching every frame, it runs an Observation–Thought–Action loop, fetching only the frames, audio, or clips it needs and distilling each percept into a persistent textual memory — so reasoning cost is decoupled from video length and it scales to hour-long videos.
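The loop described in the TL;DR can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the names `Action`, `Step`, `run_episode`, `policy`, and `environment` are all hypothetical, and the action kinds are placeholders for whatever the actual instruction template defines. The point it tries to show is that each turn sees only the question, the textual memory, and the latest percept, never the whole video.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical illustrations, not identifiers
# from the OmniAgent codebase.

@dataclass
class Action:
    kind: str            # assumed kinds: "get_frames", "get_audio", "get_clip", "answer"
    start: float = 0.0   # start of the requested span, in seconds
    end: float = 0.0     # end of the requested span, in seconds
    answer: str = ""     # only meaningful when kind == "answer"

@dataclass
class Step:
    note: str            # text distilled from the latest percept ("" if none)
    action: Action       # what the agent wants to do next

# policy(question, memory, latest_percept) -> Step
Policy = Callable[[str, List[str], object], Step]
# environment(action) -> raw media for the requested span
Environment = Callable[[Action], object]

def run_episode(
    question: str,
    policy: Policy,
    environment: Environment,
    max_turns: int,
) -> Tuple[str, List[str], int]:
    """Run an observe-think-act loop and return (answer, memory, turns_used)."""
    memory: List[str] = []   # persistent textual memory
    percept = None           # nothing has been observed before the first turn
    for turn in range(1, max_turns + 1):
        # Thought: the model reads the question, memory, and latest percept only.
        step = policy(question, memory, percept)
        if step.note:
            memory.append(step.note)   # distill the percept into text
        if step.action.kind == "answer":
            return step.action.answer, memory, turn   # stop early with an answer
        # Action: fetch only the requested raw media from the environment.
        percept = environment(step.action)
    return "", memory, max_turns       # turn budget exhausted without an answer
```

In this sketch the raw percept is discarded after one turn and only its textual note persists, which is one way to read "reasoning cost is decoupled from video length"; the actual system may retain context differently.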

A few things we think are worth a look:

Single native model, no tool orchestration — the environment only returns raw media (frames / audio / clips); all perception and reasoning happen inside one model, not external captioners or detectors.
TAURA — our RL objective uses turn-level entropy to steer credit toward pivotal "discovery" turns, addressing the advantage-homogenization problem vanilla GRPO has in multi-turn agents.
Positive test-time scaling — accuracy keeps climbing as we raise the maximum turn limit K (+6.2% on VideoMME-Long), while the actual turns saturate (~11.7): the agent stops once it has enough evidence rather than exhausting the budget.
Large temporal-grounding gains — on audio-visual grounding, LongVALE IoU jumps from 5.7 to 39.1 (+33.4), where on-demand sampling pins down events a single global pass misses.
State-of-the-art among open-source models — across ten benchmarks; notably, a 7B agent beats the 10× larger Qwen2.5-VL-72B on LVBench (50.5 vs. 47.3) using ~73% fewer frames.
[Figure: A live inference trace from the web demo.]
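To make the TAURA item in the list above more concrete, here is a rough sketch of the advantage-homogenization issue and of one possible entropy-based rescaling. The exact TAURA formula is not given on this page, so the weighting below (each turn's entropy divided by the trajectory's mean turn entropy) is purely an assumption for illustration, as are the function names. Only the group-normalized advantage follows the standard GRPO form.

```python
import math
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each trajectory's reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def homogeneous_turn_advantages(advantage: float, num_turns: int) -> List[float]:
    """Vanilla multi-turn GRPO: every turn inherits the same trajectory advantage."""
    return [advantage] * num_turns

def entropy_rescaled_turn_advantages(
    advantage: float,
    turn_entropies: List[float],
    eps: float = 1e-6,
) -> List[float]:
    """ASSUMED form, not the paper's formula: scale the trajectory advantage
    by each turn's entropy relative to the trajectory's mean turn entropy,
    so higher-entropy turns receive a larger share of the credit."""
    mean_h = sum(turn_entropies) / len(turn_entropies)
    return [advantage * h / (mean_h + eps) for h in turn_entropies]

# Two sampled trajectories for the same question: one correct, one wrong.
advs = group_advantages([1.0, 0.0])

# Hypothetical per-turn entropies for the correct trajectory (e.g. the mean
# token entropy of each turn's generated text); the second turn is the most
# uncertain one.
entropies = [0.1, 0.3, 0.2]

flat = homogeneous_turn_advantages(advs[0], len(entropies))
scaled = entropy_rescaled_turn_advantages(advs[0], entropies)
```

Under this assumed weighting the per-turn advantages still average to roughly the trajectory advantage, but they are no longer identical across turns, which is the property the TAURA description points at.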

We've released the code, both SFT and RL checkpoints, and the full agent instruction template. The appendix also walks through complete reasoning traces — the agent browsing, listening, and watching its way to an answer — if you'd like to see active perception in action. We'll be around in the comments; questions and feedback very welcome!

Code: https://github.com/harryhsing/OmniAgent

[Figure: The OmniAgent framework.]