arxiv:2511.18242

EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning

Published on Jun 30

Authors:

Abstract

EgoVITA is a framework for egocentric video understanding that uses a plan-then-verify approach with cross-perspective feedback to improve reasoning accuracy and generalization.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Egocentric video understanding requires procedural reasoning under partial observability and continuously shifting viewpoints. Current multimodal large language models (MLLMs) struggle with this setting, often generating plausible but visually inconsistent or weakly grounded responses. We introduce EgoVITA, a framework that decomposes egocentric video reasoning into a structured plan-then-verify process. The model first generates an egocentric plan: a causal sequence of anticipated actions from a first-person perspective. An exocentric verification stage then evaluates this plan by reasoning from a third-person perspective over the same video, checking the plan's spatiotemporal and logical consistency without requiring any exocentric video input. This decomposition enables cross-perspective feedback without requiring paired ego-exo supervision. To train this reasoning process, we adopt Group Relative Policy Optimization (GRPO) with two dense reward signals: one that grounds anticipated actions in subsequent visual observations and another that reinforces consistent third-person verification. EgoVITA achieves state-of-the-art performance on egocentric reasoning benchmarks, outperforming Qwen2.5-VL-7B by +7.7 on EgoBlind and +4.4 on EgoOrient, while maintaining strong generalization on exocentric video tasks with only 52k training samples.
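As a rough illustration of the training signal described in the abstract, the sketch below shows the group-relative normalization step that GRPO is generally built on: several responses are sampled for the same input, each is scored, and each score is standardized against its own group. This is not the authors' implementation. The function names, the equal weighting of the two rewards, the example scores, and the use of the population standard deviation are all assumptions made for illustration; the abstract does not say how the grounding and verification rewards are computed or combined.

```python
from statistics import mean, pstdev


def combined_reward(grounding, verification, w_ground=0.5, w_verify=0.5):
    """Blend the two reward signals into one scalar.

    Hypothetical weighting: the abstract names the two rewards but does not
    specify how they are combined.
    """
    return w_ground * grounding + w_verify * verification


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for a group of 4 responses sampled for one video-question pair.
grounding_scores = [0.9, 0.4, 0.6, 0.1]      # plan grounded in later frames
verification_scores = [0.8, 0.5, 0.3, 0.2]   # consistency of third-person check

rewards = [
    combined_reward(g, v)
    for g, v in zip(grounding_scores, verification_scores)
]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.2f}")
```

Responses that score above their group's average receive a positive advantage and are reinforced; those below average receive a negative one. Because the baseline comes from the group itself, no separate value model is needed.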
