EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning
Abstract
EgoVITA is a framework for egocentric video understanding that uses a plan-then-verify approach with cross-perspective feedback to improve reasoning accuracy and generalization.
Egocentric video understanding requires procedural reasoning under partial observability and continuously shifting viewpoints. Current multimodal large language models (MLLMs) struggle in this setting, often generating responses that are plausible but visually inconsistent or weakly grounded. We introduce EgoVITA, a framework that decomposes egocentric video reasoning into a structured plan-then-verify process. The model first generates an egocentric plan: a causal sequence of anticipated actions from a first-person perspective. An exocentric verification stage then checks the plan's spatiotemporal and logical consistency by reasoning from a third-person perspective over the same video, without any exocentric video input. This decomposition enables cross-perspective feedback without requiring paired ego-exo supervision. To train this reasoning process, we adopt Group Relative Policy Optimization (GRPO) with two dense reward signals: one that grounds anticipated actions in subsequent visual observations and another that reinforces consistent third-person verification. Trained on only 52k samples, EgoVITA achieves state-of-the-art performance on egocentric reasoning benchmarks, outperforming Qwen2.5-VL-7B by +7.7 on EgoBlind and +4.4 on EgoOrient, while maintaining strong generalization on exocentric video tasks.
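To make the training signal described in the abstract more concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is built on: each sampled response is scored, and its advantage is its reward normalized by the mean and standard deviation of its group. The abstract does not say how the two dense rewards are combined, so the weighted sum below, the function names, the weights, and the use of the population standard deviation are all illustrative assumptions and not the authors' implementation.

```python
from statistics import mean, pstdev


def combined_reward(grounding, verification, w_ground=0.5, w_verify=0.5):
    """Hypothetical combination of the two reward signals.

    ASSUMPTION: a weighted sum with equal weights; the abstract does not
    specify how the grounding and verification rewards are merged.
    """
    return w_ground * grounding + w_verify * verification


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    ASSUMPTION: population standard deviation; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled responses for one video-question pair, each with
# made-up (grounding, verification) reward values in [0, 1].
samples = [(0.9, 0.8), (0.4, 0.6), (0.1, 0.2), (0.7, 0.3)]
rewards = [combined_reward(g, v) for g, v in samples]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and are reinforced; those below receive negative advantages. When every response in a group gets the same reward, all advantages are zero and the group contributes no learning signal.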