Abstract
Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration of Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12K, 11904 × 5952) panoramic images with over 12,000 carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system that tackles the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation due to the unique challenges of panoramic vision. In contrast, our PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
Community
In this work, we are the first to explore affordance prediction in panoramic environments.
🔥On the data side, we construct PAP-12K, the first ultra-high-resolution affordance prediction dataset specifically designed for panoramic scenes. PAP-12K contains over 12,000 high-quality, reasoning-centric question–answer pairs collected from real-world panoramas, each manually annotated.
🔥On the algorithm side, we propose PAP, a training-free panoramic affordance prediction framework inspired by human visual perception. Built on three key components (Recursive Visual Routing, Adaptive Gaze, and Cascaded Affordance Grounding), PAP effectively overcomes three challenges unique to panoramic vision: geometric distortion, boundary discontinuity, and extreme scale variation. Without any fine-tuning, PAP achieves state-of-the-art performance on panoramic affordance prediction and demonstrates strong accuracy and robustness, especially on cross-boundary targets and tiny objects.
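To picture the coarse-to-fine idea, here is an illustrative sketch of the routing and gaze steps. This is not the paper's released implementation: the function names, the `pick_cell` router callback (standing in for a vision-language model prompted with a grid overlay), and the choice of gnomonic (tangent-plane) reprojection for gaze rectification are our own assumptions.

```python
import numpy as np

def recursive_route(image, query, pick_cell, grid=3, min_size=512):
    """Coarse-to-fine zoom via grid prompting: repeatedly ask a
    (user-supplied) vision-language router which grid cell contains
    the queried target, crop to that cell, and recurse until the
    crop is small enough to inspect at full detail."""
    crop = image
    while max(crop.shape[:2]) > min_size:
        h, w = crop.shape[:2]
        idx = pick_cell(crop, query, grid)        # cell index in [0, grid*grid)
        r, c = divmod(idx, grid)
        crop = crop[r * h // grid:(r + 1) * h // grid,
                    c * w // grid:(c + 1) * w // grid]
    return crop

def gaze_rectify(pano, lon, lat, fov_deg=90.0, out_size=256):
    """Rectify local distortion by gnomonic (tangent-plane) reprojection:
    sample a perspective patch looking along gaze direction (lon, lat),
    in radians, from an equirectangular panorama of shape (H, W, C).
    The modulo on the horizontal index wraps across the panorama seam."""
    H, W = pano.shape[:2]
    f = 0.5 * out_size / np.tan(0.5 * np.radians(fov_deg))  # focal length (px)
    u, v = np.meshgrid(np.arange(out_size) - out_size / 2,
                       np.arange(out_size) - out_size / 2)
    d = np.stack([u, v, np.full(u.shape, f)], axis=-1)      # camera-frame rays
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    Ry = np.array([[np.cos(lon), 0, np.sin(lon)],           # yaw to gaze lon
                   [0, 1, 0],
                   [-np.sin(lon), 0, np.cos(lon)]])
    Rx = np.array([[1, 0, 0],                               # pitch to gaze lat
                   [0, np.cos(lat), -np.sin(lat)],
                   [0, np.sin(lat), np.cos(lat)]])
    d = d @ (Ry @ Rx).T                                     # rotate rays to gaze
    lam = np.arctan2(d[..., 0], d[..., 2])                  # ray longitude
    phi = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))          # ray latitude
    px = ((lam / (2 * np.pi) + 0.5) * W).astype(int) % W    # wraps the seam
    py = np.clip(((phi / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[py, px]
```

Because the gaze sampler indexes longitudes modulo the panorama width, a target straddling the left/right image boundary is reassembled into a single continuous perspective view, which is one way such pipelines sidestep the boundary-discontinuity problem.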
Both our ultra-high-resolution dataset and the complete inference code have been open-sourced. We warmly welcome you to check out the project. Thank you for your support!
GitHub: https://github.com/EnVision-Research/PAP
Project Page: https://zixinzhang02.github.io/Panoramic-Affordance-Prediction/
Paper: https://arxiv.org/abs/2603.15558