arxiv:2605.15764

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Published on May 15

Submitted by Junho Kim on May 19

University of Illinois at Urbana-Champaign


Authors:

Junho Kim

Abstract

AI-generated summary: GRASP is a large-scale social reasoning dataset that connects high-level social questions with fine-grained gaze and gesture events. It is accompanied by the Social Grounding Reward, a learning signal intended to improve multimodal models' understanding of social interactions.

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.


Community

arkimjh (paper author, paper submitter), about 2 hours ago

Current MLLMs can describe scenes, but still struggle to understand who is interacting with whom.

We introduce GRASP, a large-scale benchmark for grounded social reasoning built from gaze, gesture, and interaction events in multi-person videos. GRASP enables supervision for temporally grounded, participant-aware social reasoning, and our Social Grounding Reward (SGR) further improves evidence-aware reasoning in video MLLMs.
