MTGS-static-gazefollow

Gaze following and social gaze prediction are fundamental tasks providing insights into human communication behaviors, intent, and social interactions. Most previous approaches addressed these tasks separately, either by designing highly specialized social gaze models that do not generalize to other social gaze tasks or by considering social gaze inference as an ad-hoc post-processing of the gaze following task. Furthermore, the vast majority of gaze following approaches have proposed models that can handle only one person at a time and are static, therefore failing to take advantage of social interactions and temporal dynamics. In this paper, we address these limitations and introduce a novel framework to jointly predict the gaze target and social gaze label for all people in the scene. It comprises a temporal, transformer-based architecture that, in addition to frame tokens, handles person-specific tokens capturing the gaze information related to each individual. We demonstrate that our model can address and benefit from training on all tasks jointly, achieving state-of-the-art results for multi-person gaze following and social gaze prediction.

Overview

  • Training: MTGS-static-gazefollow was trained on GazeFollow
  • Parameters: 155M
  • Task: MTGS performs multi-person gaze following and social gaze prediction in images and videos. Given an image or a video frame with multiple people, the model predicts where each person is looking in the scene (gaze following) and infers pair-wise social gaze interactions among individuals (social gaze prediction). Specifically, we tackle three social gaze prediction tasks:
    1. Looking at Heads (LAH): whether a person is looking at another person's head.
    2. Looking at Each Other (LAEO): whether two people are looking at each other.
    3. Shared Attention (SA): whether two people are looking at the same target in the scene.
  • Framework: PyTorch Lightning
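The three social gaze labels above have a simple geometric reading. The sketch below illustrates how such labels relate to per-person gaze points and head boxes; it is an illustrative post-processing assumption (hypothetical thresholds and geometry), not the model's actual prediction head, which predicts these labels jointly:

```python
import numpy as np

# Illustrative sketch only: derive pairwise LAH / LAEO / SA labels from
# per-person gaze points and head boxes. The threshold and box test are
# assumptions for illustration; MTGS predicts these labels directly.

def in_box(point, box):
    """Check whether a 2D point (x, y) falls inside a box (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def social_gaze_labels(gaze_points, head_boxes, sa_thresh=0.1):
    """Compute pairwise social gaze labels in normalized image coordinates.

    gaze_points: (N, 2) array of predicted gaze targets.
    head_boxes:  (N, 4) array of head boxes (x1, y1, x2, y2).
    Returns boolean (N, N) matrices: lah, laeo, sa.
    """
    n = len(gaze_points)
    # LAH: person i's gaze point lands inside person j's head box.
    lah = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j:
                lah[i, j] = in_box(gaze_points[i], head_boxes[j])
    # LAEO: mutual looking, i.e. LAH holds in both directions.
    laeo = lah & lah.T
    # SA: the two gaze targets are close together in the scene.
    diffs = gaze_points[:, None, :] - gaze_points[None, :, :]
    sa = np.linalg.norm(diffs, axis=-1) < sa_thresh
    np.fill_diagonal(sa, False)
    return lah, laeo, sa
```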

License

Copyright (c) 2026 Idiap Research Institute

CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International) This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Usage

For minimal code to instantiate the model and perform inference, refer to the instructions in the official repository: https://github.com/idiap/MTGS
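A common final decoding step in gaze following converts a predicted gaze heatmap into a single gaze point by taking its argmax. The helper below assumes such a heatmap output; it is a generic sketch, not the repository's exact API:

```python
import numpy as np

# Assumed output format: a single-channel gaze heatmap per person.
# This decoding helper is a generic sketch, not the MTGS repository API.

def heatmap_to_point(heatmap):
    """Return the (x, y) location of the heatmap maximum, normalized to [0, 1].

    heatmap: (H, W) array of gaze target scores.
    """
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Offset by 0.5 so the point sits at the center of the peak cell.
    return (x + 0.5) / w, (y + 0.5) / h
```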

Model performance

| Method | Resolution | AUC ↑ | Avg. Dist. ↓ | Min. Dist. ↓ |
|---|---|---|---|---|
| Gaze-LLE [CVPR'25] | 448×448 | 0.956 | 0.104 | 0.045 |
| MTGS-static [NeurIPS'24] | 224×224 | 0.929 | 0.116 | 0.059 |
| MTGS-DINO-static | 224×224 | 0.944 | 0.101 | 0.045 |
| MTGS-DINO-static** | 448×448 | 0.948 | 0.094 | 0.041 |

** this is the model being published
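On the GazeFollow benchmark, each test instance has gaze annotations from several annotators: Avg. Dist. is the L2 distance between the prediction and the average annotated gaze point, and Min. Dist. is the distance to the closest single annotation. The helper below is an illustrative sketch of this standard protocol, not the repository's evaluation code:

```python
import numpy as np

# Illustrative sketch of the standard GazeFollow distance metrics,
# not the official evaluation code of this repository.

def gazefollow_distances(pred, annotations):
    """Compute Avg. Dist. and Min. Dist. for one test instance.

    pred:        (2,) predicted gaze point in normalized coordinates.
    annotations: (K, 2) gaze points from K annotators, same coordinates.
    """
    annotations = np.asarray(annotations, dtype=float)
    # Avg. Dist.: distance from the prediction to the mean annotation.
    avg_dist = np.linalg.norm(annotations.mean(axis=0) - pred)
    # Min. Dist.: distance to the closest individual annotation.
    min_dist = np.linalg.norm(annotations - pred, axis=1).min()
    return avg_dist, min_dist
```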

Citation

If you use this model, please cite the following publication:

@article{gupta2024mtgs,
  title={MTGS: A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction},
  author={Gupta, Anshul and Tafasca, Samy and Farkhondeh, Arya and Vuillecard, Pierre and Odobez, Jean-Marc},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={15646--15673},
  year={2024}
}