JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments
3D AV-LLM leveraging RGB-D and First-Order Ambisonics for end-to-end grounding and spatial reasoning
Abstract
We present JAEGER, a framework that extends audio-visual LLMs from 2D to 3D space through joint RGB-D observations and multi-channel first-order ambisonics (FOA), enabling reliable source localization and spatial reasoning in complex 3D environments. At its core is the Neural Intensity Vector (Neural IV), a learned spatial audio representation that produces robust directional cues even under reverberation and overlapping sources. We further release SpatialSceneQA, a 61k-sample benchmark with degree-level azimuth/elevation supervision for direction-of-arrival estimation, 3D box grounding, and multi-speaker matching.
Overview of the JAEGER framework.
SpatialSceneQA Dataset
A 61k-sample spatial audio-visual benchmark pairing RGB-D observations with 4-channel first-order ambisonic audio under degree-level 3D supervision. SpatialSceneQA is released in this repository under datasets/SpatialSceneQA/, with train, validation, and test archives for the HM3D FOA audio-visual setting.
Data construction pipeline of SpatialSceneQA.
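Degree-level azimuth/elevation labels make it natural to score a direction-of-arrival prediction by the angle between the predicted and reference directions. The snippet below is a minimal sketch of that idea, not the benchmark's official scoring code: the function names, the `(azimuth, elevation)` tuple layout, and the axis convention (azimuth measured in the x-y plane, elevation measured up from it) are assumptions made for illustration.

```python
import numpy as np


def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert an (azimuth, elevation) pair in degrees to a 3D unit vector.

    Assumed convention: azimuth in the x-y plane, elevation up from that plane.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    return np.array([
        np.cos(el) * np.cos(az),
        np.cos(el) * np.sin(az),
        np.sin(el),
    ])


def angular_error_deg(pred, target):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs."""
    u = doa_to_unit_vector(*pred)
    v = doa_to_unit_vector(*target)
    # Clip guards against rounding pushing the dot product outside [-1, 1].
    cos_angle = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.rad2deg(np.arccos(cos_angle)))
```

Comparing unit vectors rather than subtracting angles handles the azimuth wrap-around (350° and 10° are 20° apart) and avoids overweighting azimuth errors near the poles, where a large azimuth difference corresponds to a small change in direction.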
Checkpoints
JAEGER checkpoints are released under checkpoints/, including Classical IV and Neural IV variants for audio-only, audio-visual, and visual-grounding tasks.
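For readers unfamiliar with the term, a classical intensity vector can be computed directly from the four FOA channels: the omnidirectional channel is multiplied (conjugated, in the time-frequency domain) with each directional channel, and the real part gives a vector pointing toward the source. The sketch below illustrates that textbook computation under idealized conditions (one source, no reverberation). It is not the code behind the released checkpoints, and it does not show Neural IV, which is a learned representation. The channel names `w`, `x`, `y`, `z`, the encoding gains in `encode_plane_wave`, and all function names are assumptions for illustration; real recordings may use a different channel order or normalization.

```python
import numpy as np


def stft(signal, n_fft=512, hop=256):
    """Hann-windowed short-time Fourier transform, shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([
        signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)


def encode_plane_wave(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as an ideal FOA plane wave (assumed gains)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    w = mono
    x = mono * np.cos(el) * np.cos(az)
    y = mono * np.cos(el) * np.sin(az)
    z = mono * np.sin(el)
    return w, x, y, z


def classical_iv_doa(w, x, y, z):
    """Estimate (azimuth, elevation) in degrees from FOA channels."""
    W, X, Y, Z = (stft(channel) for channel in (w, x, y, z))
    # Intensity components: Re{conj(W) * directional channel},
    # pooled over all time-frequency bins.
    ix = np.sum(np.real(np.conj(W) * X))
    iy = np.sum(np.real(np.conj(W) * Y))
    iz = np.sum(np.real(np.conj(W) * Z))
    azimuth = np.rad2deg(np.arctan2(iy, ix))
    elevation = np.rad2deg(np.arctan2(iz, np.hypot(ix, iy)))
    return float(azimuth), float(elevation)


# Usage: a synthetic noise source placed at azimuth 40°, elevation -15°.
rng = np.random.default_rng(0)
mono = rng.standard_normal(16000)
azimuth, elevation = classical_iv_doa(*encode_plane_wave(mono, 40.0, -15.0))
```

Pooling over every time-frequency bin works here only because the synthetic scene has one clean source; with reverberation or overlapping sources the per-bin vectors disagree, which is the regime the abstract says Neural IV is designed for.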
Citation
@inproceedings{liu2026jaeger,
title={JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments},
author={Liu, Zhan and Tang, Changli and Wang, Yuxin and Zhu, Zhiyuan and Chen, Youjun and Shao, Yiwen and Wang, Tianzi and Ke, Lei and Jin, Zengrui and Zhang, Chao},
booktitle={Proc. ICML},
year={2026}
}