JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

3D AV-LLM leveraging RGB-D and First-Order Ambisonics for end-to-end grounding and spatial reasoning


📖 Abstract

We present JAEGER, a framework that extends audio-visual LLMs from 2D to 3D space through joint RGB-D observations and multi-channel first-order ambisonics (FOA), enabling reliable source localization and spatial reasoning in complex 3D environments. At its core is the Neural Intensity Vector (Neural IV), a learned spatial audio representation that produces robust directional cues even under reverberation and overlapping sources. We further release SpatialSceneQA, a 61k-sample benchmark with degree-level azimuth/elevation supervision for direction-of-arrival estimation, 3D box grounding, and multi-speaker matching.

Figure: JAEGER architecture. Overview of the JAEGER framework.

📦 SpatialSceneQA Dataset

A 61k-sample spatial audio-visual benchmark pairing RGB-D observations with 4-channel first-order ambisonic audio under degree-level 3D supervision. SpatialSceneQA is released in this repository under datasets/SpatialSceneQA/, with train, validation, and test archives for the HM3D FOA audio-visual setting.

Figure: Data construction pipeline of SpatialSceneQA.
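
To fetch only the dataset folder rather than the whole repository, something like the following may work. This is a minimal sketch, not an official loader: the repo id `tsinghua-ee/JAEGER` is assumed from the page hosting this card, the helper names `build_allow_patterns` and `download_subdir` and the `./jaeger_data` target directory are hypothetical, and the archive file names are not assumed (the pattern simply matches everything under the folder).

```python
from pathlib import PurePosixPath

# Assumption: repo id taken from the hub page hosting this card.
REPO_ID = "tsinghua-ee/JAEGER"
# Folder named in this README.
DATASET_DIR = "datasets/SpatialSceneQA"


def build_allow_patterns(subdir):
    """Return a glob list matching every file under `subdir` in the repo."""
    return [str(PurePosixPath(subdir) / "*")]


def download_subdir(subdir, local_dir="./jaeger_data"):
    """Download only `subdir` from the repo (hypothetical helper).

    Requires `pip install huggingface_hub`; imported lazily so the
    pattern helper above works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=build_allow_patterns(subdir),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Downloads the train/validation/test archives into ./jaeger_data.
    print(download_subdir(DATASET_DIR))
```

The same helper could be pointed at another folder of the repository by passing a different `subdir`.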

🔧 Checkpoints

JAEGER checkpoints are released under checkpoints/, including Classical IV and Neural IV variants for audio-only, audio-visual, and visual-grounding tasks.
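
For readers unfamiliar with the term, the sketch below illustrates one common textbook form of a classical (pseudo-)intensity vector for first-order ambisonics. It is not JAEGER's implementation and says nothing about how Neural IV is computed. It assumes a W, X, Y, Z channel order with unit gains and a broadband time-domain average; the actual data may use a different channel order, normalization, or a time-frequency formulation. The function names `classical_iv_doa` and `encode_plane_wave` are hypothetical.

```python
import numpy as np


def classical_iv_doa(foa):
    """Estimate (azimuth, elevation) in degrees from FOA audio.

    `foa` has shape (4, T) with channels assumed to be W, X, Y, Z.
    """
    w, x, y, z = foa
    # Time-averaged product of the omni channel with each dipole channel:
    # a pseudo-intensity vector, known only up to a positive scale factor.
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation


def encode_plane_wave(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal as an ideal, anechoic FOA plane wave (W, X, Y, Z)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.cos(az) * np.cos(el),  # X: front-back dipole
        np.sin(az) * np.cos(el),  # Y: left-right dipole
        np.sin(el),               # Z: up-down dipole
    ])
    return gains[:, None] * signal[None, :]


# Synthetic check: a single noise source at azimuth 40, elevation -15 degrees.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
foa = encode_plane_wave(signal, 40.0, -15.0)
az, el = classical_iv_doa(foa)
print(f"azimuth={az:.2f}, elevation={el:.2f}")
```

In this idealized single-source, anechoic case the estimate recovers the encoded direction. With reverberation or overlapping sources the averaged vector mixes contributions from several directions, which is the regime the abstract describes Neural IV as targeting.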


📝 Citation

@inproceedings{liu2026jaeger,
  title={JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments},
  author={Liu, Zhan and Tang, Changli and Wang, Yuxin and Zhu, Zhiyuan and Chen, Youjun and Shao, Yiwen and Wang, Tianzi and Ke, Lei and Jin, Zengrui and Zhang, Chao},
  booktitle={Proc. ICML},
  year={2026}
}