JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

3D AV-LLM leveraging RGB-D and First-Order Ambisonics for end-to-end grounding and spatial reasoning

📖 Abstract

We present JAEGER, a framework that extends audio-visual LLMs from 2D to 3D space through joint RGB-D observations and multi-channel first-order ambisonics (FOA), enabling reliable source localization and spatial reasoning in complex 3D environments. At its core is the Neural Intensity Vector (Neural IV), a learned spatial audio representation that produces robust directional cues even under reverberation and overlapping sources. We further release SpatialSceneQA, a 61k-sample benchmark with degree-level azimuth/elevation supervision for direction-of-arrival estimation, 3D box grounding, and multi-speaker matching.
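For context on what the Neural IV learns to improve upon, the sketch below computes the classical (non-learned) FOA pseudo-intensity vector and reads a single direction of arrival from it. It is an illustration under stated assumptions, not the authors' implementation: the channel order is assumed to be W, X, Y, Z with the X/Y/Z gains pointing toward the source, and the function names `stft` and `classical_iv_doa` are hypothetical. Recordings in another convention (for example ACN ordering) would need their channels reordered first.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Minimal Hann-windowed STFT; returns a (frames, n_fft // 2 + 1) complex array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def classical_iv_doa(foa, n_fft=512, hop=256):
    """Estimate one (azimuth, elevation) pair in degrees from a (4, samples) FOA signal.

    Assumes channel order W, X, Y, Z (an assumption, not taken from the dataset).
    """
    W, X, Y, Z = (stft(ch, n_fft, hop) for ch in foa)
    # Pseudo-intensity per time-frequency bin: Re{conj(W) * [X, Y, Z]}
    iv = np.stack([np.real(np.conj(W) * C) for C in (X, Y, Z)])  # shape (3, T, F)
    # Pool over all bins; a single dominant source is assumed here
    ix, iy, iz = iv.sum(axis=(1, 2))
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation

# Synthetic anechoic plane wave encoded at azimuth 40 degrees, elevation -15 degrees
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
az, el = np.radians(40.0), np.radians(-15.0)
foa = np.stack([
    s,
    s * np.cos(az) * np.cos(el),
    s * np.sin(az) * np.cos(el),
    s * np.sin(el),
])
est_az, est_el = classical_iv_doa(foa)
```

Pooling over every time-frequency bin only makes sense for a single source in a dry room. With reverberation or overlapping speakers the pooled vector is pulled toward a mixture of directions, which is the failure mode the abstract says the learned representation is meant to handle.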

Overview of the JAEGER framework.

📦 SpatialSceneQA Dataset

A 61k-sample spatial audio-visual benchmark pairing RGB-D observations with 4-channel first-order ambisonic audio under degree-level 3D supervision. SpatialSceneQA is released in this repository under datasets/SpatialSceneQA/, with train, validation, and test archives for the HM3D FOA audio-visual setting.

Data construction pipeline of SpatialSceneQA.
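Because the supervision is given as azimuth and elevation in degrees, a predicted direction can be scored by the great-circle angle between it and the label. The sketch below shows one common way to do that; it is an assumption for illustration, since the metric actually used for the benchmark is not stated here, and the helper names are hypothetical.

```python
import numpy as np

def to_unit_vector(az_deg, el_deg):
    """Convert azimuth/elevation in degrees to a 3D unit vector."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def angular_error_deg(pred, target):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs."""
    u, v = to_unit_vector(*pred), to_unit_vector(*target)
    # Clip guards against round-off pushing the dot product outside [-1, 1]
    cos_angle = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

# Azimuth wraps around: 350 and 10 degrees are 20 degrees apart, not 340
wrap_error = angular_error_deg((350.0, 0.0), (10.0, 0.0))
```

Comparing unit vectors rather than subtracting angles avoids two pitfalls: the wrap-around at 0/360 degrees of azimuth, and the fact that azimuth differences shrink in real angular terms as elevation approaches the poles.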

🔧 Checkpoints

JAEGER checkpoints are released under checkpoints/, including Classical IV and Neural IV variants for audio-only, audio-visual, and visual-grounding tasks.
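One possible way to fetch only the checkpoints/ or datasets/SpatialSceneQA/ folder is `snapshot_download` from the `huggingface_hub` package with an `allow_patterns` filter. This is a sketch under assumptions: the repository id `tsinghua-ee/JAEGER` is taken from the hosting page rather than from this card's text, the local directory name is arbitrary, and the helper names are hypothetical.

```python
REPO_ID = "tsinghua-ee/JAEGER"  # assumption: repository id as shown on the hosting page

def subfolder_pattern(subfolder):
    """Build a glob pattern matching every file under a repository subfolder."""
    return subfolder.strip("/") + "/*"

def fetch_subfolder(subfolder, local_dir="JAEGER"):
    """Download one subfolder of the repository and return the local path.

    Requires `pip install huggingface_hub`; imported lazily so this file can be
    loaded without the dependency installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[subfolder_pattern(subfolder)],
        local_dir=local_dir,
    )

if __name__ == "__main__":
    # Network access is needed for these calls
    fetch_subfolder("checkpoints")
    fetch_subfolder("datasets/SpatialSceneQA")
```

Filtering by pattern avoids pulling the dataset archives when only the model weights are needed, and vice versa.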

📝 Citation

@inproceedings{liu2026jaeger,
  title={JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments},
  author={Liu, Zhan and Tang, Changli and Wang, Yuxin and Zhu, Zhiyuan and Chen, Youjun and Shao, Yiwen and Wang, Tianzi and Ke, Lei and Jin, Zengrui and Zhang, Chao},
  booktitle={Proc. ICML},
  year={2026}
}

Paper: JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments (2602.18527, published Feb 20)