arxiv:2512.20557

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

Published on Dec 23 · Submitted by Wei Huang on Dec 25
#1 Paper of the day
Abstract

DSR Suite enhances vision-language models with dynamic spatial reasoning through automated data generation and a geometry selection module that integrates geometric priors.

AI-generated summary

Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and inter-object relationships evolve in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across the dataset, benchmark, and model aspects, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs for DSR from in-the-wild videos. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) that seamlessly integrates geometric priors into VLMs by condensing question semantics and distilling question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.
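To make the pipeline idea concrete, here is a minimal, hypothetical sketch of how extracted 3D trajectories could be turned into one multiple-choice QA pair about evolving spatial relations. The trajectory format, the 5% tolerance, the question template, and the option set are all illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: build a DSR-style multiple-choice question from two object
# trajectories (lists of 3D points over time). Assumed data format.
import math

def dist(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def make_distance_question(name_a, traj_a, name_b, traj_b):
    """Ask how the inter-object distance evolves over the clip.
    The 5% tolerance band is an arbitrary illustrative choice."""
    d_start = dist(traj_a[0], traj_b[0])
    d_end = dist(traj_a[-1], traj_b[-1])
    if d_end > d_start * 1.05:
        answer = "increases"
    elif d_end < d_start * 0.95:
        answer = "decreases"
    else:
        answer = "stays roughly constant"
    options = ["increases", "decreases", "stays roughly constant"]
    return {
        "question": f"Over the clip, the distance between {name_a} "
                    f"and {name_b} ...",
        "options": options,
        "answer": options.index(answer),
    }

# Toy example: a car moves toward a static bench.
qa = make_distance_question(
    "the red car", [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
    "the bench", [(5, 0, 0), (5, 0, 0), (5, 0, 0)],
)
print(qa["options"][qa["answer"]])  # -> decreases
```

The same pattern extends to other relation templates (left/right under a viewpoint change, approach speed, occlusion order) by swapping the geometric predicate computed from the extracted cues.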

Community

DSR Suite delivers scalable 4D training/evaluation from real-world videos and a lightweight GSM module that injects targeted geometric priors into VLMs, markedly boosting dynamic spatial reasoning while preserving general video understanding.

Key Findings:

  1. An automated pipeline converts unconstrained real-world videos into DSR-focused multiple-choice QA, leveraging camera poses, local point clouds, object masks/poses, and 3D trajectories to build DSR-Train and the human-refined DSR-Bench.
  2. GSM integrates geometry via two stacked Q-Formers: one compresses question semantics; the other selects question-relevant 4D priors and condenses them into compact geometry tokens, reducing irrelevant input to the language model.
  3. Strong benchmark gains: Qwen2.5-VL-7B + GSM trained on DSR-Train achieves 58.9% on DSR-Bench, outperforming open/closed-source baselines (23.5%–38.4%), while maintaining performance on general video benchmarks.
  4. Better downstream agency: the trained model improves success on dynamic-agent tasks (e.g., MineDojo), excelling at interactions, combat, and resource acquisition in Minecraft.
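The geometry-token selection in Finding 2 can be sketched as attention pooling: a condensed question vector scores per-frame geometry features and keeps a relevance-weighted summary. This is a single-head dot-product attention toy in pure Python, assuming fixed-size feature vectors; the actual GSM uses stacked Q-Formers over pretrained 4D reconstruction features.

```python
# Sketch: question-conditioned selection of geometry features into one
# compact "geometry token" via softmax attention. Illustrative only.
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def select_geometry_token(question_vec, geometry_feats):
    """Weight each geometry feature by its dot-product relevance to
    the question, then return the weighted average feature."""
    scores = [sum(q * g for q, g in zip(question_vec, f))
              for f in geometry_feats]
    weights = softmax(scores)
    dim = len(question_vec)
    return [sum(w * f[i] for w, f in zip(weights, geometry_feats))
            for i in range(dim)]

# The question vector emphasizes the first axis, so the second
# geometry feature (aligned with that axis) dominates the token.
token = select_geometry_token(
    [4.0, 0.0, 0.0],
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.2, 0.2, 0.6]],
)
print(len(token))  # -> 3
```

The point of this targeted pooling, as the abstract notes, is that the language model receives only a small set of question-relevant geometry tokens rather than the full 4D reconstruction.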

