---
license: apache-2.0
extra_gated_eu_disallowed: true
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- spatial-reasoning
- 4d-vision
- vlm
---

# Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

This repository contains the model weights for the **DSR Suite**, which introduces advancements in dynamic spatial reasoning for Vision Language Models (VLMs), as presented in the paper [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557).

## Introduction
Vision-language models (VLMs) typically excel at general understanding but struggle with **Dynamic Spatial Reasoning (DSR)**: reasoning about how object geometry and inter-object relationships evolve in 3D space over time. To address this gap, we introduce the **DSR Suite**, which comprises:

1.  **Automated Data Generation Pipeline**: A system that constructs multiple-choice question-answer pairs from in-the-wild videos for DSR.
2.  **DSR-Train**: A training dataset of 50K QAs generated by the pipeline.
3.  **DSR-Bench**: A human-refined benchmark with 1,484 QAs for rigorous evaluation.
4.  **Geometry Selection Module (GSM)**: A lightweight module designed to seamlessly integrate geometric priors from 3D foundation models into VLMs, specifically a **Qwen2.5-VL-7B** backbone, without compromising general understanding capabilities.

Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning.

## Resources
- **Paper**: [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557)
- **GitHub Repository**: [https://github.com/TencentARC/DSR_Suite](https://github.com/TencentARC/DSR_Suite)
- **Hugging Face Dataset**: [TencentARC/DSR_Suite-Data](https://huggingface.co/datasets/TencentARC/DSR_Suite-Data)
- **Hugging Face Collection**: [TencentARC/dsr-suite](https://huggingface.co/collections/TencentARC/dsr-suite)

## Usage and Evaluation
For detailed instructions on environment setup, data generation, model training, and benchmark evaluation, please refer to the official [DSR_Suite GitHub repository](https://github.com/TencentARC/DSR_Suite).
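
As a quick reference, below is a minimal inference sketch using the `transformers` library. It assumes this checkpoint loads through the same path as its base model, Qwen/Qwen2.5-VL-7B-Instruct (i.e., via `Qwen2_5_VLForConditionalGeneration` and `qwen_vl_utils`); the model path, video file, and question are placeholders, and the GSM geometric-prior pathway may require the code from the GitHub repository rather than plain `transformers`.

```python
# Minimal inference sketch (assumes the checkpoint follows the standard
# Qwen2.5-VL-7B-Instruct loading path; paths and prompt are placeholders).
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_path = "path/to/this/checkpoint"  # replace with this repo's ID or a local directory
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# A single video question in the Qwen2.5-VL chat format.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4", "fps": 1.0},
        {"type": "text", "text": "Which object moves closer to the camera over the clip?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```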

The evaluation framework is based on [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). An example command for evaluating a trained model (like `Qwen2.5-VL-7B-Instruct-ForVideo-Spatial`) on the `Spatial-Reasoning` task is:

```bash
cd VLMEvalKit_mine
CUDA_VISIBLE_DEVICES=0 python run.py --data Spatial-Reasoning --model Qwen2.5-VL-7B-Instruct-ForVideo-Spatial --work-dir spatial_reasoning
```
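
In VLMEvalKit's `run.py`, `--data` names the benchmark to evaluate, `--model` selects the registered model configuration, and `--work-dir` sets the directory where predictions and scores are written.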

## Citation
If you find our work useful, please consider citing:

```bibtex
@misc{zhou2025learning,
      title={Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models}, 
      author={Shengchao Zhou and Yuxin Chen and Yuying Ge and Wei Huang and Jiehong Lin and Ying Shan and Xiaojuan Qi},
      year={2025},
      eprint={2512.20557},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.20557}, 
}
```

## Acknowledgement
This work builds upon the following projects:
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL): The model codebase we built upon.
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): The evaluation framework we built upon.
- [Grounded SAM2](https://github.com/IDEA-Research/Grounded-SAM-2), [Orient Anything](https://github.com/SpatialVision/Orient-Anything), [π^3](https://github.com/yyfz/Pi3): Models used in our data generation pipeline to extract 3D cues.
- [Koala-36M](https://github.com/KlingTeam/Koala-36M): The video database we built QAs upon.