zhousc, nielsr (HF Staff) committed
Commit d0df6d2 · verified · 1 Parent(s): 5de562d

Add model card and metadata for DSR Suite model (#1)


- Add model card and metadata for DSR Suite model (2b0baa8a762897ed26e0f132e1e23596bed670a6)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +63 -4
README.md CHANGED
@@ -1,4 +1,63 @@
- ---
- license: apache-2.0
- extra_gated_eu_disallowed: true
- ---
+ ---
+ license: apache-2.0
+ extra_gated_eu_disallowed: true
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ base_model: Qwen/Qwen2.5-VL-7B-Instruct
+ tags:
+ - spatial-reasoning
+ - 4d-vision
+ - vlm
+ ---
+
+ # Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
+
+ This repository contains the model weights for the **DSR Suite**, which introduces advancements in dynamic spatial reasoning for Vision Language Models (VLMs), as presented in the paper [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557).
+
+ ## Introduction
+ Vision-language models (VLMs) typically excel at general understanding but demonstrate weaknesses in **Dynamic Spatial Reasoning (DSR)** – the ability to reason about the evolution of object geometry and relationships in 3D space over time. To address this gap, we introduce **DSR Suite**, which comprises:
+
+ 1. **Automated Data Generation Pipeline**: A system that constructs multiple-choice question-answer pairs from in-the-wild videos for DSR.
+ 2. **DSR-Train**: A training dataset of 50K QAs generated by the pipeline.
+ 3. **DSR-Bench**: A human-refined benchmark with 1484 QAs for rigorous evaluation.
+ 4. **Geometry Selection Module (GSM)**: A lightweight module designed to seamlessly integrate geometric priors from 3D foundation models into VLMs, specifically a **Qwen2.5-VL-7B** backbone, without compromising general understanding capabilities.
+
+ Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning.
+
+ ## Resources
+ - **Paper**: [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557)
+ - **GitHub Repository**: [https://github.com/TencentARC/DSR_Suite](https://github.com/TencentARC/DSR_Suite)
+ - **Hugging Face Dataset**: [TencentARC/DSR_Suite-Data](https://huggingface.co/datasets/TencentARC/DSR_Suite-Data)
+ - **Hugging Face Collection**: [TencentARC/dsr-suite](https://huggingface.co/collections/TencentARC/dsr-suite)
+
+ ## Usage and Evaluation
+ For detailed instructions on environment setup, data generation, model training, and benchmark evaluation, please refer to the official [DSR_Suite GitHub repository](https://github.com/TencentARC/DSR_Suite).
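+
+ To fetch the DSR-Train and DSR-Bench data locally before training or evaluation, a minimal sketch using `huggingface_hub` is shown below (the dataset repo id comes from the Resources section above; `local_dir` is an illustrative choice):
+
+ ```python
+ # Minimal sketch: download the DSR Suite data repository with huggingface_hub.
+ # The repo id is taken from the Resources section; local_dir is an arbitrary example path.
+ from huggingface_hub import snapshot_download
+
+ data_dir = snapshot_download(
+     repo_id="TencentARC/DSR_Suite-Data",
+     repo_type="dataset",
+     local_dir="DSR_Suite-Data",
+ )
+ print(f"Dataset files downloaded to: {data_dir}")
+ ```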
35
+
36
+ The evaluation framework is based on [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). An example command for evaluating a trained model (like `Qwen2.5-VL-7B-Instruct-ForVideo-Spatial`) on the `Spatial-Reasoning` task is:
37
+
38
+ ```bash
39
+ cd VLMEvalKit_mine
40
+ CUDA_VISIBLE_DEVICES=0 python run.py --data Spatial-Reasoning --model Qwen2.5-VL-7B-Instruct-ForVideo-Spatial --work-dir spatial_reasoning
41
+ ```
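+
+ For a quick standalone inference check outside VLMEvalKit, the sketch below uses the stock Qwen2.5-VL interface in `transformers`. It is illustrative only: the repo id placeholder, video path, and question are assumptions, and it presumes the released weights load with the standard Qwen2.5-VL architecture.
+
+ ```python
+ # Illustrative sketch: run the checkpoint on a video question via the standard
+ # Qwen2.5-VL interface (see base_model above). Replace the placeholder repo id
+ # and video path with real values; this is not the official inference script.
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ model_id = "<dsr-suite-model-repo-id>"  # placeholder for this model repository
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
+         {"type": "text", "text": "How does the distance between the two cars change over time?"},
+     ],
+ }]
+
+ # Build the chat prompt and preprocess the video frames.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text], images=image_inputs, videos=video_inputs,
+     padding=True, return_tensors="pt",
+ ).to(model.device)
+
+ # Generate and decode only the newly produced tokens.
+ output_ids = model.generate(**inputs, max_new_tokens=128)
+ trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
+ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
+ ```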
+
+ ## Citation
+ If you find our work useful, please consider citing:
+
+ ```bibtex
+ @misc{zhou2025learning,
+ title={Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models},
+ author={Shengchao Zhou and Yuxin Chen and Yuying Ge and Wei Huang and Jiehong Lin and Ying Shan and Xiaojuan Qi},
+ year={2025},
+ eprint={2512.20557},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2512.20557},
+ }
+ ```
+
+ ## Acknowledgement
+ This work builds upon the following projects:
+ - [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL): The model codebase we built upon.
+ - [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): The evaluation framework we built upon.
+ - [Grounded SAM2](https://github.com/IDEA-Research/Grounded-SAM-2), [Orient Anything](https://github.com/SpatialVision/Orient-Anything), [π^3](https://github.com/yyfz/Pi3): Models used in our data generation pipeline to extract 3D cues.
+ - [Koala-36M](https://github.com/KlingTeam/Koala-36M): The video database our QAs are built upon.