---
license: apache-2.0
extra_gated_eu_disallowed: true
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- spatial-reasoning
- 4d-vision
- vlm
---

# Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

This repository contains the model weights for the **DSR Suite**, which advances dynamic spatial reasoning in Vision Language Models (VLMs), as presented in the paper [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557).

## Introduction

Vision-language models (VLMs) typically excel at general understanding but struggle with **Dynamic Spatial Reasoning (DSR)**: reasoning about how object geometry and spatial relationships evolve in 3D space over time. To address this gap, we introduce the **DSR Suite**, which comprises:

1. **Automated Data Generation Pipeline**: A system that constructs DSR multiple-choice question-answer (QA) pairs from in-the-wild videos.
2. **DSR-Train**: A training dataset of 50K QAs generated by the pipeline.
3. **DSR-Bench**: A human-refined benchmark of 1,484 QAs for rigorous evaluation.
4. **Geometry Selection Module (GSM)**: A lightweight module designed to seamlessly integrate geometric priors from 3D foundation models into VLMs, specifically a **Qwen2.5-VL-7B** backbone, without compromising general understanding capabilities.

Experiments show that training with DSR-Train and integrating the GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning.

## Resources

- **Paper**: [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557)
- **GitHub Repository**: [https://github.com/TencentARC/DSR_Suite](https://github.com/TencentARC/DSR_Suite)
- **Hugging Face Dataset**: [TencentARC/DSR_Suite-Data](https://huggingface.co/datasets/TencentARC/DSR_Suite-Data)
- **Hugging Face Collection**: [TencentARC/dsr-suite](https://huggingface.co/collections/TencentARC/dsr-suite)

## Usage and Evaluation

For detailed instructions on environment setup, data generation, model training, and benchmark evaluation, please refer to the official [DSR_Suite GitHub repository](https://github.com/TencentARC/DSR_Suite).
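
If the released checkpoint follows the stock Qwen2.5-VL architecture, the weights should load with standard `transformers` tooling. The snippet below is a minimal, untested inference sketch following the usual Qwen2.5-VL video recipe; the model id, video path, and question are placeholder assumptions, and the GSM-augmented variant may instead require the custom code from the GitHub repository.

```python
# Minimal video-QA sketch with Hugging Face transformers (assumes a stock
# Qwen2.5-VL checkpoint; all ids and paths below are placeholders).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "TencentARC/DSR-Suite-7B"  # hypothetical id; use this repository's actual Hub id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},  # placeholder path
        {"type": "text", "text": "How does the red car move relative to the camera?"},
    ],
}]

# Build the chat prompt, extract video frames, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens from each sequence before decoding.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```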

The evaluation framework is based on [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). An example command for evaluating a trained model (like `Qwen2.5-VL-7B-Instruct-ForVideo-Spatial`) on the `Spatial-Reasoning` task is:

```bash
cd VLMEvalKit_mine
CUDA_VISIBLE_DEVICES=0 python run.py --data Spatial-Reasoning --model Qwen2.5-VL-7B-Instruct-ForVideo-Spatial --work-dir spatial_reasoning
```

## Citation

If you find our work useful, please consider citing:

```bibtex
@misc{zhou2025learning,
  title={Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models},
  author={Shengchao Zhou and Yuxin Chen and Yuying Ge and Wei Huang and Jiehong Lin and Ying Shan and Xiaojuan Qi},
  year={2025},
  eprint={2512.20557},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.20557},
}
```

## Acknowledgement

This work builds upon the following projects:

- [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL): The model codebase we built upon.
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): The evaluation framework we built upon.
- [Grounded SAM2](https://github.com/IDEA-Research/Grounded-SAM-2), [Orient Anything](https://github.com/SpatialVision/Orient-Anything), [π^3](https://github.com/yyfz/Pi3): Models used in our data generation pipeline to extract 3D cues.
- [Koala-36M](https://github.com/KlingTeam/Koala-36M): The video database we built QAs upon.