---
license: apache-2.0
extra_gated_eu_disallowed: true
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- spatial-reasoning
- 4d-vision
- vlm
---

# Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

This repository contains the model weights for the **DSR Suite**, which advances dynamic spatial reasoning in Vision Language Models (VLMs), as presented in the paper [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557).

## Introduction

Vision-language models (VLMs) typically excel at general understanding but struggle with **Dynamic Spatial Reasoning (DSR)**: reasoning about how object geometry and spatial relationships evolve in 3D space over time. To address this gap, we introduce the **DSR Suite**, which comprises:

1. **Automated Data Generation Pipeline**: A system that constructs DSR multiple-choice question-answer (QA) pairs from in-the-wild videos.
2. **DSR-Train**: A training dataset of 50K QAs generated by the pipeline.
3. **DSR-Bench**: A human-refined benchmark of 1,484 QAs for rigorous evaluation.
4. **Geometry Selection Module (GSM)**: A lightweight module designed to seamlessly integrate geometric priors from 3D foundation models into VLMs, specifically a **Qwen2.5-VL-7B** backbone, without compromising general understanding capabilities.

Experiments show that training with DSR-Train and integrating the GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning.

## Resources

- **Paper**: [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557)
- **GitHub Repository**: [https://github.com/TencentARC/DSR_Suite](https://github.com/TencentARC/DSR_Suite)
- **Hugging Face Dataset**: [TencentARC/DSR_Suite-Data](https://huggingface.co/datasets/TencentARC/DSR_Suite-Data)
- **Hugging Face Collection**: [TencentARC/dsr-suite](https://huggingface.co/collections/TencentARC/dsr-suite)

## Usage and Evaluation

For detailed instructions on environment setup, data generation, model training, and benchmark evaluation, please refer to the official [DSR_Suite GitHub repository](https://github.com/TencentARC/DSR_Suite).
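
If the released checkpoint follows the stock Qwen2.5-VL architecture, the weights should load with standard `transformers` tooling. The snippet below is a minimal, untested inference sketch following the usual Qwen2.5-VL video recipe; the model id, video path, and question are placeholder assumptions, and the GSM-augmented variant may instead require the custom code from the GitHub repository.

```python
# Minimal video-QA sketch with Hugging Face transformers (assumes a stock
# Qwen2.5-VL checkpoint; all ids and paths below are placeholders).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "TencentARC/DSR-Suite-7B"  # hypothetical id; use this repository's actual Hub id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},  # placeholder path
        {"type": "text", "text": "How does the red car move relative to the camera?"},
    ],
}]

# Build the chat prompt, extract video frames, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens from each sequence before decoding.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```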

The evaluation framework is based on [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). An example command for evaluating a trained model (like `Qwen2.5-VL-7B-Instruct-ForVideo-Spatial`) on the `Spatial-Reasoning` task is:

```bash
cd VLMEvalKit_mine
CUDA_VISIBLE_DEVICES=0 python run.py --data Spatial-Reasoning --model Qwen2.5-VL-7B-Instruct-ForVideo-Spatial --work-dir spatial_reasoning
```

## Citation

If you find our work useful, please consider citing:

```bibtex
@misc{zhou2025learning,
  title={Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models},
  author={Shengchao Zhou and Yuxin Chen and Yuying Ge and Wei Huang and Jiehong Lin and Ying Shan and Xiaojuan Qi},
  year={2025},
  eprint={2512.20557},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.20557},
}
```

## Acknowledgement

This work builds upon the following projects:

- [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL): The model codebase we built upon.
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): The evaluation framework we built upon.
- [Grounded SAM2](https://github.com/IDEA-Research/Grounded-SAM-2), [Orient Anything](https://github.com/SpatialVision/Orient-Anything), [π^3](https://github.com/yyfz/Pi3): Models used in our data generation pipeline to extract 3D cues.
- [Koala-36M](https://github.com/KlingTeam/Koala-36M): The video database we built QAs upon.