Add model card and metadata for DSR Suite model
Hi! I'm Niels from the Hugging Face community science team. I've opened this PR to add a comprehensive model card and metadata for your DSR Suite model.
This PR:
- Adds the `video-text-to-text` pipeline tag for better discoverability.
- Adds the `library_name: transformers` tag as indicated by the `config.json` and `tokenizer_config.json` files, enabling automated code snippets.
- Links the model to the paper [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557).
- Includes links to the project's GitHub repository and the associated Hugging Face dataset and collection.
- Provides an introduction summarizing the model's capabilities in dynamic spatial reasoning and details on its usage for evaluation.
- Includes the correct BibTeX citation and acknowledgements.
Please review and merge if this looks good to you!
---
license: apache-2.0
extra_gated_eu_disallowed: true
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- spatial-reasoning
- 4d-vision
- vlm
---

# Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

This repository contains the model weights for the **DSR Suite**, which introduces advancements in dynamic spatial reasoning for Vision Language Models (VLMs), as presented in the paper [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557).

## Introduction
Vision-language models (VLMs) typically excel at general understanding but demonstrate weaknesses in **Dynamic Spatial Reasoning (DSR)** – the ability to reason about the evolution of object geometry and relationships in 3D space over time. To address this gap, we introduce the **DSR Suite**, which comprises:

1. **Automated Data Generation Pipeline**: A system that constructs multiple-choice question-answer pairs for DSR from in-the-wild videos.
2. **DSR-Train**: A training dataset of 50K QAs generated by the pipeline (a hypothetical record is sketched at the end of this section).
3. **DSR-Bench**: A human-refined benchmark with 1,484 QAs for rigorous evaluation.
4. **Geometry Selection Module (GSM)**: A lightweight module designed to seamlessly integrate geometric priors from 3D foundation models into VLMs, specifically a **Qwen2.5-VL-7B** backbone, without compromising general understanding capabilities.

Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning.
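
To make item 2 concrete, here is a purely hypothetical sketch of what a single multiple-choice QA record could look like. Every field name and value below is an illustrative assumption, not the released DSR-Train schema; consult the [dataset repository](https://huggingface.co/datasets/TencentARC/DSR_Suite-Data) for the actual format.

```python
# Hypothetical illustration only -- NOT the released DSR-Train schema.
# All field names and values are assumptions made for readability.
example_qa = {
    "video": "clip_000123.mp4",  # an in-the-wild source clip (name made up)
    "question": "After the camera pans left, which object is closest to the red car?",
    "options": {
        "A": "the pedestrian",
        "B": "the blue truck",
        "C": "the traffic cone",
        "D": "the bicycle",
    },
    "answer": "B",  # a single correct option, as in a multiple-choice setup
}
```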

## Resources
- **Paper**: [Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models](https://huggingface.co/papers/2512.20557)
- **GitHub Repository**: [https://github.com/TencentARC/DSR_Suite](https://github.com/TencentARC/DSR_Suite)
- **Hugging Face Dataset**: [TencentARC/DSR_Suite-Data](https://huggingface.co/datasets/TencentARC/DSR_Suite-Data)
- **Hugging Face Collection**: [TencentARC/dsr-suite](https://huggingface.co/collections/TencentARC/dsr-suite)

## Usage and Evaluation
For detailed instructions on environment setup, data generation, model training, and benchmark evaluation, please refer to the official [DSR_Suite GitHub repository](https://github.com/TencentARC/DSR_Suite).

The evaluation framework is based on [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). An example command for evaluating a trained model (like `Qwen2.5-VL-7B-Instruct-ForVideo-Spatial`) on the `Spatial-Reasoning` task is:

```bash
cd VLMEvalKit_mine
CUDA_VISIBLE_DEVICES=0 python run.py --data Spatial-Reasoning --model Qwen2.5-VL-7B-Instruct-ForVideo-Spatial --work-dir spatial_reasoning
```
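
Since the metadata declares `library_name: transformers` on a Qwen2.5-VL-7B base, the weights should also load through the standard Qwen2.5-VL code path, at least for checkpoints that keep the plain backbone (a GSM-equipped checkpoint may require the project's own code). Below is a minimal sketch following the usual Qwen2.5-VL recipe; the repo id and video path are placeholders, and `qwen-vl-utils` is the standard Qwen helper package.

```python
# Minimal inference sketch using the standard Qwen2.5-VL recipe.
# Assumptions: "TencentARC/DSR-Suite-7B" is a PLACEHOLDER repo id (substitute
# this model's actual id), and the checkpoint loads as a plain Qwen2.5-VL model.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "TencentARC/DSR-Suite-7B"  # placeholder
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Ask a dynamic-spatial question about a local video (path is illustrative).
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4"},
        {"type": "text", "text": "Does the cyclist move toward or away from the camera?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```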

## Citation
If you find our work useful, please consider citing:

```bibtex
@misc{zhou2025learning,
      title={Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models},
      author={Shengchao Zhou and Yuxin Chen and Yuying Ge and Wei Huang and Jiehong Lin and Ying Shan and Xiaojuan Qi},
      year={2025},
      eprint={2512.20557},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.20557},
}
```

## Acknowledgement
This work builds upon the following projects:
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL): The model codebase we built upon.
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): The evaluation framework we built upon.
- [Grounded SAM2](https://github.com/IDEA-Research/Grounded-SAM-2), [Orient Anything](https://github.com/SpatialVision/Orient-Anything), [π^3](https://github.com/yyfz/Pi3): Models used in our data generation pipeline to extract 3D cues.
- [Koala-36M](https://github.com/KlingTeam/Koala-36M): The video database we built our QAs upon.