Safetensors
qwen2_5_vl

Improve model card: Add tags, detailed description, usage, and citation

#1
by nielsr HF Staff - opened
Files changed (1) hide show
  1. README.md +152 -3
README.md CHANGED
@@ -1,7 +1,156 @@
1
  ---
2
- datasets:
3
- - OX-PIXL/STVQA-7K
4
  base_model:
5
  - Qwen/Qwen2.5-VL-3B-Instruct
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6
  ---
7
- Paper: https://arxiv.org/abs/2511.07403
 
 
 
 
 
 
 
 
1
  ---
 
 
2
  base_model:
3
  - Qwen/Qwen2.5-VL-3B-Instruct
4
+ datasets:
5
+ - OX-PIXL/STVQA-7K
6
+ pipeline_tag: image-text-to-text
7
+ library_name: transformers
8
+ ---
9
+
10
+ # SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
11
+
12
+ The `SpatialThinker` model, presented in the paper [SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards](https://huggingface.co/papers/2511.07403), is a 3D-aware Multimodal Large Language Model (MLLM) trained with Reinforcement Learning (RL) to integrate structured spatial grounding with multi-step reasoning.
13
+
14
+ **Paper (ArXiv)**: [https://arxiv.org/abs/2511.07403](https://arxiv.org/abs/2511.07403)
15
+ **Project Page**: [https://hunarbatra.com/SpatialThinker/](https://hunarbatra.com/SpatialThinker/)
16
+ **Code**: [https://github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
17
+
18
+ ## Abstract
19
+ Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.
20
+
21
+ <p align="center">
22
+ <img src="https://github.com/hunarbatra/SpatialThinker/raw/main/assets/spatialthinker.jpg" width="60%" alt="SpatialThinker Overview">
23
+ </p>
24
+
25
+ ---
26
+
27
+ ### 🧩 Requirements
28
+
29
+ - Python 3.9+
30
+ - `transformers >= 4.49.0`
31
+ - `flash-attn >= 2.4.3`
32
+ - `vllm >= 0.7.3` (0.8.0 recommended)
33
+
34
+ ---
35
+
36
+ ### ⚙️ Installation
37
+
38
+ ```bash
39
+ pip install -e .
40
+ ```
41
+
42
+ ---
43
+
44
+ ### 🚀 Training
45
+
46
+ #### Train **SpatialThinker Models** with STVQA-7K, Dense Spatial Rewards + GRPO
47
+
48
+ ```bash
49
+ bash scripts/spatialthinker_3b_grpo.sh
50
+ ```
51
+ ```bash
52
+ bash scripts/spatialthinker_7b_grpo.sh
53
+ ```
54
+
55
+ #### Train **Baseline Models** (Vanilla GRPO) with STVQA-7K
56
+
57
+ ```bash
58
+ bash scripts/qwen_2_5_3b_stvqa_vanilla_grpo.sh
59
+ ```
60
+ ```bash
61
+ bash scripts/qwen_2_5_7b_stvqa_vanilla_grpo.sh
62
+ ```
63
+
64
+ ---
65
+
66
+ ### 🧠 Merge Checkpoints to Hugging Face Format
67
+ ```bash
68
+ python3 scripts/model_merger.py --local_dir path_to_your_last_actor_checkpoint
69
+ ```
70
+ ---
71
+
72
+ ### 🧪 Evaluation
73
+
74
+ To evaluate **SpatialThinker** or baseline models across spatial reasoning benchmarks, use the provided `evaluation/eval.py` script.
75
+
76
+ #### Basic Command Structure
77
+ ```bash
78
+ python3 evaluation/eval.py \
79
+ --dataset <dataset_name> \
80
+ --template <prompt_template> \ # e.g. `reasoning`, `no_reasoning`, `spatial_thinker`
81
+ --model_path <model_or_checkpoint> \
82
+ --cuda <gpu_id> \
83
+ --batch_size <num_samples_per_step> \
84
+ [--provider <inference_backend>] \
85
+ [--processor_name <tokenizer_or_processor>] \
86
+ [--custom_filename <output_name>]
87
+ ```
88
+
89
+ #### ⚙️ Example: Evaluate Across Multiple Benchmarks
90
+
91
+ ```bash
92
+ python3 evaluation/eval.py \
93
+ --dataset blink-spatial \
94
+ --template spatial_thinker \
95
+ --model_path OX-PIXL/SpatialThinker-3B \
96
+ --cuda 0 \
97
+ --batch_size 4
98
+ ```
99
+ ```bash
100
+ python3 evaluation/eval.py \
101
+ --dataset spatialbench \
102
+ --template spatial_thinker \
103
+ --model_path OX-PIXL/SpatialThinker-3B \
104
+ --cuda 0 \
105
+ --batch_size 2
106
+ ```
107
+
108
+ #### 📊 Example: Evaluate Using an API Provider (OpenAI / Anthropic)
109
+
110
+ ```bash
111
+ python3 evaluation/eval.py \
112
+ --dataset stvqa \
113
+ --template reasoning \
114
+ --model_path gpt-4o-2024-05-13 \
115
+ --provider openai \
116
+ --batch_size 1
117
+ ```
118
+ ```bash
119
+ python3 evaluation/eval.py \
120
+ --dataset stvqa \
121
+ --template reasoning \
122
+ --model_path claude-3-5-sonnet \
123
+ --provider anthropic \
124
+ --batch_size 1
125
+ ```
126
+
127
+ #### Supported Evaluation Datasets
128
+ `cv-bench`, `cv-bench-2D`, `cv-bench-3D`, `blink-spatial`, `blink-depth`, `blink-object`,
129
+ `blink-counting`, `blink-multi-view`, `blink-jigsaw`, `realworld_qa`, `spatialbench`, `mmvp`, `3dsrbench`,
130
+ `lego`, `spatialreasoner`, `robospatial`, `robospatial_rgb`, `stvqa`, `hallusionbench`.
131
+
132
+ ---
133
+ ### 📘 Citation
134
+
135
+ If you find this repository useful in your project, please consider giving a ⭐ and citing:
136
+
137
+ ```bibtex
138
+ @misc{batra2025spatialthinkerreinforcing3dreasoning,
139
+  title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
140
+  author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
141
+  year={2025},
142
+  eprint={2511.07403},
143
+  archivePrefix={arXiv},
144
+  primaryClass={cs.CV},
145
+  url={https://arxiv.org/abs/2511.07403},
146
+ }
147
+ ```
148
  ---
149
+
150
+ ### 🌟 Acknowledgements
151
+ This project builds upon the following open-source frameworks and works:
152
+ - [**EasyR1**](https://github.com/hiyouga/EasyR1) — An efficient, scalable, multi-modality RL training framework based on veRL
153
+ - [**LLaMA-Factory**](https://github.com/hunarbatra/LLaMA-Factory) — Unified efficient fine-tuning of 100+ LLMs & VLMs
154
+ - [**Qwen2.5-VL**](https://arxiv.org/abs/2502.13923) — Multimodal LLM series from the Qwen family
155
+
156
+ ---