|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
|
datasets: |
|
|
- OX-PIXL/STVQA-7K |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards |
|
|
|
|
|
The `SpatialThinker` model, presented in the paper [SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards](https://huggingface.co/papers/2511.07403), is a 3D-aware Multimodal Large Language Model (MLLM) trained with Reinforcement Learning (RL) to integrate structured spatial grounding with multi-step reasoning. |
|
|
|
|
|
**Paper (ArXiv)**: [https://arxiv.org/abs/2511.07403](https://arxiv.org/abs/2511.07403) |
|
|
**Project Page**: [https://hunarbatra.com/SpatialThinker/](https://hunarbatra.com/SpatialThinker/) |
|
|
**Code**: [https://github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker) |
|
|
|
|
|
## Abstract |
|
|
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning. |
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://github.com/hunarbatra/SpatialThinker/raw/main/assets/spatialthinker.jpg" width="60%" alt="SpatialThinker Overview"> |
|
|
</p> |
|
|
|
|
|
--- |
|
|
|
|
|
### 🧩 Requirements |
|
|
|
|
|
- Python 3.9+ |
|
|
- `transformers >= 4.49.0` |
|
|
- `flash-attn >= 2.4.3` |
|
|
- `vllm >= 0.7.3` (0.8.0 recommended) |
|
|
|
|
|
--- |
|
|
|
|
|
### ⚙️ Installation |
|
|
|
|
|
```bash |
|
|
pip install -e . |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
### 🚀 Training |
|
|
|
|
|
#### Train **SpatialThinker Models** with STVQA-7K, Dense Spatial Rewards + GRPO |
|
|
|
|
|
```bash |
|
|
bash scripts/spatialthinker_3b_grpo.sh |
|
|
``` |
|
|
```bash |
|
|
bash scripts/spatialthinker_7b_grpo.sh |
|
|
``` |
|
|
|
|
|
#### Train **Baseline Models** (Vanilla GRPO) with STVQA-7K |
|
|
|
|
|
```bash |
|
|
bash scripts/qwen_2_5_3b_stvqa_vanilla_grpo.sh |
|
|
``` |
|
|
```bash |
|
|
bash scripts/qwen_2_5_7b_stvqa_vanilla_grpo.sh |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
### 🧠 Merge Checkpoints to Hugging Face Format |
|
|
```bash |
|
|
python3 scripts/model_merger.py --local_dir path_to_your_last_actor_checkpoint |
|
|
``` |
|
|
--- |
|
|
|
|
|
### 🧪 Evaluation |
|
|
|
|
|
To evaluate **SpatialThinker** or baseline models across spatial reasoning benchmarks, use the provided `evaluation/eval.py` script. |
|
|
|
|
|
#### Basic Command Structure |
|
|
```bash |
|
|
python3 evaluation/eval.py \ |
|
|
--dataset <dataset_name> \ |
|
|
--template <prompt_template> \ # e.g. `reasoning`, `no_reasoning`, `spatial_thinker` |
|
|
--model_path <model_or_checkpoint> \ |
|
|
--cuda <gpu_id> \ |
|
|
--batch_size <num_samples_per_step> \ |
|
|
[--provider <inference_backend>] \ |
|
|
[--processor_name <tokenizer_or_processor>] \ |
|
|
[--custom_filename <output_name>] |
|
|
``` |
|
|
|
|
|
#### ⚙️ Example: Evaluate Across Multiple Benchmarks |
|
|
|
|
|
```bash |
|
|
python3 evaluation/eval.py \ |
|
|
--dataset blink-spatial \ |
|
|
--template spatial_thinker \ |
|
|
--model_path OX-PIXL/SpatialThinker-3B \ |
|
|
--cuda 0 \ |
|
|
--batch_size 4 |
|
|
``` |
|
|
```bash |
|
|
python3 evaluation/eval.py \ |
|
|
--dataset spatialbench \ |
|
|
--template spatial_thinker \ |
|
|
--model_path OX-PIXL/SpatialThinker-3B \ |
|
|
--cuda 0 \ |
|
|
--batch_size 2 |
|
|
``` |
|
|
|
|
|
#### 📊 Example: Evaluate Using an API Provider (OpenAI / Anthropic) |
|
|
|
|
|
```bash |
|
|
python3 evaluation/eval.py \ |
|
|
--dataset stvqa \ |
|
|
--template reasoning \ |
|
|
--model_path gpt-4o-2024-05-13 \ |
|
|
--provider openai \ |
|
|
--batch_size 1 |
|
|
``` |
|
|
```bash |
|
|
python3 evaluation/eval.py \ |
|
|
--dataset stvqa \ |
|
|
--template reasoning \ |
|
|
--model_path claude-3-5-sonnet \ |
|
|
--provider anthropic \ |
|
|
--batch_size 1 |
|
|
``` |
|
|
|
|
|
#### Supported Evaluation Datasets |
|
|
`cv-bench`, `cv-bench-2D`, `cv-bench-3D`, `blink-spatial`, `blink-depth`, `blink-object`, |
|
|
`blink-counting`, `blink-multi-view`, `blink-jigsaw`, `realworld_qa`, `spatialbench`, `mmvp`, `3dsrbench`, |
|
|
`lego`, `spatialreasoner`, `robospatial`, `robospatial_rgb`, `stvqa`, `hallusionbench`. |
|
|
|
|
|
--- |
|
|
### 📘 Citation |
|
|
|
|
|
If you find this repository useful in your project, please consider giving a ⭐ and citing: |
|
|
|
|
|
```bibtex |
|
|
@misc{batra2025spatialthinkerreinforcing3dreasoning, |
|
|
title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards}, |
|
|
author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark}, |
|
|
year={2025}, |
|
|
eprint={2511.07403}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2511.07403}, |
|
|
} |
|
|
``` |
|
|
--- |
|
|
|
|
|
### 🌟 Acknowledgements |
|
|
This project builds upon the following open-source frameworks and works: |
|
|
- [**EasyR1**](https://github.com/hiyouga/EasyR1) — An efficient, scalable, multi-modality RL training framework based on veRL |
|
|
- [**LLaMA-Factory**](https://github.com/hunarbatra/LLaMA-Factory) — Unified efficient fine-tuning of 100+ LLMs & VLMs |
|
|
- [**Qwen2.5-VL**](https://arxiv.org/abs/2502.13923) — Multimodal LLM series from the Qwen family |
|
|
|
|
|
--- |