---
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
datasets:
  - OX-PIXL/STVQA-7K
pipeline_tag: image-text-to-text
library_name: transformers
---

# SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

SpatialThinker, presented in the paper [SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards](https://arxiv.org/abs/2511.07403), is a 3D-aware multimodal large language model (MLLM) trained with reinforcement learning (RL) to integrate structured spatial grounding with multi-step reasoning. This repository hosts the 3B-parameter checkpoint, trained from Qwen/Qwen2.5-VL-3B-Instruct.

- **Paper (arXiv):** https://arxiv.org/abs/2511.07403
- **Project page:** https://hunarbatra.com/SpatialThinker/
- **Code:** https://github.com/hunarbatra/SpatialThinker

## Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.

*Figure: SpatialThinker overview.*
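
## 💡 Usage

A minimal inference sketch with 🤗 Transformers is shown below. It assumes SpatialThinker-3B loads with the standard Qwen2.5-VL classes (it is trained from Qwen/Qwen2.5-VL-3B-Instruct) and follows the usual Qwen2.5-VL chat-template workflow; the image URL and question are placeholders.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Load the model and processor (device_map="auto" requires `accelerate`).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OX-PIXL/SpatialThinker-3B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("OX-PIXL/SpatialThinker-3B")

# Placeholder image URL and spatial question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/scene.jpg"},
            {"type": "text", "text": "Which object is closer to the camera, the mug or the kettle?"},
        ],
    }
]

# Build the chat prompt, pack the visual inputs, and generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Since SpatialThinker reasons step by step (constructing a scene graph of task-relevant objects before answering), leave headroom in `max_new_tokens` for the reasoning trace.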


## 🧩 Requirements

- Python 3.9+
- `transformers >= 4.49.0`
- `flash-attn >= 2.4.3`
- `vllm >= 0.7.3` (0.8.0 recommended)
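
As a quick sanity check, the pins above can be verified against the active environment; the snippet below is a convenience sketch and not part of the repository:

```python
from importlib.metadata import version, PackageNotFoundError

# Convenience check that the pinned packages are installed.
# "flash_attn" is the distribution name used by the flash-attn wheel.
for pkg in ("transformers", "flash_attn", "vllm"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```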

## ⚙️ Installation

From the root of the cloned repository (see the Code link above), install in editable mode:

```bash
pip install -e .
```

## 🚀 Training

### Train SpatialThinker Models with STVQA-7K, Dense Spatial Rewards + GRPO

```bash
bash scripts/spatialthinker_3b_grpo.sh
bash scripts/spatialthinker_7b_grpo.sh
```

### Train Baseline Models (Vanilla GRPO) with STVQA-7K

```bash
bash scripts/qwen_2_5_3b_stvqa_vanilla_grpo.sh
bash scripts/qwen_2_5_7b_stvqa_vanilla_grpo.sh
```
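
Both sets of runs train on STVQA-7K. Before launching a job, it can be useful to confirm the dataset downloads cleanly from the Hugging Face Hub; the sketch below makes no assumptions about split or column names, it only prints them:

```python
from datasets import load_dataset

# Download STVQA-7K from the Hub and print its splits, columns, and first row.
ds = load_dataset("OX-PIXL/STVQA-7K")
print(ds)  # available splits and row counts

split = next(iter(ds))  # whichever split comes first
print(ds[split].column_names)
print(ds[split][0])
```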

## 🧠 Merge Checkpoints to Hugging Face Format

```bash
python3 scripts/model_merger.py --local_dir path_to_your_last_actor_checkpoint
```
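
The merged directory can then be loaded like any other Transformers checkpoint; a hedged sketch, assuming the merger writes a standard Hugging Face model directory (the path below is a placeholder, and if processor files are not included, load the processor from the base model instead):

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# "path_to_merged_checkpoint" is a placeholder for the directory written by model_merger.py.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "path_to_merged_checkpoint", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("path_to_merged_checkpoint")

# Optionally publish it to the Hub (requires `huggingface-cli login` beforehand).
# model.push_to_hub("your-username/SpatialThinker-3B-merged")
# processor.push_to_hub("your-username/SpatialThinker-3B-merged")
```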

## 🧪 Evaluation

To evaluate SpatialThinker or baseline models across spatial reasoning benchmarks, use the provided `evaluation/eval.py` script.

### Basic Command Structure

```bash
# --template options include `reasoning`, `no_reasoning`, and `spatial_thinker`.
python3 evaluation/eval.py \
    --dataset <dataset_name> \
    --template <prompt_template> \
    --model_path <model_or_checkpoint> \
    --cuda <gpu_id> \
    --batch_size <num_samples_per_step> \
    [--provider <inference_backend>] \
    [--processor_name <tokenizer_or_processor>] \
    [--custom_filename <output_name>]
```

### ⚙️ Example: Evaluate Across Multiple Benchmarks

```bash
python3 evaluation/eval.py \
    --dataset blink-spatial \
    --template spatial_thinker \
    --model_path OX-PIXL/SpatialThinker-3B \
    --cuda 0 \
    --batch_size 4

python3 evaluation/eval.py \
    --dataset spatialbench \
    --template spatial_thinker \
    --model_path OX-PIXL/SpatialThinker-3B \
    --cuda 0 \
    --batch_size 2
```

### 📊 Example: Evaluate Using an API Provider (OpenAI / Anthropic)

```bash
python3 evaluation/eval.py \
    --dataset stvqa \
    --template reasoning \
    --model_path gpt-4o-2024-05-13 \
    --provider openai \
    --batch_size 1

python3 evaluation/eval.py \
    --dataset stvqa \
    --template reasoning \
    --model_path claude-3-5-sonnet \
    --provider anthropic \
    --batch_size 1
```

### Supported Evaluation Datasets

`cv-bench`, `cv-bench-2D`, `cv-bench-3D`, `blink-spatial`, `blink-depth`, `blink-object`, `blink-counting`, `blink-multi-view`, `blink-jigsaw`, `realworld_qa`, `spatialbench`, `mmvp`, `3dsrbench`, `lego`, `spatialreasoner`, `robospatial`, `robospatial_rgb`, `stvqa`, `hallusionbench`
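
To sweep several of these benchmarks in one run, the documented CLI can be wrapped in a small driver loop; the snippet below is purely a convenience sketch, and the chosen benchmark subset and batch size are arbitrary:

```python
import subprocess

# Run evaluation/eval.py over a few benchmarks, exactly as documented above.
benchmarks = ["cv-bench", "blink-spatial", "spatialbench", "realworld_qa"]
for name in benchmarks:
    subprocess.run(
        [
            "python3", "evaluation/eval.py",
            "--dataset", name,
            "--template", "spatial_thinker",
            "--model_path", "OX-PIXL/SpatialThinker-3B",
            "--cuda", "0",
            "--batch_size", "4",
        ],
        check=True,  # stop if any benchmark run fails
    )
```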


## 📘 Citation

If you find this repository useful in your project, please consider giving a ⭐ and citing:

```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
  title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
  author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
  year={2025},
  eprint={2511.07403},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.07403},
}
```

## 🌟 Acknowledgements

This project builds upon the following open-source frameworks and works:

- EasyR1 — An efficient, scalable, multi-modality RL training framework based on veRL
- LLaMA-Factory — Unified efficient fine-tuning of 100+ LLMs & VLMs
- Qwen2.5-VL — Multimodal LLM series from the Qwen family