---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
base_model: Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
---
|
|
|
|
|
# SpatialThinker-7B

<p align="center">
  <a href="https://arxiv.org/abs/2511.07403">
    <img src="https://img.shields.io/badge/arXiv-2511.07403-b31b1b.svg" alt="arXiv">
  </a>
  <a href="https://hunarbatra.com/SpatialThinker">
    <img src="https://img.shields.io/badge/Project%20Page-blue.svg" alt="Project Page">
  </a>
  <a href="https://github.com/hunarbatra/SpatialThinker">
    <img src="https://img.shields.io/badge/GitHub-Repository-black.svg" alt="GitHub">
  </a>
</p>
|
|
**SpatialThinker-7B** is a 3D-aware multimodal large language model (MLLM) trained with reinforcement learning to integrate structured spatial grounding with multi-step reasoning. The model emulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations and then reasoning step by step towards an answer, with training guided by dense spatial rewards.
|
|
|
|
|
## Model Description

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards
- **Training Data**: STVQA-7K (7,587 spatial VQA samples); a loading sketch follows this list
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz
|
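The training set is available on the Hugging Face Hub as [OX-PIXL/STVQA-7K](https://huggingface.co/datasets/OX-PIXL/STVQA-7K). Below is a minimal loading sketch assuming the standard `datasets` API; the split and field names are not documented here, so inspect a sample before relying on them:

```python
from datasets import load_dataset

# Load STVQA-7K from the Hugging Face Hub.
# The "train" split name is an assumption; check the dataset card if it differs.
stvqa = load_dataset("OX-PIXL/STVQA-7K", split="train")

print(len(stvqa))       # expected to be on the order of 7,587 samples
print(stvqa[0].keys())  # inspect the available fields before building a pipeline
```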
|
|
|
|
## Key Features

- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding; a schematic sketch follows this list
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **Outperforms GPT-4o** on spatial understanding benchmarks while using only 7K training samples
|
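As a rough illustration of how a multi-objective reward combining those four terms can be composed, consider the sketch below. It is purely schematic: the exact term definitions, weights, and reward density used to train SpatialThinker-7B are specified in the paper, not here.

```python
def spatial_reward(format_ok: bool, count_score: float,
                   answer_correct: bool, grounding_score: float) -> float:
    """Schematic multi-objective reward; the terms and equal weights are illustrative assumptions."""
    r_format = 1.0 if format_ok else 0.0         # observe/scene/think/answer tags present and well-formed
    r_count = count_score                        # e.g. fraction of task-relevant objects mentioned
    r_accuracy = 1.0 if answer_correct else 0.0  # final answer matches the reference
    r_grounding = grounding_score                # e.g. overlap of predicted vs. reference boxes
    return 0.25 * (r_format + r_count + r_accuracy + r_grounding)
```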
|
|
|
|
## Inference Template

Use the following template for inference:

```
You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.

Image size: {Width} x {Height}
```
|
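A response should contain the four tagged sections in order. The example below is a hypothetical illustration of that structure only; it is not an actual model transcript, and the scene-graph serialization the model emits may differ:

```
<observe> A grey cat is resting on the seat cushion of a brown couch near a window. </observe>
<scene> objects: cat [120, 210, 340, 380], couch [60, 180, 610, 420]; relations: cat - on top of - couch </scene>
<think> The cat's box lies within the couch's seat region and above its base, so the cat is on top of the couch rather than beside or behind it. </think>
<answer> (A) on top of </answer>
```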
|
|
|
|
## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OX-PIXL/SpatialThinker-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("OX-PIXL/SpatialThinker-7B")

# Load image
image = Image.open("your_image.jpg")
width, height = image.size

# Prepare prompt with template
template = f"""You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.

Image size: {width} x {height}"""

question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": template + "\n\n" + question},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Trim the prompt tokens so only the newly generated response is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
print(output)
```
|
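Since the final choice is wrapped in `<answer>` tags, it can be pulled out of the decoded text with a simple regular expression. The helper below is an illustrative sketch, not part of the released code:

```python
import re
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the last <answer> ... </answer> block, or None if absent."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

print(extract_answer(output))  # e.g. "(A) on top of"
```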
|
|
|
|
## Citation

```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
  title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
  author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
  year={2025},
  eprint={2511.07403},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.07403},
}
```
|
|
|
|
|
## Links

- 📄 **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- 🌐 **Project Page**: [hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- 💻 **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- 🤗 **Dataset**: [OX-PIXL/STVQA-7K](https://huggingface.co/datasets/OX-PIXL/STVQA-7K)
|
|
|