---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
base_model: Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
---
# SpatialThinker-7B
<p align="center">
<a href="https://arxiv.org/abs/2511.07403">
<img src="https://img.shields.io/badge/arXiv-2511.07403-b31b1b.svg" alt="arXiv">
</a>
<a href="https://hunarbatra.com/SpatialThinker">
<img src="https://img.shields.io/badge/๐%20Project%20Page-blue.svg" alt="Project Page">
</a>
<a href="https://github.com/hunarbatra/SpatialThinker">
<img src="https://img.shields.io/badge/GitHub-Repository-black.svg" alt="GitHub">
</a>
</p>
**SpatialThinker-7B** is a 3D-aware multimodal large language model (MLLM) trained with reinforcement learning to couple structured spatial grounding with multi-step reasoning. The model mimics human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations before reasoning toward an answer; dense spatial rewards enforce this grounding during training.
## Model Description
- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards
- **Training Data**: STVQA-7K (7,587 spatial VQA samples)
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz
## Key Features
- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **Outperforms GPT-4o**: On spatial understanding benchmarks while using only 7K training samples
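The spatial grounding reward compares the boxes the model emits against reference boxes. The exact reward formulation is in the paper; as an illustrative sketch only (not necessarily the formulation used in training), a standard intersection-over-union (IoU) on `[x1, y1, x2, y2]` boxes looks like:

```python
def iou(box_a: list[float], box_b: list[float]) -> float:
    """Intersection-over-union for two [x1, y1, x2, y2] boxes."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```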
## Inference Template
Use the following template for inference:
```
You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.
Image size: {Width} x {Height}
```
## Output Format
The model generates structured output with four components:
1. **`<observe>`**: Scene description covering relevant objects
2. **`<scene>`**: JSON scene graph with objects (id, bbox) and relationships (subject, predicate, object)
3. **`<think>`**: Step-by-step reasoning as internal monologue
4. **`<answer>`**: Final answer with option letter and text
### Example Output
```
<observe>
The image shows a living room with a couch, a coffee table, and a cat sitting on the floor.
</observe>
<scene>
{
"objects": [
{"id": "couch.1", "bbox": [50, 100, 400, 350]},
{"id": "cat.1", "bbox": [200, 300, 280, 400]},
{"id": "table.1", "bbox": [150, 250, 350, 320]}
],
"relationships": [
{"subject": "cat.1", "predicate": "in front of", "object": "couch.1"},
{"subject": "cat.1", "predicate": "beside", "object": "table.1"}
]
}
</scene>
<think>
Looking at the scene graph, the cat is positioned in front of the couch and beside the coffee table. The bounding box coordinates show the cat is at y=300-400 while the couch extends to y=350, confirming the cat is on the floor in front of the couch.
</think>
<answer> (B) in front of the couch </answer>
```
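Downstream code usually needs to pull these four sections apart. A minimal parsing sketch (assuming the model emits well-formed tags and valid JSON inside `<scene>`, which is not guaranteed for every generation):

```python
import json
import re


def parse_output(text: str) -> dict:
    """Split a SpatialThinker response into its four tagged sections."""
    sections = {}
    for tag in ("observe", "scene", "think", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    # <scene> carries a JSON scene graph; parse it when present
    if sections["scene"] is not None:
        sections["scene"] = json.loads(sections["scene"])
    return sections
```

In practice you may want to wrap the `json.loads` call in a `try`/`except` and fall back to the raw string, since an RL-trained model can still occasionally produce malformed JSON.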
## Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"OX-PIXL/SpatialThinker-7B",
torch_dtype="auto",
device_map="auto"
)
processor = AutoProcessor.from_pretrained("OX-PIXL/SpatialThinker-7B")
# Load image
image = Image.open("your_image.jpg")
width, height = image.size
# Prepare prompt with template
template = f"""You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.
Image size: {width} x {height}"""
question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": template + "\n\n" + question},
],
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
# Generate, then decode only the newly produced tokens (drop the echoed prompt)
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
output = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
print(output)
```
## Evaluation Results
SpatialThinker-7B achieves state-of-the-art performance on spatial reasoning benchmarks:
| Benchmark | SpatialThinker-7B |
|-----------|------------------------|
| CV-Bench (3D) | Strong performance |
| BLINK-Spatial | Outperforms GPT-4o |
| SpatialBench | SOTA results |
| RealWorldQA | Competitive |
See the [paper](https://arxiv.org/abs/2511.07403) for detailed results.
## Citation
```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
year={2025},
eprint={2511.07403},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.07403},
}
```
## Links
- 📄 **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- 🌐 **Project Page**: [hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- 💻 **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- 🤗 **Dataset**: [OX-PIXL/STVQA-7K](https://huggingface.co/datasets/OX-PIXL/STVQA-7K)