---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
base_model: Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

# SpatialThinker-7B

<p align="center">
  <a href="https://arxiv.org/abs/2511.07403">
    <img src="https://img.shields.io/badge/arXiv-2511.07403-b31b1b.svg" alt="arXiv">
  </a>
  <a href="https://hunarbatra.com/SpatialThinker">
    <img src="https://img.shields.io/badge/🌐%20Project%20Page-blue.svg" alt="Project Page">
  </a>
  <a href="https://github.com/hunarbatra/SpatialThinker">
    <img src="https://img.shields.io/badge/GitHub-Repository-black.svg" alt="GitHub">
  </a>
</p>

**SpatialThinker-7B** is a 3D-aware multimodal large language model (MLLM) trained with reinforcement learning to combine structured spatial grounding with multi-step reasoning. The model mimics human-like spatial perception by first constructing a scene graph of task-relevant objects and their spatial relations, then reasoning over it towards an answer; training with dense spatial rewards reinforces this behaviour.

## Model Description

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards
- **Training Data**: STVQA-7K (7,587 spatial VQA samples)
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz

## Key Features

- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **Outperforms GPT-4o**: On spatial understanding benchmarks while using only 7K training samples
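
To make the "dense spatial rewards" idea concrete, here is a toy sketch of a multi-objective reward combining format, count, accuracy, and grounding terms. The weights and term definitions are illustrative assumptions for exposition, not the paper's exact formulation; see the paper for the actual reward design.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def spatial_reward(has_all_tags, answer_correct, pred_boxes, gt_boxes,
                   w_format=0.1, w_count=0.1, w_acc=0.6, w_ground=0.2):
    """Toy multi-objective reward; weights are illustrative, not the paper's."""
    r_format = 1.0 if has_all_tags else 0.0  # all four output tags present?
    # penalise predicting too few or too many objects
    r_count = max(1.0 - abs(len(pred_boxes) - len(gt_boxes)) / max(len(gt_boxes), 1), 0.0)
    r_acc = 1.0 if answer_correct else 0.0   # final-answer correctness
    # grounding: best IoU of each predicted box against ground-truth boxes
    r_ground = (sum(max(iou(p, g) for g in gt_boxes) for p in pred_boxes)
                / len(pred_boxes)) if pred_boxes else 0.0
    return w_format * r_format + w_count * r_count + w_acc * r_acc + w_ground * r_ground
```

A perfectly formatted, correctly answered, perfectly grounded rollout scores 1.0 under these weights; a well-grounded rollout with the wrong final answer still earns partial credit, which is the point of dense (rather than answer-only) rewards.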

## Inference Template

Use the following template for inference, replacing `{Width}` and `{Height}` with the actual image dimensions:

```
You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.

Image size: {Width} x {Height}
```

## Output Format

The model generates structured output with four components:

1. **`<observe>`**: Scene description covering relevant objects
2. **`<scene>`**: JSON scene graph with objects (id, bbox) and relationships (subject, predicate, object)
3. **`<think>`**: Step-by-step reasoning as internal monologue
4. **`<answer>`**: Final answer with option letter and text
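
The four tagged sections can be pulled out of a raw response with a small helper; this parser is our own convenience sketch, not part of the model's tooling:

```python
import re

def parse_output(text):
    """Extract the <observe>/<scene>/<think>/<answer> sections from a response.

    Returns a dict mapping each tag name to its stripped contents,
    or None if that tag is missing.
    """
    sections = {}
    for tag in ("observe", "scene", "think", "answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        sections[tag] = m.group(1).strip() if m else None
    return sections
```

For example, `parse_output(output)["answer"]` yields the final choice string (e.g. `(B) in front of the couch`) for downstream scoring.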

### Example Output

```
<observe>
The image shows a living room with a couch, a coffee table, and a cat sitting on the floor.
</observe>
<scene>
{
  "objects": [
    {"id": "couch.1", "bbox": [50, 100, 400, 350]},
    {"id": "cat.1", "bbox": [200, 300, 280, 400]},
    {"id": "table.1", "bbox": [150, 250, 350, 320]}
  ],
  "relationships": [
    {"subject": "cat.1", "predicate": "in front of", "object": "couch.1"},
    {"subject": "cat.1", "predicate": "beside", "object": "table.1"}
  ]
}
</scene>
<think>
Looking at the scene graph, the cat is positioned in front of the couch and beside the coffee table. The bounding box coordinates show the cat is at y=300-400 while the couch extends to y=350, confirming the cat is on the floor in front of the couch.
</think>
<answer> (B) in front of the couch </answer>
```
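
Because the `<scene>` block is plain JSON, it can be loaded and inspected directly. The snippet below parses the example scene graph above, assuming (as the example suggests) that bboxes are `[x1, y1, x2, y2]` in pixel coordinates:

```python
import json

# The <scene> body from the example output, as ordinary JSON.
scene_json = """
{
  "objects": [
    {"id": "couch.1", "bbox": [50, 100, 400, 350]},
    {"id": "cat.1", "bbox": [200, 300, 280, 400]},
    {"id": "table.1", "bbox": [150, 250, 350, 320]}
  ],
  "relationships": [
    {"subject": "cat.1", "predicate": "in front of", "object": "couch.1"},
    {"subject": "cat.1", "predicate": "beside", "object": "table.1"}
  ]
}
"""
scene = json.loads(scene_json)
# Build an id -> bbox lookup for geometric checks
boxes = {obj["id"]: obj["bbox"] for obj in scene["objects"]}
# e.g. the cat's bottom edge (y2 = 400) extends below the couch's (y2 = 350),
# consistent with "in front of the couch" in this camera view
assert boxes["cat.1"][3] > boxes["couch.1"][3]
```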

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OX-PIXL/SpatialThinker-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("OX-PIXL/SpatialThinker-7B")

# Load image
image = Image.open("your_image.jpg")
width, height = image.size

# Prepare prompt with template
template = f"""You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.

Image size: {width} x {height}"""

question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": template + "\n\n" + question},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Trim the prompt tokens so only the newly generated response is decoded
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```

## Evaluation Results

SpatialThinker-7B achieves state-of-the-art performance on spatial reasoning benchmarks:

| Benchmark | SpatialThinker-7B |
|-----------|------------------------|
| CV-Bench (3D) | Strong performance |
| BLINK-Spatial | Outperforms GPT-4o |
| SpatialBench | SOTA results |
| RealWorldQA | Competitive |

See the [paper](https://arxiv.org/abs/2511.07403) for detailed results.

## Citation

```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,  
  title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},  
  author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},  
  year={2025},  
  eprint={2511.07403},  
  archivePrefix={arXiv},  
  primaryClass={cs.CV},  
  url={https://arxiv.org/abs/2511.07403},  
}
```

## Links

- 📄 **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- 🌐 **Project Page**: [hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- 💻 **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- 🤗 **Dataset**: [OX-PIXL/STVQA-7K](https://huggingface.co/datasets/OX-PIXL/STVQA-7K)