---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
- mixture-of-experts
base_model: Qwen/Qwen3-VL-30B-A3B-Instruct
pipeline_tag: image-text-to-text
---

# SpatialThinker-30B

<p align="center">
  <a href="https://arxiv.org/abs/2511.07403">
    <img src="https://img.shields.io/badge/arXiv-2511.07403-b31b1b.svg" alt="arXiv">
  </a>
  <a href="https://hunarbatra.com/SpatialThinker">
    <img src="https://img.shields.io/badge/🌐%20Project%20Page-blue.svg" alt="Project Page">
  </a>
  <a href="https://github.com/hunarbatra/SpatialThinker">
    <img src="https://img.shields.io/badge/GitHub-Repository-black.svg" alt="GitHub">
  </a>
</p>

**SpatialThinker-30B** is a 30B-parameter Mixture-of-Experts (3B active) multimodal large language model trained with reinforcement learning to integrate structured spatial grounding with multi-step reasoning. It scales the SpatialThinker method to the Qwen3-VL-30B-A3B-Instruct base, retaining the same training recipe: a four-tag scene-graph reasoning format and a dense spatial reward over format, count, accuracy, and grounding.

## Model Description

- **Base Model**: Qwen3-VL-30B-A3B-Instruct (Mixture-of-Experts; ~3B active parameters)
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards via Thinking Machines' Tinker
- **Training Data**: STVQA-7K (7,587 spatial VQA samples)
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz

## Key Features

- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **MoE Efficiency**: 30B total parameters with only ~3B active per token, giving comparable quality to dense 30B models at a fraction of the compute
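
To make the "dense spatial reward" idea concrete, here is a minimal sketch of how four signals (format, count, accuracy, grounding) could be combined into one scalar. It is an illustration under stated assumptions, not the training code: the function names, the use of plain IoU as a box-overlap score, and the equal weights are all hypothetical placeholders.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Placeholder weights for illustration only; NOT the values used in training.
WEIGHTS = {"format": 0.25, "count": 0.25, "accuracy": 0.25, "grounding": 0.25}


def combined_reward(format_ok, count_score, answer_correct, grounding_score,
                    weights=WEIGHTS):
    """Weighted sum of four reward components, each expected in [0, 1]."""
    components = {
        "format": 1.0 if format_ok else 0.0,
        "count": count_score,
        "accuracy": 1.0 if answer_correct else 0.0,
        "grounding": grounding_score,
    }
    return sum(weights[name] * components[name] for name in weights)
```

For example, `grounding_score` could be the mean `box_iou` between predicted and reference boxes for matched objects; how objects are matched is left open here.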

## Inference Template

The model uses the same four-tag format as SpatialThinker-7B (`<observe>`, `<scene>`, `<think>`, `<answer>`):

```
You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.

Image size: {Width} x {Height}
```

## Usage

```python
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
from PIL import Image

# The base model is a Mixture-of-Experts checkpoint, so load it with the MoE class.
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    "hunarbatra/SpatialThinker-30B",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("hunarbatra/SpatialThinker-30B")

# Load image
image = Image.open("your_image.jpg")
width, height = image.size

# Prepare prompt with template
template = f"""You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.

Image size: {width} x {height}"""

question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": template + "\n\n" + question},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

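generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens so only the newly generated response is decoded.
response_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(response_ids, skip_special_tokens=True)[0]
print(output)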
```
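
Once a response is generated, the four tagged sections usually need to be separated. The helper below is a minimal sketch of one way to do that with regular expressions. The function names are hypothetical, and it assumes the input is the model's response only, without the prompt (whose instructions also mention the tag names).

```python
import re

TAGS = ("observe", "scene", "think", "answer")


def parse_response(text: str) -> dict:
    """Return the stripped content of each tag, or None when a tag is missing."""
    sections = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


def has_valid_format(text: str) -> bool:
    """True if every tag pair appears exactly once, in the expected order."""
    positions = []
    for tag in TAGS:
        if text.count(f"<{tag}>") != 1 or text.count(f"</{tag}>") != 1:
            return False
        positions.append(text.index(f"<{tag}>"))
    return positions == sorted(positions)
```

Calling `parse_response(output)["answer"]` then yields the final choice, for example `(A) on top of`, when the response follows the template.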

## Training Details

- **Framework**: Thinking Machines' [Tinker](https://thinkingmachines.ai/tinker/) (LoRA on remote H100 cluster)
- **Steps**: 75
- **Batch size**: 16 prompts × 8 rollouts = 128 generations/step
- **Optimizer**: AdamW, lr=1e-6, KL coefficient=1e-2 (low_var_kl)
- **LoRA**: rank=64 on the language tower

The model was trained with several rollout-side fixes that lift the Qwen3-VL-Instruct base's format-pass rate from ~78% to ~96% during training:
- Forced `<observe>\n` assistant prefix (matches the four-tag schema the model is trained to produce)
- Postprocess rewrites for `<tool_call>` → `<think>` (the Instruct base's tool-use prior occasionally leaks)
- Repairs for orphan/unclosed `<think>` tags
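
The three fixes above can be illustrated with a small post-processing function. This is a hypothetical sketch, not the actual rollout code: the function name, the assumption that the sampled completion excludes the forced prefix, and the specific repair rules (where a missing tag is inserted) are all illustrative choices.

```python
ASSISTANT_PREFIX = "<observe>\n"  # forced assistant prefix from the list above


def repair_rollout(completion: str) -> str:
    """Apply prefix, tag-rewrite, and think-tag repairs to one completion."""
    # The model continues after the forced prefix, so restore it first.
    text = ASSISTANT_PREFIX + completion
    # Rewrite leaked tool-call tags into think tags.
    text = text.replace("<tool_call>", "<think>")
    text = text.replace("</tool_call>", "</think>")

    opens, closes = text.count("<think>"), text.count("</think>")
    if opens > closes:
        # Unclosed <think>: close it before <answer>, or at the very end.
        idx = text.find("<answer>")
        if idx == -1:
            text = text + "\n</think>"
        else:
            text = text[:idx] + "</think>\n" + text[idx:]
    elif closes > opens:
        # Orphan </think>: open the block right after </scene> if possible,
        # otherwise drop the stray closing tag.
        idx = text.find("</scene>")
        if idx != -1:
            insert_at = idx + len("</scene>")
            text = text[:insert_at] + "\n<think>" + text[insert_at:]
        else:
            text = text.replace("</think>", "", 1)
    return text
```

The sketch handles a single missing tag per response; responses with several unbalanced tags would need a stricter parser or could simply be left to fail the format check.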

## Citation

```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
  title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
  author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
  year={2025},
  eprint={2511.07403},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.07403},
}
```

## Links

- 📄 **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- 🌐 **Project Page**: [hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- 💻 **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- 🤗 **Dataset**: [hunarbatra/STVQA-7K](https://huggingface.co/datasets/hunarbatra/STVQA-7K)
- 🤗 **7B variant**: [hunarbatra/SpatialThinker-7B](https://huggingface.co/hunarbatra/SpatialThinker-7B)