---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
- mixture-of-experts
base_model: Qwen/Qwen3-VL-30B-A3B-Instruct
pipeline_tag: image-text-to-text
---
# SpatialThinker-30B
<p align="center">
<a href="https://arxiv.org/abs/2511.07403">
<img src="https://img.shields.io/badge/arXiv-2511.07403-b31b1b.svg" alt="arXiv">
</a>
<a href="https://hunarbatra.com/SpatialThinker">
<img src="https://img.shields.io/badge/Project%20Page-blue.svg" alt="Project Page">
</a>
<a href="https://github.com/hunarbatra/SpatialThinker">
<img src="https://img.shields.io/badge/GitHub-Repository-black.svg" alt="GitHub">
</a>
</p>
**SpatialThinker-30B** is a 30B-parameter Mixture-of-Experts (3B active) multimodal large language model trained with reinforcement learning to integrate structured spatial grounding with multi-step reasoning. It scales the SpatialThinker method to the Qwen3-VL-30B-A3B-Instruct base, retaining the same training recipe: a four-tag scene-graph reasoning format and a dense spatial reward over format, count, accuracy, and grounding.
## Model Description
- **Base Model**: Qwen3-VL-30B-A3B-Instruct (Mixture-of-Experts; ~3B active parameters)
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards via Thinking Machines' Tinker
- **Training Data**: STVQA-7K (7,587 spatial VQA samples)
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz
## Key Features
- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **MoE Efficiency**: 30B total parameters with only ~3B active per token, for quality comparable to dense 30B models at a fraction of the compute
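As a rough illustration of the scene subgraphs and grounding signal described above, the sketch below models objects with bounding boxes and relations, and scores box overlap with plain intersection-over-union. The class and field names are hypothetical, and plain IoU is only a stand-in: the exact serialization and the grounding metric used in training are not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str                                  # hypothetical label, e.g. "cat"
    bbox: tuple[float, float, float, float]    # (x1, y1, x2, y2) in pixels


@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A question-focused subgraph keeps only what the question needs.
graph = SceneGraph(
    objects=[
        SceneObject("cat", (120, 80, 260, 200)),
        SceneObject("couch", (40, 150, 500, 380)),
    ],
    relations=[Relation("cat", "on top of", "couch")],
)
```

Comparing each predicted box against a reference box with `iou` gives a value in [0, 1] that could feed a grounding term, while `len(graph.objects)` could feed a count term.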
## Inference Template
The prompt uses the same four-tag format as SpatialThinker-7B:
```
You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.
Image size: {Width} x {Height}
```
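A response that follows this template can be split into its four sections with a regular expression. The sketch below is a lenient check, not the format reward used in training: `parse_response` is a hypothetical helper, and it assumes the tags appear once each, in template order, separated only by whitespace.

```python
import re

TAGS = ("observe", "scene", "think", "answer")

# One non-greedy capture group per tag, in the order the template requests.
PATTERN = re.compile(
    r"\s*".join(rf"<{t}>(.*?)</{t}>" for t in TAGS),
    re.DOTALL,
)


def parse_response(text):
    """Return a dict of tag -> content, or None if the format is not followed."""
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    return {tag: body.strip() for tag, body in zip(TAGS, match.groups())}


response = (
    "<observe>A cat is lying on a couch.</observe>\n"
    "<scene>{}</scene>\n"
    "<think>The cat is above the cushions.</think>\n"
    "<answer> (A) on top of </answer>"
)
parsed = parse_response(response)
print(parsed["answer"] if parsed else "format not followed")
```

Returning `None` for malformed output keeps the caller's handling explicit, for example falling back to a default answer or counting the sample as a format failure.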
## Usage
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# The Auto class resolves the architecture from the checkpoint config,
# which matters because the base model is a Mixture-of-Experts variant.
model = AutoModelForImageTextToText.from_pretrained(
    "hunarbatra/SpatialThinker-30B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("hunarbatra/SpatialThinker-30B")

# Load image
image = Image.open("your_image.jpg")
width, height = image.size

# Prepare prompt with template
template = f"""You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.
Image size: {width} x {height}"""

question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": template + "\n\n" + question},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)

# Drop the prompt tokens so that only the model's response is decoded.
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(output)
```
## Training Details
- **Framework**: Thinking Machines' [Tinker](https://thinkingmachines.ai/tinker/) (LoRA on remote H100 cluster)
- **Steps**: 75
- **Batch size**: 16 prompts × 8 rollouts = 128 generations/step
- **Optimizer**: AdamW, lr=1e-6, KL coefficient=1e-2 (low_var_kl)
- **LoRA**: rank=64 on the language tower
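To make the batch arithmetic and the "group relative" part of GRPO concrete, the sketch below normalizes rewards within one prompt's group of 8 rollouts. It is a minimal illustration of the standard GRPO advantage, not the Tinker training code: `group_advantages`, the `eps` value, and the use of the population standard deviation are assumptions.

```python
import statistics

PROMPTS_PER_STEP = 16    # from the list above
ROLLOUTS_PER_PROMPT = 8  # from the list above
STEPS = 75               # from the list above


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within a single prompt's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


generations_per_step = PROMPTS_PER_STEP * ROLLOUTS_PER_PROMPT
total_generations = generations_per_step * STEPS

# Hypothetical rewards for the 8 rollouts of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
print(generations_per_step, total_generations)
```

Because each rollout is scored relative to its own group, a prompt where every rollout earns the same reward contributes zero advantage and therefore no gradient signal.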
The model was trained with several rollout-side fixes that lift the Qwen3-VL-Instruct base's format-pass rate from ~78% to ~96% during training:
- Forced `<observe>\n` assistant prefix (matches the four-tag schema the model is trained to produce)
- Postprocess rewrites for `<tool_call>` → `<think>` (the Instruct base's tool-use prior occasionally leaks)
- Repairs for orphan/unclosed `<think>` tags
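The three fixes above could look roughly like the following. This is a hedged sketch, not the actual training-time postprocessing: `repair_rollout` is a hypothetical helper, and the repair rules (closing an unclosed `<think>` before `<answer>`, opening an orphan `</think>` right after `</scene>`) are assumptions about what such repairs might do.

```python
ASSISTANT_PREFIX = "<observe>\n"  # forced prefix from the list above


def repair_rollout(completion):
    """Apply the three fixes to the text generated after the forced prefix."""
    text = ASSISTANT_PREFIX + completion

    # Tool-use prior leak: rewrite <tool_call> tags as <think> tags.
    text = text.replace("<tool_call>", "<think>").replace("</tool_call>", "</think>")

    has_open = "<think>" in text
    has_close = "</think>" in text
    if has_open and not has_close:
        # Unclosed <think>: close it just before the answer, or at the end.
        answer_idx = text.find("<answer>")
        if answer_idx == -1:
            text = text + "\n</think>"
        else:
            text = text[:answer_idx] + "</think>\n" + text[answer_idx:]
    elif has_close and not has_open:
        # Orphan </think>: open the block right after the scene graph,
        # or drop the stray tag if no scene graph precedes it.
        close_idx = text.find("</think>")
        scene_end = text.rfind("</scene>", 0, close_idx)
        if scene_end == -1:
            text = text.replace("</think>", "", 1)
        else:
            cut = scene_end + len("</scene>")
            text = text[:cut] + "\n<think>" + text[cut:]
    return text


leaky = "A cat.</observe>\n<scene>{}</scene>\n<tool_call>It sits on top.</tool_call>\n<answer> (A) on top of </answer>"
print(repair_rollout(leaky))
```

Applying repairs before the reward is computed means a rollout with sound reasoning is not penalized for a purely syntactic slip.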
## Citation
```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
year={2025},
eprint={2511.07403},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.07403},
}
```
## Links
- **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- **Project Page**: [hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- **Dataset**: [hunarbatra/STVQA-7K](https://huggingface.co/datasets/hunarbatra/STVQA-7K)
- **7B variant**: [hunarbatra/SpatialThinker-7B](https://huggingface.co/hunarbatra/SpatialThinker-7B)