---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
base_model: Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

# SpatialThinker-7B

[arXiv](https://arxiv.org/abs/2511.07403) | [Project Page](https://hunarbatra.com/SpatialThinker) | [GitHub](https://github.com/hunarbatra/SpatialThinker)

**SpatialThinker-7B** is a 3D-aware multimodal large language model (MLLM) trained with reinforcement learning to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, then reasoning towards an answer guided by dense spatial rewards.

## Model Description

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards
- **Training Data**: STVQA-7K (7,587 spatial VQA samples)
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz

## Key Features

- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **Outperforms GPT-4o**: On spatial understanding benchmarks while using only 7K training samples

## Inference Template

Use the following template for inference:

```
You FIRST observe the image in <observe> tags, then visualise the relevant scene graph in <scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> tags and then provide the final answer. The final answer MUST BE put within <answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer>(A) cat</answer>.
Image size: {Width} x {Height}
```

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OX-PIXL/SpatialThinker-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("OX-PIXL/SpatialThinker-7B")

# Load image
image = Image.open("your_image.jpg")
width, height = image.size

# Prepare prompt with template
template = f"""You FIRST observe the image in <observe> tags, then visualise the relevant scene graph in <scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> tags and then provide the final answer. The final answer MUST BE put within <answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer>(A) cat</answer>.
Image size: {width} x {height}"""

question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": template + "\n\n" + question},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```

## Citation

```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
      title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
      author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
      year={2025},
      eprint={2511.07403},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.07403},
}
```

## Links

- 📄 **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- 🌐 **Project Page**:
[hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- 💻 **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- 🤗 **Dataset**: [OX-PIXL/STVQA-7K](https://huggingface.co/datasets/OX-PIXL/STVQA-7K)
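## Parsing the Output

Since the model emits its reasoning in tagged sections (observation, scene graph, internal monologue, final answer), it is often useful to split the decoded response back into those parts. Below is a minimal sketch using a regex over the tagged spans; the tag names `observe`, `scene`, `think`, and `answer` are assumptions based on the inference template above, so adjust them if your model's actual output differs. The `demo` string is a hypothetical example, not real model output.

```python
import re


def parse_response(text: str) -> dict:
    """Split a SpatialThinker-style response into its tagged sections.

    NOTE: the tag names here are assumptions inferred from the inference
    template; verify them against your model's real output.
    """
    sections = {}
    for tag in ("observe", "scene", "think", "answer"):
        # Non-greedy match so each tag captures only its own section.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        sections[tag] = m.group(1).strip() if m else None
    return sections


# Hypothetical response for illustration only.
demo = (
    "<observe>A cat on a couch.</observe>"
    "<scene>cat - on top of - couch</scene>"
    "<think>The cat rests on the couch cushions.</think>"
    "<answer>(A) on top of</answer>"
)
parsed = parse_response(demo)
print(parsed["answer"])  # -> (A) on top of
```

Keeping the sections separate makes it easy to log the scene graph for inspection while returning only the `answer` span to downstream code.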