---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
base_model: Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

# SpatialThinker-7B

[arXiv](https://arxiv.org/abs/2511.07403) · [Project Page](https://hunarbatra.com/SpatialThinker) · [GitHub](https://github.com/hunarbatra/SpatialThinker)

**SpatialThinker-7B** is a 3D-aware multimodal large language model (MLLM) trained with reinforcement learning to integrate structured spatial grounding with multi-step reasoning. The model emulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, then reasoning toward an answer under dense spatial rewards.

## Model Description

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards
- **Training Data**: STVQA-7K (7,587 spatial VQA samples)
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz

## Key Features

- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **Outperforms GPT-4o**: On spatial understanding benchmarks while using only 7K training samples

## Inference Template

Use the following template for inference:

```
You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene_graph> </scene_graph> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>. Image size: {Width} x {Height}
```

## Output Format

The model generates structured output with four components:

1. **`<observe>`**: Scene description covering relevant objects
2. **`<scene_graph>`**: JSON scene graph with objects (`id`, `bbox`) and relationships (`subject`, `predicate`, `object`)
3. **`<think>`**: Step-by-step reasoning as an internal monologue
4. **`<answer>`**: Final answer with option letter and text

### Example Output

```
<observe>The image shows a living room with a couch, a coffee table, and a cat sitting on the floor.</observe>
<scene_graph>
{
  "objects": [
    {"id": "couch.1", "bbox": [50, 100, 400, 350]},
    {"id": "cat.1", "bbox": [200, 300, 280, 400]},
    {"id": "table.1", "bbox": [150, 250, 350, 320]}
  ],
  "relationships": [
    {"subject": "cat.1", "predicate": "in front of", "object": "couch.1"},
    {"subject": "cat.1", "predicate": "beside", "object": "table.1"}
  ]
}
</scene_graph>
<think>Looking at the scene graph, the cat is positioned in front of the couch and beside the coffee table. The bounding box coordinates show the cat at y=300-400 while the couch extends to y=350, confirming the cat is on the floor in front of the couch.</think>
<answer>(B) in front of the couch</answer>
```

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OX-PIXL/SpatialThinker-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("OX-PIXL/SpatialThinker-7B")

# Load image
image = Image.open("your_image.jpg")
width, height = image.size

# Prepare prompt with template
template = f"""You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene_graph> </scene_graph> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>. Image size: {width} x {height}"""

question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": template + "\n\n" + question},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```

## Evaluation Results

SpatialThinker-7B achieves state-of-the-art performance on spatial reasoning benchmarks:

| Benchmark | Result |
|-----------|--------|
| CV-Bench (3D) | Strong performance |
| BLINK-Spatial | Outperforms GPT-4o |
| SpatialBench | SOTA results |
| RealWorldQA | Competitive |

See the [paper](https://arxiv.org/abs/2511.07403) for detailed results.

## Citation

```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
  title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
  author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
  year={2025},
  eprint={2511.07403},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.07403},
}
```

## Links

- 📄 **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- 🌐 **Project Page**: [hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- 💻 **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- 🤗 **Dataset**: [OX-PIXL/STVQA-7K](https://huggingface.co/datasets/OX-PIXL/STVQA-7K)
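## Parsing the Structured Output

Downstream code usually needs only the final choice and the scene graph from the decoded output. Below is a minimal sketch of how those pieces could be extracted, assuming the tag names shown in the inference template; `extract_tag` and `extract_scene_graph` are illustrative helpers, not part of the released code.

```python
import json
import re


def extract_tag(text: str, tag: str):
    """Return the content of the first <tag>...</tag> span, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def extract_scene_graph(text: str):
    """Parse the JSON scene graph emitted inside <scene_graph> tags."""
    raw = extract_tag(text, "scene_graph")
    if raw is None:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Model output is free-form text; tolerate malformed JSON.
        return None


# Toy output shaped like the "Output Format" section above
output = (
    '<scene_graph>{"objects": [{"id": "cat.1", "bbox": [200, 300, 280, 400]}], '
    '"relationships": []}</scene_graph>'
    "<think>The cat sits below the couch bounding box.</think>"
    "<answer>(B) in front of the couch</answer>"
)
print(extract_tag(output, "answer"))                    # → (B) in front of the couch
print(extract_scene_graph(output)["objects"][0]["id"])  # → cat.1
```

Returning `None` rather than raising keeps evaluation loops running when the model occasionally skips a tag or emits malformed JSON.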