---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
- mixture-of-experts
base_model: Qwen/Qwen3-VL-30B-A3B-Instruct
pipeline_tag: image-text-to-text
---

# SpatialThinker-30B

**SpatialThinker-30B** is a 30B-parameter Mixture-of-Experts (3B active) multimodal large language model trained with reinforcement learning to integrate structured spatial grounding with multi-step reasoning. It scales the SpatialThinker method to the Qwen3-VL-30B-A3B-Instruct base, retaining the same training recipe: a four-tag scene-graph reasoning format and a dense spatial reward over format, count, accuracy, and grounding.

## Model Description

- **Base Model**: Qwen3-VL-30B-A3B-Instruct (Mixture-of-Experts; ~3B active parameters)
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards via Thinking Machines' Tinker
- **Training Data**: STVQA-7K (7,587 spatial VQA samples)
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz

## Key Features

- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **MoE Efficiency**: 30B total parameters with only ~3B active per token, giving quality comparable to dense 30B models at a fraction of the compute

## Inference Template

Same four-tag format as SpatialThinker-7B:

```
You FIRST observe the image in tags, then visualise the relevant scene graph in tags, followed by thinking about the reasoning process as an internal monologue within tags and then provide the final answer. The final answer MUST BE put within tags, and only return the final choice including the correct option and answer within the answer tags, e.g., (A) cat .
Image size: {Width} x {Height}
```

## Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# AutoModelForImageTextToText picks the architecture from the checkpoint config,
# so the Mixture-of-Experts Qwen3-VL class is used for this A3B-based model.
model = AutoModelForImageTextToText.from_pretrained(
    "hunarbatra/SpatialThinker-30B",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("hunarbatra/SpatialThinker-30B")

# Load image
image = Image.open("your_image.jpg")
width, height = image.size

# Prepare prompt with template
template = f"""You FIRST observe the image in tags, then visualise the relevant scene graph in tags, followed by thinking about the reasoning process as an internal monologue within tags and then provide the final answer. The final answer MUST BE put within tags, and only return the final choice including the correct option and answer within the answer tags, e.g., (A) cat .
Image size: {width} x {height}"""

question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": template + "\n\n" + question},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens so that only the newly generated response is decoded
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(output)
```

## Training Details

- **Framework**: Thinking Machines' [Tinker](https://thinkingmachines.ai/tinker/) (LoRA on remote H100 cluster)
- **Steps**: 75
- **Batch size**: 16 prompts × 8 rollouts = 128 generations/step
- **Optimizer**: AdamW, lr=1e-6, KL coefficient=1e-2 (low_var_kl)
- **LoRA**: rank=64 on the language tower

The model was trained with several rollout-side fixes that lift the Qwen3-VL-Instruct base's format-pass rate from ~78% to ~96% during training:

- Forced `\n` assistant prefix (matches the four-tag schema the model is trained to produce)
- Postprocess rewrites for `` → `` (the Instruct base's tool-use prior occasionally leaks)
- Repairs for orphan/unclosed `` tags

## Citation

```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
      title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
      author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
      year={2025},
      eprint={2511.07403},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.07403},
}
```

## Links

- 📄 **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- 🌐 **Project Page**: [hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- 💻 **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- 🤗 **Dataset**: [hunarbatra/STVQA-7K](https://huggingface.co/datasets/hunarbatra/STVQA-7K)
- 🤗 **7B variant**: [hunarbatra/SpatialThinker-7B](https://huggingface.co/hunarbatra/SpatialThinker-7B)
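
Because the model returns its final choice inside answer tags, downstream evaluation usually needs to pull that choice out of the generated text. The snippet below is a minimal sketch of one way to do this, not part of the released code. It assumes the answer is wrapped in a literal `<answer>...</answer>` pair; that tag name, and the helper names `extract_answer` and `extract_choice`, are illustrative assumptions and should be checked against the prompt template actually used.

```python
import re
from typing import Optional

# Assumption: the final answer is wrapped in <answer>...</answer>.
# The tag name is illustrative; adjust it to match the real template.
ANSWER_TAG = "answer"


def extract_answer(output: str, tag: str = ANSWER_TAG) -> Optional[str]:
    """Return the text inside the last <tag>...</tag> pair, or None if absent."""
    pattern = re.compile(
        rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL
    )
    matches = pattern.findall(output)
    return matches[-1].strip() if matches else None


def extract_choice(answer: Optional[str]) -> Optional[str]:
    """Pull the option letter out of an answer such as '(A) on top of'."""
    if answer is None:
        return None
    match = re.match(r"\(?([A-Z])\)", answer)
    return match.group(1) if match else None


# Hypothetical model output, used only to exercise the helpers.
sample = "<think>The cat sits on the cushions.</think> <answer> (A) on top of </answer>"
answer = extract_answer(sample)
print(answer, extract_choice(answer))
```

Taking the last match rather than the first is a deliberate choice here: if the reasoning text happens to mention an answer tag earlier, the final occurrence is the one more likely to hold the committed answer. Outputs with no closed answer tag return `None`, so they can be counted as format failures instead of being silently mis-scored.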