---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
- 3d
- spatial-reasoning
- vlm
- qwen2.5-vl
---
# 3DThinker-Mindcube

This repository contains the stage-1 model checkpoint for **3DThinker**, as presented in the paper [Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views](https://huggingface.co/papers/2510.18632).

3DThinker is a framework that enables Vision-Language Models (VLMs) to exploit the geometric information embedded in images for 3D spatial reasoning, simulating human-like spatial imagination without requiring explicit 3D priors as input or labeled 3D training data.

- **Paper:** [Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views](https://huggingface.co/papers/2510.18632)
- **Code:** [GitHub - zhangquanchen/3DThinker](https://github.com/zhangquanchen/3DThinker)
## Introduction

* The model was trained on **MindCube_Train** and evaluated on **MindCube-Tiny**.
* This checkpoint corresponds to **stage 1** training (supervised alignment of 3D latents) on top of Qwen2.5-VL-3B.
* Note that the results in Tab. 2 of the paper were obtained with a different training-data configuration.
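Since the checkpoint is based on Qwen2.5-VL, it can presumably be loaded through the standard `transformers` Qwen2.5-VL classes. The sketch below is a minimal, hedged example: the model id and image paths are placeholders (this repository's actual id is not restated here), and it assumes a recent `transformers` release with Qwen2.5-VL support.

```python
# Hedged usage sketch: treat the checkpoint as a standard Qwen2.5-VL model.
# "MODEL_ID" and the image paths below are placeholders, not part of this repo.

def build_messages(image_paths, question):
    """Build a Qwen2.5-VL chat message interleaving several views with a question."""
    content = [{"type": "image", "image": path} for path in image_paths]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

def load_model(model_id):
    """Load model and processor; imports are deferred so this sketch stays light."""
    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor

# Example (requires the actual checkpoint and your own multi-view images):
# model, processor = load_model("MODEL_ID")
# messages = build_messages(["view_0.jpg", "view_1.jpg"],
#                           "From these views, which object is closer to the camera?")
# text = processor.apply_chat_template(messages, tokenize=False,
#                                      add_generation_prompt=True)
# inputs = processor(text=[text], images=..., return_tensors="pt").to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=256)
# print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The commented section is left inert because running it requires downloading the full checkpoint; only the message-building and loading helpers are defined above.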
## Bibtex

If you find 3DThinker helpful for your work, please cite:

```bibtex
@article{chen2025think,
  title={Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views},
  author={Chen, Zhangquan and Zhang, Manyuan and Yu, Xinlei and Luo, Xufang and Sun, Mingze and Pan, Zihao and Feng, Yan and Pei, Peng and Cai, Xunliang and Huang, Ruqi},
  journal={arXiv preprint arXiv:2510.18632},
  year={2025}
}
```