---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Reasoning in Space via Grounding in the World

We present **Grounded-Spatial Reasoner (GS-Reasoner)**, the first 3D-LLM that bridges 3D visual grounding and spatial reasoning, as explored in the paper [Reasoning in Space via Grounding in the World](https://huggingface.co/papers/2510.13800).

The goal of GS-Reasoner is to explore effective spatial representations that bridge the gap between 3D visual grounding and spatial reasoning. Existing 3D LLMs lack a unified 3D representation capable of jointly capturing semantic and geometric information, which leads to either poor grounding performance or excessive reliance on external modules. GS-Reasoner addresses this with a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation. This enables GS-Reasoner to perform autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning.

**Project Page**: [https://yiming-cc.github.io/gs-reasoner/](https://yiming-cc.github.io/gs-reasoner/)

**Code**: [https://github.com/WU-CVGL/GS-Reasoner](https://github.com/WU-CVGL/GS-Reasoner)
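To make the dual-path pooling idea concrete, below is a minimal, hypothetical sketch of the mechanism as described above: per-point geometric features (e.g., from a point-cloud encoder such as Sonata) are pooled into the image-patch grid, aligned with the patches' semantic features along one path and with their positional embeddings along the other, then fused into a single patch-based 3D token sequence. All module names, dimensions, and fusion details here are illustrative assumptions, not the released implementation; see the paper and repository for the actual design.

```python
# Hypothetical sketch of dual-path pooling (illustrative only, not the
# released GS-Reasoner code): pool per-point geometric features into the
# image-patch grid, align them with semantic and positional cues along two
# separate paths, then fuse into one patch-based 3D token per image patch.
import torch
import torch.nn as nn

class DualPathPooling(nn.Module):
    def __init__(self, geo_dim=384, patch_dim=1024, out_dim=1024):
        super().__init__()
        self.sem_proj = nn.Linear(geo_dim + patch_dim, out_dim)  # semantic path
        self.pos_proj = nn.Linear(geo_dim + patch_dim, out_dim)  # positional path
        self.fuse = nn.Linear(2 * out_dim, out_dim)

    def forward(self, geo_feats, patch_ids, sem_feats, pos_embeds):
        # geo_feats:  (N, geo_dim)   per-point geometric features
        # patch_ids:  (N,)           index of the image patch each point projects to
        # sem_feats:  (P, patch_dim) semantic features of the P image patches
        # pos_embeds: (P, patch_dim) positional embeddings of the P patches
        P = sem_feats.shape[0]
        pooled = torch.zeros(P, geo_feats.shape[1], device=geo_feats.device)
        counts = torch.zeros(P, device=geo_feats.device)
        # Average-pool point features into their corresponding patches.
        pooled.index_add_(0, patch_ids, geo_feats)
        counts.index_add_(0, patch_ids, torch.ones_like(patch_ids, dtype=torch.float))
        pooled = pooled / counts.clamp(min=1).unsqueeze(-1)
        # Align pooled geometry with semantic and positional cues, then fuse.
        sem_path = self.sem_proj(torch.cat([pooled, sem_feats], dim=-1))
        pos_path = self.pos_proj(torch.cat([pooled, pos_embeds], dim=-1))
        return self.fuse(torch.cat([sem_path, pos_path], dim=-1))  # (P, out_dim)

# Toy usage: 5000 points projected onto 576 image patches.
pool = DualPathPooling()
tokens = pool(torch.randn(5000, 384),
              torch.randint(0, 576, (5000,)),
              torch.randn(576, 1024),
              torch.randn(576, 1024))
print(tokens.shape)  # torch.Size([576, 1024])
```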
## Model Weights

We provide two pretrained model checkpoints:

* **[GS-Reasoner](https://huggingface.co/ymccccc/GS-Reasoner)** – the main model used in our paper, producing more deterministic chain-of-thought reasoning.
* **[GS-Reasoner-Diverse](https://huggingface.co/ymccccc/GS-Reasoner-Diverse)** – a variant that generates more diverse chain-of-thought outputs with only a minor performance drop (less than 1.0 on VSI-Bench).

## Sample Usage

This section shows how to run inference with our pretrained grounding models. The model can be loaded using classes from the `pae` library (installable from the [GitHub repository](https://github.com/WU-CVGL/GS-Reasoner)).

**Notes:** Our models accept images of any size as input. The model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions, as shown in the sketch below.
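Here is a minimal helper for that coordinate conversion (a sketch under the stated 0-1000 convention; the function name and box format are our own assumptions, not part of the released API):

```python
# Illustrative helper (not part of the released API): convert a predicted box
# from the model's 0-1000 relative coordinates back to pixel coordinates.
def denormalize_box(box, image_width, image_height):
    """box = (x1, y1, x2, y2) given in the 0-1000 relative range."""
    x1, y1, x2, y2 = box
    return (
        x1 / 1000.0 * image_width,
        y1 / 1000.0 * image_height,
        x2 / 1000.0 * image_width,
        y2 / 1000.0 * image_height,
    )

# Example: a predicted box on a 1920x1080 image.
print(denormalize_box((250, 100, 750, 900), 1920, 1080))
# -> (480.0, 108.0, 1440.0, 972.0)
```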
First, set up your environment as described in the [official GitHub repository's Setup section](https://github.com/WU-CVGL/GS-Reasoner#setup). This typically involves:

```bash
conda create -n gs-reasoner python=3.11 -y
conda activate gs-reasoner

git clone git@github.com:WU-CVGL/GS-Reasoner.git
cd GS-Reasoner
pip install -e .

# Install submodule dependencies as well, e.g., for Sonata and VSI-Bench evaluation
cd llava/submodules/sonata && pip install -r requirements.txt && cd ../../..
cd llava/submodules/lmms_eval && pip install -r requirements.txt && cd ../../..
```

Inference code example:

```python
import os
from types import SimpleNamespace

from accelerate import Accelerator
from tqdm import tqdm

from pae.models import LlavaAgent
from pae.environment.webgym import BatchedWebEnv

# ============= Instantiate the agent =============
config = SimpleNamespace(
    use_lora=False,
    use_q4=False,  # our 34B model is quantized to 4-bit; set this to True if you are using the 34B model
    use_anyres=False,
    temperature=1.0,
    max_new_tokens=512,
    train_vision=False,
    num_beams=1,
)

accelerator = Accelerator()
agent = LlavaAgent(
    policy_lm="ymccccc/GS-Reasoner",  # or "ymccccc/GS-Reasoner-Diverse"
    device=accelerator.device,
    accelerator=accelerator,
    config=config,
)

# ============= Instantiate the environment =============
test_tasks = [{
    "web_name": "Google Map",
    "id": "0",
    "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
    "web": "https://www.google.com/maps/",
}]
save_path = "xxx"  # Placeholder, adapt for your needs
test_env = BatchedWebEnv(
    tasks=test_tasks,
    do_eval=False,
    download_dir=os.path.join(save_path, "test_driver", "download"),
    output_dir=os.path.join(save_path, "test_driver", "output"),
    batch_size=1,
    max_iter=10,
)

# For you to check the images and actions
image_histories = []   # stores the history of the paths of images
action_histories = []  # stores the history of actions

results = test_env.reset()
image_histories.append(results[0][0]["image"])

observations = [r[0] for r in results]
actions = agent.get_action(observations)
action_histories.append(actions[0])

dones = None
for _ in tqdm(range(3)):
    if dones is not None and all(dones):
        break
    results = test_env.step(actions)
    image_histories.append(results[0][0]["image"])
    observations = [r[0] for r in results]
    actions = agent.get_action(observations)
    action_histories.append(actions[0])
    dones = [r[2] for r in results]

print("Done!")
print("image_histories: ", image_histories)
print("action_histories: ", action_histories)
```

## Citation

If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@misc{chen2025gs-reasoner,
      title={Reasoning in Space via Grounding in the World},
      author={Yiming Chen and Zekun Qi and Wenyao Zhang and Xin Jin and Li Zhang and Peidong Liu},
      year={2025},
      eprint={2510.13800},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13800},
}
```