---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---
# Reasoning in Space via Grounding in the World
We present Grounded-Spatial Reasoner (GS-Reasoner), the first 3D LLM to bridge 3D visual grounding and spatial reasoning, introduced in the paper [Reasoning in Space via Grounding in the World](https://arxiv.org/abs/2510.13800).

The goal of GS-Reasoner is to explore effective spatial representations that bridge the gap between 3D visual grounding and spatial reasoning. Existing 3D LLMs lack a unified 3D representation that jointly captures semantic and geometric information, which leads to either poor grounding performance or excessive reliance on external modules. GS-Reasoner addresses this with a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image-patch-based 3D representation. This enables GS-Reasoner to perform autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning.
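For intuition only, here is a toy PyTorch sketch of pooling per-point geometric features into the image-patch grid along two paths, one fused with patch semantics and one with patch positions. Every shape, name, and the fusion itself are illustrative assumptions, not the paper's exact design; see the paper and repository for the real mechanism.

```python
import torch

def dual_path_pooling(point_feats, point2patch, sem_feats, pos_embeds):
    """Toy sketch: pool per-point geometric features into P image patches.

    point_feats: (N, C) geometric features of N 3D points
    point2patch: (N,)   index of the image patch each point projects onto
    sem_feats:   (P, C) semantic features of the P image patches
    pos_embeds:  (P, C) positional embeddings of the P image patches
    """
    P, C = sem_feats.shape
    pooled = torch.zeros(P, C).index_add_(0, point2patch, point_feats)
    counts = torch.zeros(P).index_add_(0, point2patch,
                                       torch.ones_like(point2patch, dtype=torch.float))
    pooled = pooled / counts.clamp(min=1).unsqueeze(-1)  # mean geometry per patch
    geo_sem = sem_feats + pooled   # path 1: geometry aligned with patch semantics
    geo_pos = pos_embeds + pooled  # path 2: geometry aligned with patch positions
    return geo_sem + geo_pos       # one unified 3D token per image patch

# Toy usage: 5000 points pooled into 576 patch tokens of width 1024
tokens = dual_path_pooling(torch.randn(5000, 1024), torch.randint(0, 576, (5000,)),
                           torch.randn(576, 1024), torch.randn(576, 1024))
print(tokens.shape)  # torch.Size([576, 1024])
```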
Project Page: https://yiming-cc.github.io/gs-reasoner/
Code: https://github.com/WU-CVGL/GS-Reasoner
## Model Weights
We provide two pretrained model checkpoints:
- **GS-Reasoner** – the main model used in our paper, which produces more deterministic chain-of-thought reasoning.
- **GS-Reasoner-Diverse** – a variant that generates more diverse chain-of-thought outputs at the cost of a minor performance drop (less than 1.0 point on VSI-Bench).
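Both checkpoints can be fetched from the Hugging Face Hub, for example with `huggingface_hub` (repository ids as listed above):

```python
from huggingface_hub import snapshot_download

# Download either checkpoint; returns the local cache path
local_dir = snapshot_download("ymccccc/GS-Reasoner")  # or "ymccccc/GS-Reasoner-Diverse"
print(local_dir)
```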
## Sample Usage
This section provides instructions on how to run inference with our pre-trained grounding models using the code from the official GitHub repository.

Notes: Our models accept images of any size as input. Model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions, as shown below.
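For example, converting an output box from 0-1000 relative coordinates back to pixels (the box values and image size below are hypothetical):

```python
def to_pixels(box, width, height):
    """Convert an [x1, y1, x2, y2] box in 0-1000 relative coords to pixel coords."""
    x1, y1, x2, y2 = box
    return (x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height)

print(to_pixels([250, 400, 750, 900], width=1920, height=1080))
# (480.0, 432.0, 1440.0, 972.0)
```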
First, set up your environment as described in the official GitHub repository's Setup section. This typically involves:
```bash
conda create -n gs-reasoner python=3.11 -y
conda activate gs-reasoner

git clone git@github.com:WU-CVGL/GS-Reasoner.git
cd GS-Reasoner
pip install -e .

# Install submodule dependencies as well, e.g., for Sonata and VSI-Bench evaluation
cd llava/submodules/sonata && pip install -r requirements.txt && cd ../../..
cd llava/submodules/lmms_eval && pip install -r requirements.txt && cd ../../..
```
Inference code example:
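The following is a minimal sketch, assuming the checkpoint can be loaded through the `transformers` image-text-to-text pipeline declared in this card's metadata; the image URL and question are placeholders, and `trust_remote_code=True` is assumed in case the checkpoint ships custom modeling code. For the full 3D pipeline (point-cloud inputs, grounding output parsing), please follow the official repository.

```python
from transformers import pipeline

# Assumption: the checkpoint is loadable via the image-text-to-text pipeline;
# if it requires the repository's custom llava code, load it through the repo instead.
pipe = pipeline(
    "image-text-to-text",
    model="ymccccc/GS-Reasoner",  # or "ymccccc/GS-Reasoner-Diverse"
    trust_remote_code=True,
)

# Placeholder image and question -- replace with your own scene frames and query.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scene.jpg"},
            {"type": "text", "text": "Where is the chair relative to the table?"},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=512)
print(outputs[0]["generated_text"])
```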
## Citation
If you find our work helpful or inspiring, please consider citing it:
```bibtex
@misc{chen2025gs-reasoner,
      title={Reasoning in Space via Grounding in the World},
      author={Yiming Chen and Zekun Qi and Wenyao Zhang and Xin Jin and Li Zhang and Peidong Liu},
      year={2025},
      eprint={2510.13800},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13800},
}
```