---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Reasoning in Space via Grounding in the World

We present **Grounded-Spatial Reasoner (GS-Reasoner)**, the first 3D-LLM that bridges 3D visual grounding and spatial reasoning, as explored in the paper [Reasoning in Space via Grounding in the World](https://huggingface.co/papers/2510.13800).

The goal of GS-Reasoner is to explore effective spatial representations that bridge the gap between 3D visual grounding and spatial reasoning. Existing 3D-LLMs lack a unified 3D representation that jointly captures semantic and geometric information, leading to either poor grounding performance or excessive reliance on external modules. GS-Reasoner addresses this with a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation. This allows GS-Reasoner to perform autoregressive grounding entirely without external modules while matching the performance of state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning.
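
To make the dual-path pooling idea concrete, below is a minimal PyTorch sketch of how per-point geometric features could be pooled into the image-patch grid and fused along two projection paths. All names, shapes, and the fusion scheme are illustrative assumptions made for this card; see the paper and code for the actual design.

```python
import torch
import torch.nn as nn

class DualPathPooling(nn.Module):
    """Illustrative sketch (not the released implementation): pool per-point
    geometric features into the image-patch grid, then fuse them along a
    semantic-alignment path and a positional-alignment path."""

    def __init__(self, geo_dim: int, patch_dim: int):
        super().__init__()
        self.sem_proj = nn.Linear(geo_dim, patch_dim)  # path aligned with semantic features
        self.pos_proj = nn.Linear(geo_dim, patch_dim)  # path aligned with positional cues

    def forward(self, geo_feats, patch_ids, num_patches):
        # geo_feats:   (N, geo_dim) per-point geometric features
        # patch_ids:   (N,) index of the image patch each 3D point projects to
        # num_patches: total number of image patches
        pooled = torch.zeros(num_patches, geo_feats.size(-1), device=geo_feats.device)
        counts = torch.zeros(num_patches, 1, device=geo_feats.device)
        pooled.index_add_(0, patch_ids, geo_feats)
        counts.index_add_(0, patch_ids, torch.ones_like(geo_feats[:, :1]))
        pooled = pooled / counts.clamp(min=1)  # mean-pool the points in each patch
        # fuse both paths into a single patch-based 3D token
        return self.sem_proj(pooled) + self.pos_proj(pooled)

# assumed usage: 4096 points projected onto 576 patches (24x24 grid)
pool = DualPathPooling(geo_dim=64, patch_dim=128)
tokens = pool(torch.randn(4096, 64), torch.randint(0, 576, (4096,)), 576)
print(tokens.shape)  # torch.Size([576, 128])
```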

**Project Page**: [https://yiming-cc.github.io/gs-reasoner/](https://yiming-cc.github.io/gs-reasoner/)
**Code**: [https://github.com/WU-CVGL/GS-Reasoner](https://github.com/WU-CVGL/GS-Reasoner)

<div style="text-align: center;">
    <img src="https://huggingface.co/spaces/ymccccc/GS-Reasoner/resolve/main/assets/teaser.png" width="100%">
</div>

## Model Weights
We provide two pretrained model checkpoints:

*   **[GS-Reasoner](https://huggingface.co/ymccccc/GS-Reasoner)** – the main model used in our paper, producing more deterministic chain-of-thought reasoning.
*   **[GS-Reasoner-Diverse](https://huggingface.co/ymccccc/GS-Reasoner-Diverse)** – a variant that generates more diverse chain-of-thought outputs at the cost of a minor performance drop (less than 1.0 point on VSI-Bench).

## Sample Usage

This section shows how to run inference with our pre-trained grounding models. The model is loaded through classes from the `pae` library, which can be installed from the [GitHub repository](https://github.com/WU-CVGL/GS-Reasoner).

**Notes:** Our models accept images of any size as input. Model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, convert these relative coordinates back to the original image dimensions, as in the snippet below.
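
For example, converting a predicted box back to pixel coordinates could look like this (a small helper written for this card, not part of the released code):

```python
# Map a box in the model's relative 0-1000 range to pixel coordinates.
def to_pixels(box, width, height):
    x1, y1, x2, y2 = box  # (top-left, bottom-right) in [0, 1000]
    return (x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height)

print(to_pixels((250, 400, 750, 900), width=1280, height=720))
# (320.0, 288.0, 960.0, 648.0)
```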

First, set up your environment as described in the [official GitHub repository's Setup section](https://github.com/WU-CVGL/GS-Reasoner#setup). This typically involves:
```bash
conda create -n gs-reasoner python=3.11 -y
conda activate gs-reasoner
git clone git@github.com:WU-CVGL/GS-Reasoner.git
cd GS-Reasoner
pip install -e .
# Install submodule dependencies as well, e.g., for Sonata and VSI-Bench evaluation
cd llava/submodules/sonata && pip install -r requirements.txt && cd ../../..
cd llava/submodules/lmms_eval && pip install -r requirements.txt && cd ../../..
```

Inference code example:
```python
import os
from types import SimpleNamespace

from accelerate import Accelerator
from tqdm import tqdm

from pae.models import LlavaAgent
from pae.environment.webgym import BatchedWebEnv

# ============= Instantiate the agent =============
config_dict = {
    "use_lora": False,
    "use_q4": False,  # the 34B model is quantized to 4-bit; set to True when using it
    "use_anyres": False,
    "temperature": 1.0,
    "max_new_tokens": 512,
    "train_vision": False,
    "num_beams": 1,
}
config = SimpleNamespace(**config_dict)

accelerator = Accelerator()
agent = LlavaAgent(policy_lm="ymccccc/GS-Reasoner",  # or "ymccccc/GS-Reasoner-Diverse"
                   device=accelerator.device,
                   accelerator=accelerator,
                   config=config)

# ============= Instantiate the environment =============
test_tasks = [{"web_name": "Google Map",
               "id": "0",
               "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
               "web": "https://www.google.com/maps/"}]
save_path = "xxx"  # Placeholder, adapt for your needs

test_env = BatchedWebEnv(tasks=test_tasks,
                         do_eval=False,
                         download_dir=os.path.join(save_path, 'test_driver', 'download'),
                         output_dir=os.path.join(save_path, 'test_driver', 'output'),
                         batch_size=1,
                         max_iter=10)
# histories for inspecting the run afterwards
image_histories = []   # paths of the observed images
action_histories = []  # actions taken by the agent

results = test_env.reset()
image_histories.append(results[0][0]["image"])

observations = [r[0] for r in results]
actions = agent.get_action(observations)
action_histories.append(actions[0])
dones = None

for _ in tqdm(range(3)):
    if dones is not None and all(dones):
        break
    results = test_env.step(actions)
    image_histories.append(results[0][0]["image"])
    observations = [r[0] for r in results]
    actions = agent.get_action(observations)
    action_histories.append(actions[0])
    dones = [r[2] for r in results]

print("Done!")
print("image_histories: ", image_histories)
print("action_histories: ", action_histories)
```
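
Since the card declares `library_name: transformers`, the checkpoint may also be loadable directly through the Hugging Face Auto classes. This is an untested sketch under that assumption; the `pae`-based path above is the documented one:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ymccccc/GS-Reasoner"  # or "ymccccc/GS-Reasoner-Diverse"

# trust_remote_code is assumed to be required for the custom LLaVA-style
# architecture; verify against the official repository before relying on it.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```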

## Citation
If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@misc{chen2025gs-reasoner,
    title={Reasoning in Space via Grounding in the World},
    author={Yiming Chen and Zekun Qi and Wenyao Zhang and Xin Jin and Li Zhang and Peidong Liu},
    year={2025},
    eprint={2510.13800},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2510.13800},
}
```