Improve model card: Add metadata, links, and usage for GS-Reasoner

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +125 -3
README.md CHANGED
@@ -1,3 +1,125 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ ---
+
+ # Reasoning in Space via Grounding in the World
+
+ We present **Grounded-Spatial Reasoner (GS-Reasoner)**, the first 3D-LLM that bridges 3D visual grounding and spatial reasoning, as explored in the paper [Reasoning in Space via Grounding in the World](https://huggingface.co/papers/2510.13800).
+
+ The goal of GS-Reasoner is to explore effective spatial representations that bridge the gap between 3D visual grounding and spatial reasoning. Existing 3D LLMs lack a unified 3D representation capable of jointly capturing semantic and geometric information, which leads to either poor grounding performance or excessive reliance on external modules. GS-Reasoner addresses this with a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation. This enables GS-Reasoner to perform autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning.
+
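+ To make the idea concrete, here is a minimal, illustrative PyTorch sketch of a dual-path pooling layer that averages point-level geometric features into the image-patch grid along a semantic path and a positional path before fusing them. This is **not** the paper's implementation; every module name, dimension, and the point-to-patch assignment are assumptions made purely for illustration.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class DualPathPooling(nn.Module):
+     """Illustrative sketch: pool per-point geometric features into image-patch tokens."""
+     def __init__(self, geo_dim=384, patch_dim=1024):
+         super().__init__()
+         self.sem_proj = nn.Linear(geo_dim, patch_dim)  # path 1: align with semantic patch features
+         self.pos_proj = nn.Linear(geo_dim, patch_dim)  # path 2: align with positional (patch-location) cues
+         self.fuse = nn.Linear(2 * patch_dim, patch_dim)
+
+     def forward(self, geo_feats, point2patch, num_patches):
+         # geo_feats: (N_points, geo_dim); point2patch: (N_points,) index of the patch each point projects to
+         paths = torch.cat([self.sem_proj(geo_feats), self.pos_proj(geo_feats)], dim=-1)
+         pooled = paths.new_zeros(num_patches, paths.shape[-1])
+         counts = paths.new_zeros(num_patches, 1)
+         pooled.index_add_(0, point2patch, paths)
+         counts.index_add_(0, point2patch, torch.ones(len(point2patch), 1, dtype=paths.dtype, device=paths.device))
+         pooled = pooled / counts.clamp(min=1)  # average-pool the points that fall into each patch
+         return self.fuse(pooled)               # (num_patches, patch_dim) geometric tokens
+
+ # Toy usage: 2048 points pooled into 576 patch tokens.
+ layer = DualPathPooling()
+ tokens = layer(torch.randn(2048, 384), torch.randint(0, 576, (2048,)), 576)
+ print(tokens.shape)  # torch.Size([576, 1024])
+ ```
+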
+ **Project Page**: [https://yiming-cc.github.io/gs-reasoner/](https://yiming-cc.github.io/gs-reasoner/)
+ **Code**: [https://github.com/WU-CVGL/GS-Reasoner](https://github.com/WU-CVGL/GS-Reasoner)
+
+ <div style="text-align: center;">
+ <img src="https://huggingface.co/spaces/ymccccc/GS-Reasoner/resolve/main/assets/teaser.png" width="100%">
+ </div>
+
+ ## Model Weights
+ We provide two pretrained model checkpoints (a download snippet follows the list):
+
+ * **[GS-Reasoner](https://huggingface.co/ymccccc/GS-Reasoner)** – the main model used in our paper, producing more deterministic chain-of-thought reasoning.
+ * **[GS-Reasoner-Diverse](https://huggingface.co/ymccccc/GS-Reasoner-Diverse)** – a variant that generates more diverse chain-of-thought outputs with only a minor performance drop (less than 1.0 on VSI-Bench).
+
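+ If you prefer to fetch a checkpoint locally first, a minimal sketch using `huggingface_hub` is shown below; this step is optional and not part of the repository's own instructions.
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the main checkpoint; use "ymccccc/GS-Reasoner-Diverse" for the diverse variant.
+ local_dir = snapshot_download(repo_id="ymccccc/GS-Reasoner")
+ print("Checkpoint downloaded to:", local_dir)
+ ```
+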
+ ## Sample Usage
+
+ This section shows how to run inference with our pre-trained grounding models. The model can be loaded using classes from the `pae` library (which can be installed from the [GitHub repository](https://github.com/WU-CVGL/GS-Reasoner)).
+
+ **Note:** Our models accept images of any size as input. The model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions, as sketched below.
+
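+ A minimal conversion sketch (the helper name and example values are ours, purely for illustration):
+
+ ```python
+ def to_pixels(box_rel, width, height):
+     # box_rel: (x1, y1, x2, y2) in the model's 0-1000 relative coordinates
+     x1, y1, x2, y2 = box_rel
+     return (x1 / 1000 * width, y1 / 1000 * height,
+             x2 / 1000 * width, y2 / 1000 * height)
+
+ # e.g. on a 1920x1080 image: (250, 500, 750, 900) -> (480.0, 540.0, 1440.0, 972.0)
+ print(to_pixels((250, 500, 750, 900), 1920, 1080))
+ ```
+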
+ First, set up your environment as described in the [official GitHub repository's Setup section](https://github.com/WU-CVGL/GS-Reasoner#setup). This typically involves:
+ ```bash
+ conda create -n gs-reasoner python=3.11 -y
+ conda activate gs-reasoner
+ git clone git@github.com:WU-CVGL/GS-Reasoner.git
+ cd GS-Reasoner
+ pip install -e .
+ # Install submodule dependencies as well, e.g., for Sonata and VSI-Bench evaluation
+ cd llava/submodules/sonata && pip install -r requirements.txt && cd ../../..
+ cd llava/submodules/lmms_eval && pip install -r requirements.txt && cd ../../..
+ ```
+
+ Inference code example:
+ ```python
+ import os
+ from types import SimpleNamespace
+
+ import pae
+ import torch
+ from accelerate import Accelerator
+ from tqdm import tqdm
+
+ from pae.models import LlavaAgent, ClaudeAgent
+ from pae.environment.webgym import BatchedWebEnv
+ from llava.model.language_model.llava_mistral import LlavaMistralForCausalLM
+
+ # ============= Instantiate the agent =============
+ config_dict = {
+     "use_lora": False,
+     "use_q4": False,  # our 34B model is quantized to 4-bit; set this to True if you are using the 34B model
+     "use_anyres": False,
+     "temperature": 1.0,
+     "max_new_tokens": 512,
+     "train_vision": False,
+     "num_beams": 1,
+ }
+ config = SimpleNamespace(**config_dict)
+
+ accelerator = Accelerator()
+ agent = LlavaAgent(policy_lm="ymccccc/GS-Reasoner",  # or "ymccccc/GS-Reasoner-Diverse"
+                    device=accelerator.device,
+                    accelerator=accelerator,
+                    config=config)
+
+ # ============= Instantiate the environment =============
+ test_tasks = [{"web_name": "Google Map",
+                "id": "0",
+                "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
+                "web": "https://www.google.com/maps/"}]
+ save_path = "xxx"  # Placeholder, adapt for your needs
+
+ test_env = BatchedWebEnv(tasks=test_tasks,
+                          do_eval=False,
+                          download_dir=os.path.join(save_path, 'test_driver', 'download'),
+                          output_dir=os.path.join(save_path, 'test_driver', 'output'),
+                          batch_size=1,
+                          max_iter=10)
+
+ # For you to check the images and actions
+ image_histories = []   # stores the history of the paths of images
+ action_histories = []  # stores the history of actions
+
+ results = test_env.reset()
+ image_histories.append(results[0][0]["image"])
+
+ observations = [r[0] for r in results]
+ actions = agent.get_action(observations)
+ action_histories.append(actions[0])
+ dones = None
+
+ for _ in tqdm(range(3)):
+     if dones is not None and all(dones):
+         break
+     results = test_env.step(actions)
+     image_histories.append(results[0][0]["image"])
+     observations = [r[0] for r in results]
+     actions = agent.get_action(observations)
+     action_histories.append(actions[0])
+     dones = [r[2] for r in results]
+
+ print("Done!")
+ print("image_histories: ", image_histories)
+ print("action_histories: ", action_histories)
+ ```
+
+ ## Citation
+ If you find our work helpful or inspiring, please feel free to cite it.
+
+ ```bibtex
+ @misc{chen2025gs-reasoner,
+       title={Reasoning in Space via Grounding in the World},
+       author={Yiming Chen and Zekun Qi and Wenyao Zhang and Xin Jin and Li Zhang and Peidong Liu},
+       year={2025},
+       eprint={2510.13800},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2510.13800},
+ }
+ ```