Improve model card: Add metadata, links, and usage for GS-Reasoner
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,3 +1,125 @@

The previous card contained only the license front matter (`---` / `license: apache-2.0` / `---`); the updated card follows.
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Reasoning in Space via Grounding in the World

We present **Grounded-Spatial Reasoner (GS-Reasoner)**, the first 3D-LLM that bridges 3D visual grounding and spatial reasoning, as explored in the paper [Reasoning in Space via Grounding in the World](https://huggingface.co/papers/2510.13800).

The goal of GS-Reasoner is to explore effective spatial representations that bridge the gap between 3D visual grounding and spatial reasoning. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information, leading to either poor grounding performance or excessive reliance on external modules. GS-Reasoner addresses this with a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation. This enables GS-Reasoner to achieve autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning.
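
As a purely illustrative sketch (the tensor shapes, weighting scheme, and fusion below are our assumptions, not the paper's exact operator), dual-path pooling can be pictured as pooling per-point geometric features into the image-patch grid along two paths, one driven by positional assignment and one by semantic similarity, then fusing the results into a single patch-aligned representation:

```python
import torch

def dual_path_pool(geo, sem, patch_ids, num_patches):
    """Illustrative only: pool point-level geometric features into image-patch slots.

    geo:        (N, C) per-point geometric features
    sem:        (P, C) per-patch semantic features
    patch_ids:  (N,)   index of the patch each point projects into
    """
    # Positional path: average the points that project into each patch.
    pos = torch.zeros(num_patches, geo.size(1))
    pos.index_add_(0, patch_ids, geo)
    counts = torch.bincount(patch_ids, minlength=num_patches).clamp(min=1)
    pos = pos / counts.unsqueeze(1)
    # Semantic path: attention-style weights from patch tokens to points.
    attn = torch.softmax(sem @ geo.T / geo.size(1) ** 0.5, dim=-1)  # (P, N)
    sem_pooled = attn @ geo                                          # (P, C)
    # Fuse the two paths into one unified patch-based 3D representation.
    return 0.5 * (pos + sem_pooled)

geo = torch.randn(6, 8)                      # 6 points, 8-dim features
sem = torch.randn(4, 8)                      # 4 image patches
patch_ids = torch.tensor([0, 0, 1, 2, 3, 3])
print(dual_path_pool(geo, sem, patch_ids, 4).shape)  # torch.Size([4, 8])
```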

**Project Page**: [https://yiming-cc.github.io/gs-reasoner/](https://yiming-cc.github.io/gs-reasoner/)
**Code**: [https://github.com/WU-CVGL/GS-Reasoner](https://github.com/WU-CVGL/GS-Reasoner)

<div style="text-align: center;">
<img src="https://huggingface.co/spaces/ymccccc/GS-Reasoner/resolve/main/assets/teaser.png" width="100%" >
</div>

## Model Weights

We provide two pretrained model checkpoints:

* **[GS-Reasoner](https://huggingface.co/ymccccc/GS-Reasoner)** – the main model used in our paper, producing more deterministic chain-of-thought reasoning.
* **[GS-Reasoner-Diverse](https://huggingface.co/ymccccc/GS-Reasoner-Diverse)** – a variant that generates more diverse chain-of-thought outputs with only a minor performance drop (less than 1.0 on VSI-Bench).
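
The card's metadata declares `library_name: transformers`, so a generic load along the following lines may also work. This is a minimal sketch assuming the checkpoints expose standard Auto-class loading with `trust_remote_code`; the repository's own `LlavaAgent` wrapper in the usage section below is the documented path.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "ymccccc/GS-Reasoner"  # or "ymccccc/GS-Reasoner-Diverse"

# Assumption: standard Auto-class support; fall back to the repo's LlavaAgent otherwise.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```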

## Sample Usage

This section shows how to run inference with our pre-trained grounding models. The model can be loaded using classes from the `pae` library, which can be installed from the [GitHub repository](https://github.com/WU-CVGL/GS-Reasoner).

**Notes:** Our models accept images of any size as input. Model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions.
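
For example, with a hypothetical helper (not part of the repository), converting a bounding box from the 0-1000 relative range back to pixel coordinates looks like this:

```python
def to_pixels(box, width, height):
    """Map [x1, y1, x2, y2] from the 0-1000 relative range to pixel coordinates."""
    x1, y1, x2, y2 = box
    return [x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height]

# A box predicted on a 1920x1080 image:
print(to_pixels([250, 100, 750, 900], width=1920, height=1080))
# [480.0, 108.0, 1440.0, 972.0]
```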

First, set up your environment as described in the [official GitHub repository's Setup section](https://github.com/WU-CVGL/GS-Reasoner#setup). This typically involves:

```bash
conda create -n gs-reasoner python=3.11 -y
conda activate gs-reasoner
git clone git@github.com:WU-CVGL/GS-Reasoner.git
cd GS-Reasoner
pip install -e .
# Install submodule dependencies as well, e.g., for Sonata and VSI-Bench evaluation
cd llava/submodules/sonata && pip install -r requirements.txt && cd ../../..
cd llava/submodules/lmms_eval && pip install -r requirements.txt && cd ../../..
```

Inference code example:

```python
import os
from types import SimpleNamespace

import torch
from accelerate import Accelerator
from tqdm import tqdm

import pae
from pae.models import LlavaAgent, ClaudeAgent
from pae.environment.webgym import BatchedWebEnv
from llava.model.language_model.llava_mistral import LlavaMistralForCausalLM

# ============= Instantiate the agent =============
config_dict = {
    "use_lora": False,
    "use_q4": False,  # our 34B model is quantized to 4-bit; set this to True if you are using the 34B model
    "use_anyres": False,
    "temperature": 1.0,
    "max_new_tokens": 512,
    "train_vision": False,
    "num_beams": 1,
}
config = SimpleNamespace(**config_dict)

accelerator = Accelerator()
agent = LlavaAgent(policy_lm="ymccccc/GS-Reasoner",  # or "ymccccc/GS-Reasoner-Diverse"
                   device=accelerator.device,
                   accelerator=accelerator,
                   config=config)

# ============= Instantiate the environment =============
test_tasks = [{"web_name": "Google Map",
               "id": "0",
               "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
               "web": "https://www.google.com/maps/"}]
save_path = "xxx"  # Placeholder, adapt for your needs

test_env = BatchedWebEnv(tasks=test_tasks,
                         do_eval=False,
                         download_dir=os.path.join(save_path, 'test_driver', 'download'),
                         output_dir=os.path.join(save_path, 'test_driver', 'output'),
                         batch_size=1,
                         max_iter=10)

# Histories for you to check the images and actions
image_histories = []   # stores the history of the paths of images
action_histories = []  # stores the history of actions

results = test_env.reset()
image_histories.append(results[0][0]["image"])

observations = [r[0] for r in results]
actions = agent.get_action(observations)
action_histories.append(actions[0])
dones = None

for _ in tqdm(range(3)):
    if dones is not None and all(dones):
        break
    results = test_env.step(actions)
    image_histories.append(results[0][0]["image"])
    observations = [r[0] for r in results]
    actions = agent.get_action(observations)
    action_histories.append(actions[0])
    dones = [r[2] for r in results]

print("Done!")
print("image_histories: ", image_histories)
print("action_histories: ", action_histories)
```

## Citation

If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@misc{chen2025gs-reasoner,
      title={Reasoning in Space via Grounding in the World},
      author={Yiming Chen and Zekun Qi and Wenyao Zhang and Xin Jin and Li Zhang and Peidong Liu},
      year={2025},
      eprint={2510.13800},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13800},
}
```