Add pipeline tag and paper link

#4 opened by nielsr (HF Staff)
Files changed (1): README.md (+106 -48)
@@ -1,8 +1,9 @@
  ---
- license: cc-by-nc-4.0
- library_name: transformers
  base_model:
- - Qwen/Qwen2.5-0.5B-Instruct
  ---

  # SpatialLM-Qwen-0.5B
@@ -26,15 +27,17 @@ base_model:
  </div>
  <div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/manycore-research/SpatialLM-Llama-1B" target="_blank" style="margin: 2px;"><img alt="Hugging Face"
- src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-SpatialLM%201B-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>
  <a href="https://huggingface.co/datasets/manycore-research/SpatialLM-Testset" target="_blank" style="margin: 2px;"><img alt="Dataset"
- src="https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-SpatialLM-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>
  </div>

  ## Introduction

  SpatialLM is a 3D large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object bounding boxes with their semantic categories. Unlike previous methods that require specialized equipment for data collection, SpatialLM can handle point clouds from diverse sources such as monocular video sequences, RGBD images, and LiDAR sensors. This multimodal architecture effectively bridges the gap between unstructured 3D geometric data and structured 3D representations, offering high-level semantic understanding. It enhances spatial reasoning capabilities for applications in embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.

  <div align="center">
  <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/63efbb1efc92a63ac81126d0/3bz_jNRCLD2L9uj11HPnP.mp4" poster="https://cdn-uploads.huggingface.co/production/uploads/63efbb1efc92a63ac81126d0/euo94dNx28qBNe51_oiB1.png"></video>
  <p><i>SpatialLM reconstructs 3D layout from a monocular RGB video with MASt3R-SLAM. Results aligned to video with GT cameras for visualization.</i></p>
@@ -44,10 +47,12 @@ SpatialLM is a 3D large language model designed to process 3D point cloud data a

  <div align="center">

- | **Model** | **Download** |
- | :-----------------: | ------------------------------------------------------------------------------ |
- | SpatialLM-Llama-1B | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM-Llama-1B) |
- | SpatialLM-Qwen-0.5B | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM-Qwen-0.5B) |

  </div>
@@ -91,7 +96,23 @@ huggingface-cli download manycore-research/SpatialLM-Testset pcd/scene0000_00.pl
  Run inference:

  ```bash
- python inference.py --point_cloud pcd/scene0000_00.ply --output scene0000_00.txt --model_path manycore-research/SpatialLM-Qwen-0.5B
  ```

  ### Visualization
@@ -119,13 +140,17 @@ huggingface-cli download manycore-research/SpatialLM-Testset --repo-type dataset
  Run evaluation:

  ```bash
- # Run inference on the PLY point clouds in folder SpatialLM-Testset/pcd with SpatialLM-Qwen-0.5B model
- python inference.py --point_cloud SpatialLM-Testset/pcd --output SpatialLM-Testset/pred --model_path manycore-research/SpatialLM-Qwen-0.5B

  # Evaluate the predicted layouts
  python eval.py --metadata SpatialLM-Testset/test.csv --gt_dir SpatialLM-Testset/layout --pred_dir SpatialLM-Testset/pred --label_mapping SpatialLM-Testset/benchmark_categories.tsv
  ```

  ## SpatialLM Testset

  We provide a test set of 107 preprocessed point clouds, reconstructed from RGB videos using [MASt3R-SLAM](https://github.com/rmurai0610/MASt3R-SLAM). SpatialLM-Testset is quite challenging compared to prior clean RGBD scans datasets due to the noises and occlusions in the point clouds reconstructed from monocular RGB videos.
@@ -140,38 +165,66 @@ We provide a test set of 107 preprocessed point clouds, reconstructed from RGB v

  ## Benchmark Results

- Benchmark results on the challenging SpatialLM-Testset are reported in the following table:

  <div align="center">

- | **Method** | **SpatialLM-Llama-1B** | **SpatialLM-Qwen-0.5B** |
- | ---------------- | ---------------------- | ----------------------- |
- | **Floorplan** | **mean IoU** | |
- | wall | 78.62 | 74.81 |
- | | | |
- | **Objects** | **F1 @.25 IoU (3D)** | |
- | curtain | 27.35 | 28.59 |
- | nightstand | 57.47 | 54.39 |
- | chandelier | 38.92 | 40.12 |
- | wardrobe | 23.33 | 30.60 |
- | bed | 95.24 | 93.75 |
- | sofa | 65.50 | 66.15 |
- | chair | 21.26 | 14.94 |
- | cabinet | 8.47 | 8.44 |
- | dining table | 54.26 | 56.10 |
- | plants | 20.68 | 26.46 |
- | tv cabinet | 33.33 | 10.26 |
- | coffee table | 50.00 | 55.56 |
- | side table | 7.60 | 2.17 |
- | air conditioner | 20.00 | 13.04 |
- | dresser | 46.67 | 23.53 |
- | | | |
- | **Thin Objects** | **F1 @.25 IoU (2D)** |
- | painting | 50.04 | 53.81 |
- | carpet | 31.76 | 45.31 |
- | tv | 67.31 | 52.29 |
- | door | 50.35 | 42.15 |
- | window | 45.4 | 45.9 |

  </div>
@@ -180,18 +233,23 @@ Benchmark results on the challenging SpatialLM-Testset are reported in the follo
  SpatialLM-Llama-1B is derived from Llama3.2-1B-Instruct, which is licensed under the Llama3.2 license.
  SpatialLM-Qwen-0.5B is derived from the Qwen-2.5 series, originally licensed under the Apache 2.0 License.

- All models are built upon the SceneScript point cloud encoder, licensed under the CC-BY-NC-4.0 License. TorchSparse, utilized in this project, is licensed under the MIT License.

  ## Citation

  If you find this work useful, please consider citing:

  ```bibtex
- @misc{spatiallm,
- title = {SpatialLM: Large Language Model for Spatial Understanding},
- author = {ManyCore Research Team},
- howpublished = {\url{https://github.com/manycore-research/SpatialLM}},
- year = {2025}
  }
  ```

@@ -199,4 +257,4 @@ If you find this work useful, please consider citing:

  We would like to thank the following projects that made this work possible:

- [Llama3.2](https://github.com/meta-llama) | [Qwen2.5](https://github.com/QwenLM/Qwen2.5) | [Transformers](https://github.com/huggingface/transformers) | [SceneScript](https://github.com/facebookresearch/scenescript) | [TorchSparse](https://github.com/mit-han-lab/torchsparse)
 
  ---
  base_model:
+ - Qwen/Qwen2.5-0.5B-Instruct
+ library_name: transformers
+ license: cc-by-nc-4.0
+ pipeline_tag: image-to-3d
  ---

  # SpatialLM-Qwen-0.5B
 
  </div>
  <div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/manycore-research/SpatialLM-Llama-1B" target="_blank" style="margin: 2px;"><img alt="Hugging Face"
+ src="https://img.shields.io/badge/🤗%20Hugging%20Face-SpatialLM%201B-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>
  <a href="https://huggingface.co/datasets/manycore-research/SpatialLM-Testset" target="_blank" style="margin: 2px;"><img alt="Dataset"
+ src="https://img.shields.io/badge/🤗%20Dataset-SpatialLM-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>
  </div>

  ## Introduction

  SpatialLM is a 3D large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object bounding boxes with their semantic categories. Unlike previous methods that require specialized equipment for data collection, SpatialLM can handle point clouds from diverse sources such as monocular video sequences, RGBD images, and LiDAR sensors. This multimodal architecture effectively bridges the gap between unstructured 3D geometric data and structured 3D representations, offering high-level semantic understanding. It enhances spatial reasoning capabilities for applications in embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.

+ This repository contains the model described in [SpatialLM: Training Large Language Models for Structured Indoor Modeling](https://huggingface.co/papers/2506.07491).
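The oriented object bounding boxes mentioned above are commonly parameterized by a center, a rotation about the vertical axis, and per-axis extents. As a minimal, generic sketch (not SpatialLM's own code — the repository's exact output convention may differ), such a box can be expanded to its 8 corners like this:

```python
import math

def obb_corners(center, yaw, scale):
    """Return the 8 corners of an oriented 3D box.

    center: (cx, cy, cz) box center
    yaw:    rotation about the z-axis, in radians
    scale:  (sx, sy, sz) full extents along the box axes

    Generic illustration only; check the repository for the actual
    box convention used by SpatialLM.
    """
    cx, cy, cz = center
    sx, sy, sz = scale
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-0.5, 0.5):
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                # Local offset, rotated about z, then translated to the center.
                lx, ly, lz = dx * sx, dy * sy, dz * sz
                corners.append((
                    cx + lx * cos_y - ly * sin_y,
                    cy + lx * sin_y + ly * cos_y,
                    cz + lz,
                ))
    return corners

# Axis-aligned unit box at the origin: first corner is (-0.5, -0.5, -0.5).
print(obb_corners((0, 0, 0), 0.0, (1, 1, 1))[0])
```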
+
  <div align="center">
  <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/63efbb1efc92a63ac81126d0/3bz_jNRCLD2L9uj11HPnP.mp4" poster="https://cdn-uploads.huggingface.co/production/uploads/63efbb1efc92a63ac81126d0/euo94dNx28qBNe51_oiB1.png"></video>
  <p><i>SpatialLM reconstructs 3D layout from a monocular RGB video with MASt3R-SLAM. Results aligned to video with GT cameras for visualization.</i></p>
 
  <div align="center">

+ | **Model** | **Download** |
+ | :--------------------: | --------------------------------------------------------------------------------- |
+ | SpatialLM1.1-Llama-1B | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM1.1-Llama-1B) |
+ | SpatialLM1.1-Qwen-0.5B | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM1.1-Qwen-0.5B) |
+ | SpatialLM1.0-Llama-1B | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM-Llama-1B) |
+ | SpatialLM1.0-Qwen-0.5B | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM-Qwen-0.5B) |

  </div>
 
  Run inference:

  ```bash
+ python inference.py --point_cloud pcd/scene0000_00.ply --output scene0000_00.txt --model_path manycore-research/SpatialLM1.1-Qwen-0.5B
+ ```
+
+ ### Detection with user-specified categories
+
+ SpatialLM1.1 supports object detection conditioned on user-specified categories by leveraging the flexibility of LLMs.
+
+ SpatialLM1.1 offers three variants of structured indoor modeling tasks:
+
+ - **Structured Reconstruction**: Detect walls, doors, windows, and object boxes.
+ - **Layout Estimation**: Detect walls, doors, and windows.
+ - **3D Object Detection**: Detect object boxes.
+
+ For tasks that include object box estimation, you can specify a subset of the 59 furniture categories, and the model will only predict objects within those categories. For example:
+
+ ```bash
+ python inference.py --point_cloud pcd/scene0000_00.ply --output scene0000_00.txt --model_path manycore-research/SpatialLM1.1-Qwen-0.5B --detect_type object --category bed nightstand
  ```

  ### Visualization
 
  Run evaluation:

  ```bash
+ # Run inference on the PLY point clouds in SpatialLM-Testset/pcd with the SpatialLM1.1-Qwen-0.5B model
+ python inference.py --point_cloud SpatialLM-Testset/pcd --output SpatialLM-Testset/pred --model_path manycore-research/SpatialLM1.1-Qwen-0.5B

  # Evaluate the predicted layouts
  python eval.py --metadata SpatialLM-Testset/test.csv --gt_dir SpatialLM-Testset/layout --pred_dir SpatialLM-Testset/pred --label_mapping SpatialLM-Testset/benchmark_categories.tsv
  ```
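The F1 @ IoU metrics reported below follow the usual detection recipe: match predicted boxes one-to-one to ground truth at an IoU threshold, then combine precision and recall. A simplified sketch of the idea, assuming axis-aligned 2D boxes and greedy matching (the actual `eval.py` handles the benchmark's own box formats and matching rules):

```python
def iou_2d(a, b):
    """IoU of two axis-aligned 2D boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def f1_at_iou(preds, gts, thresh=0.25):
    """Greedily match predictions to ground truth at an IoU threshold, return F1."""
    matched, used = 0, set()
    for p in preds:
        best, best_iou = None, thresh
        for i, g in enumerate(gts):
            if i in used:
                continue
            iou = iou_2d(p, g)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            used.add(best)
            matched += 1
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if matched else 0.0
```

A perfectly overlapping prediction yields F1 = 1.0, a fully disjoint one 0.0.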
+ ### Example using a custom video
+
+ We provide an example of how to use our model to estimate the scene layout starting from an RGB video with the newly released [SLAM3R](https://github.com/PKU-VCL-3DV/SLAM3R) in [EXAMPLE.md](EXAMPLE.md). These steps also work with MASt3R-SLAM and other reconstruction methods.
+
  ## SpatialLM Testset

  We provide a test set of 107 preprocessed point clouds reconstructed from RGB videos using [MASt3R-SLAM](https://github.com/rmurai0610/MASt3R-SLAM). SpatialLM-Testset is considerably more challenging than prior clean RGBD scan datasets due to the noise and occlusions in point clouds reconstructed from monocular RGB videos.
 
  ## Benchmark Results

+ ### Layout Estimation
+
+ Layout estimation focuses on predicting architectural elements, i.e., walls, doors, and windows, within an indoor scene. We evaluate this task on the [Structured3D](https://structured3d-dataset.org) dataset. For [RoomFormer](https://github.com/ywyue/RoomFormer), we directly use the released model checkpoint. SceneScript and SpatialLM were first trained on our dataset and then fine-tuned on Structured3D.

  <div align="center">

+ | **Method** | **RoomFormer** | **SceneScript (finetuned)** | **SpatialLM1.1-Qwen-0.5B (finetuned)** |
+ | :-------------: | :------------: | :-------------------------: | :------------------------------------: |
+ | **F1 @.25 IoU** | 70.4 | 83.1 | 86.5 |
+ | **F1 @.5 IoU** | 67.2 | 80.8 | 84.6 |
+
+ </div>
+
+ ### 3D Object Detection
+
+ We evaluate 3D object detection on [ScanNet](http://www.scan-net.org) with annotations for 18 object categories. For [V-DETR](https://github.com/V-DETR/V-DETR), we directly use the released model checkpoint. SceneScript and SpatialLM were first trained on our dataset and then fine-tuned on ScanNet.
+
+ <div align="center">
+
+ | **Method** | **V-DETR** | **SceneScript (finetuned)** | **SpatialLM1.1-Qwen-0.5B (finetuned)** |
+ | :-------------: | :--------: | :-------------------------: | :------------------------------------: |
+ | **F1 @.25 IoU** | 65.1 | 49.1 | 65.6 |
+ | **F1 @.5 IoU** | 56.8 | 36.8 | 52.6 |
+
+ </div>
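For intuition, 3D IoU reduces, for axis-aligned boxes, to a per-axis overlap product over the union volume. A minimal sketch of that simplified special case (the benchmark's boxes are oriented, so real evaluation needs oriented-box intersection):

```python
def iou_3d_aabb(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    Simplified illustration only: oriented boxes require a more general
    intersection computation than this axis-aligned special case.
    """
    inter = 1.0
    for i in range(3):
        # Overlap length along axis i; no overlap means zero IoU.
        side = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if side <= 0:
            return 0.0
        inter *= side

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)
```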
+
+ ### Zero-shot Detection on Videos
+
+ Zero-shot detection results on the challenging SpatialLM-Testset are reported in the following table:
+
+ <div align="center">
+
+ | **Method** | **SpatialLM1.1-Llama-1B** | **SpatialLM1.1-Qwen-0.5B** |
+ | :-------------: | :-----------------------: | :------------------------: |
+ | **Layout** | **F1 @.25 IoU (2D)** | **F1 @.25 IoU (2D)** |
+ | wall | 68.9 | 68.2 |
+ | door | 46.3 | 43.1 |
+ | window | 43.8 | 47.4 |
+ | | | |
+ | **Objects** | **F1 @.25 IoU (3D)** | **F1 @.25 IoU (3D)** |
+ | curtain | 34.9 | 37.0 |
+ | nightstand | 62.8 | 67.0 |
+ | chandelier | 53.5 | 36.8 |
+ | wardrobe | 29.4 | 39.6 |
+ | bed | 96.8 | 95.2 |
+ | sofa | 66.9 | 69.1 |
+ | chair | 20.8 | 32.3 |
+ | cabinet | 15.2 | 11.2 |
+ | dining table | 40.7 | 24.2 |
+ | plants | 29.5 | 26.3 |
+ | tv cabinet | 34.4 | 27.3 |
+ | coffee table | 56.4 | 64.9 |
+ | side table | 14.6 | 9.7 |
+ | air conditioner | 16.7 | 24.0 |
+ | dresser | 46.7 | 46.7 |
+ | stool | 17.6 | 30.8 |
+ | refrigerator | 0.0 | 16.7 |
+ | painting | 34.9 | 38.2 |
+ | carpet | 40.3 | 24.1 |
+ | tv | 16.0 | 18.0 |

  </div>
  SpatialLM-Llama-1B is derived from Llama3.2-1B-Instruct, which is licensed under the Llama3.2 license.
  SpatialLM-Qwen-0.5B is derived from the Qwen-2.5 series, originally licensed under the Apache 2.0 License.

+ The SpatialLM1.0 models are built upon the SceneScript point cloud encoder, which is licensed under the CC-BY-NC-4.0 License. TorchSparse, utilized in this project, is licensed under the MIT License.
+
+ The SpatialLM1.1 models are built upon the Sonata point cloud encoder, whose model weights are licensed under the CC-BY-NC-4.0 License. Code built on Pointcept is licensed under the Apache 2.0 License.

  ## Citation
  If you find this work useful, please consider citing:

  ```bibtex
+ @article{SpatialLM,
+ title = {SpatialLM: Training Large Language Models for Structured Indoor Modeling},
+ author = {Mao, Yongsen and Zhong, Junhao and Fang, Chuan and Zheng, Jia and Tang, Rui and Zhu, Hao and Tan, Ping and Zhou, Zihan},
+ journal = {arXiv preprint},
+ year = {2025},
+ eprint = {2506.07491},
+ archivePrefix = {arXiv},
+ primaryClass = {cs.CV}
  }
  ```

  We would like to thank the following projects that made this work possible:

+ [Llama3.2](https://github.com/meta-llama) | [Qwen2.5](https://github.com/QwenLM/Qwen2.5) | [Transformers](https://github.com/huggingface/transformers) | [SceneScript](https://github.com/facebookresearch/scenescript) | [TorchSparse](https://github.com/mit-han-lab/torchsparse) | [Sonata](https://xywu.me/sonata/) | [Pointcept](https://github.com/Pointcept/Pointcept)