Add pipeline tag and paper link (#4)
Opened by nielsr (HF Staff)

README.md (CHANGED)
@@ -1,8 +1,9 @@
---
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
library_name: transformers
license: cc-by-nc-4.0
pipeline_tag: image-to-3d
---

# SpatialLM-Qwen-0.5B
@@ -26,15 +27,17 @@ base_model:
</div>
<div align="center" style="line-height: 1;">
<a href="https://huggingface.co/manycore-research/SpatialLM-Llama-1B" target="_blank" style="margin: 2px;"><img alt="Hugging Face" src="https://img.shields.io/badge/🤗%20Hugging%20Face-SpatialLM%201B-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>
<a href="https://huggingface.co/datasets/manycore-research/SpatialLM-Testset" target="_blank" style="margin: 2px;"><img alt="Dataset" src="https://img.shields.io/badge/🤗%20Dataset-SpatialLM-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>
</div>

## Introduction

SpatialLM is a 3D large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, and windows, as well as oriented object bounding boxes with their semantic categories. Unlike previous methods that require specialized equipment for data collection, SpatialLM can handle point clouds from diverse sources such as monocular video sequences, RGBD images, and LiDAR sensors. This multimodal architecture effectively bridges the gap between unstructured 3D geometric data and structured 3D representations, offering high-level semantic understanding. It enhances spatial reasoning capabilities for applications in embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.

This repository contains the model described in [SpatialLM: Training Large Language Models for Structured Indoor Modeling](https://huggingface.co/papers/2506.07491).
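The oriented object bounding boxes mentioned above are typically parameterized by a center, a size, and a heading (yaw) angle. As a minimal illustration of that representation (the parameterization here is assumed for the example; the repository's layout format defines the model's actual output schema), the 2D footprint corners of a yaw-only box can be recovered as:

```python
import math

# Illustrative sketch: corners of a yaw-only oriented box footprint.
# The (center, size, heading) parameterization is assumed for this example,
# not necessarily SpatialLM's exact output schema.
def obb_footprint(cx, cy, sx, sy, yaw):
    c, s = math.cos(yaw), math.sin(yaw)
    half = [(-sx / 2, -sy / 2), (sx / 2, -sy / 2),
            (sx / 2, sy / 2), (-sx / 2, sy / 2)]
    # Rotate each half-size offset by yaw, then translate by the center.
    return [(cx + c * x - s * y, cy + s * x + c * y) for x, y in half]
```

With yaw = 0 this reduces to the familiar axis-aligned box.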

<div align="center">
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/63efbb1efc92a63ac81126d0/3bz_jNRCLD2L9uj11HPnP.mp4" poster="https://cdn-uploads.huggingface.co/production/uploads/63efbb1efc92a63ac81126d0/euo94dNx28qBNe51_oiB1.png"></video>
<p><i>SpatialLM reconstructs the 3D layout from a monocular RGB video with MASt3R-SLAM. Results are aligned to the video with GT cameras for visualization.</i></p>
@@ -44,10 +47,12 @@ SpatialLM is a 3D large language model designed to process 3D point cloud data a

<div align="center">

| **Model**              | **Download**                                                                      |
| :--------------------: | --------------------------------------------------------------------------------- |
| SpatialLM1.1-Llama-1B  | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM1.1-Llama-1B)  |
| SpatialLM1.1-Qwen-0.5B | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM1.1-Qwen-0.5B) |
| SpatialLM1.0-Llama-1B  | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM-Llama-1B)     |
| SpatialLM1.0-Qwen-0.5B | [🤗 HuggingFace](https://huggingface.co/manycore-research/SpatialLM-Qwen-0.5B)    |

</div>
@@ -91,7 +96,23 @@ huggingface-cli download manycore-research/SpatialLM-Testset pcd/scene0000_00.pl
Run inference:

```bash
python inference.py --point_cloud pcd/scene0000_00.ply --output scene0000_00.txt --model_path manycore-research/SpatialLM1.1-Qwen-0.5B
```

### Detection with user-specified categories

SpatialLM1.1 supports object detection conditioned on user-specified categories by leveraging the flexibility of LLMs.

SpatialLM1.1 offers three variants of structured indoor modeling tasks:

- **Structured Reconstruction**: Detect walls, doors, windows, and boxes.
- **Layout Estimation**: Detect walls, doors, and windows.
- **3D Object Detection**: Detect boxes.

For tasks that include object box estimation, you can specify a subset of the 59 furniture categories, and the model will only predict objects within those specified categories. For example:

```bash
python inference.py --point_cloud pcd/scene0000_00.ply --output scene0000_00.txt --model_path manycore-research/SpatialLM1.1-Qwen-0.5B --detect_type object --category bed nightstand
```
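Conceptually, the `--category` flag restricts the set of labels the model is allowed to emit. A minimal sketch of that filtering idea (`filter_by_category` is a hypothetical helper written for illustration; it is not part of inference.py):

```python
# Hypothetical helper for illustration only; not part of inference.py.
# Keeps predicted objects whose category is in a user-specified subset,
# mirroring the effect of --category on the model's outputs.
def filter_by_category(predictions, allowed_categories):
    allowed = {c.lower() for c in allowed_categories}
    return [p for p in predictions if p["category"].lower() in allowed]

# Toy predictions; the real output is a structured layout text file.
predictions = [
    {"category": "bed"},
    {"category": "chair"},
    {"category": "nightstand"},
]
kept = filter_by_category(predictions, ["bed", "nightstand"])
```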
### Visualization

@@ -119,13 +140,17 @@ huggingface-cli download manycore-research/SpatialLM-Testset --repo-type dataset
Run evaluation:

```bash
# Run inference on the PLY point clouds in folder SpatialLM-Testset/pcd with the SpatialLM1.1-Qwen-0.5B model
python inference.py --point_cloud SpatialLM-Testset/pcd --output SpatialLM-Testset/pred --model_path manycore-research/SpatialLM1.1-Qwen-0.5B

# Evaluate the predicted layouts
python eval.py --metadata SpatialLM-Testset/test.csv --gt_dir SpatialLM-Testset/layout --pred_dir SpatialLM-Testset/pred --label_mapping SpatialLM-Testset/benchmark_categories.tsv
```

### Example using a custom video

We provide an example of how to use our model to estimate the scene layout from an RGB video with the newly released [SLAM3R](https://github.com/PKU-VCL-3DV/SLAM3R) in [EXAMPLE.md](EXAMPLE.md). These steps also work with MASt3R-SLAM and other reconstruction methods.

## SpatialLM Testset

We provide a test set of 107 preprocessed point clouds reconstructed from RGB videos using [MASt3R-SLAM](https://github.com/rmurai0610/MASt3R-SLAM). SpatialLM-Testset is considerably more challenging than prior datasets of clean RGBD scans, owing to the noise and occlusions in point clouds reconstructed from monocular RGB videos.
@@ -140,38 +165,66 @@ We provide a test set of 107 preprocessed point clouds, reconstructed from RGB v

## Benchmark Results

### Layout Estimation

Layout estimation focuses on predicting architectural elements, i.e., walls, doors, and windows, within an indoor scene. We evaluated this task on the [Structured3D](https://structured3d-dataset.org) dataset. For [RoomFormer](https://github.com/ywyue/RoomFormer), we directly downloaded the model checkpoint. SceneScript and SpatialLM were first trained on our dataset, and further fine-tuned on Structured3D.

<div align="center">

| **Method**      | **RoomFormer** | **SceneScript (finetuned)** | **SpatialLM1.1-Qwen-0.5B (finetuned)** |
| :-------------: | :------------: | :-------------------------: | :------------------------------------: |
| **F1 @.25 IoU** | 70.4           | 83.1                        | 86.5                                   |
| **F1 @.5 IoU**  | 67.2           | 80.8                        | 84.6                                   |

</div>

### 3D Object Detection

We evaluated 3D object detection on [ScanNet](http://www.scan-net.org) with annotations for 18 object categories. For [V-DETR](https://github.com/V-DETR/V-DETR), we directly downloaded the model checkpoint. SceneScript and SpatialLM were first trained on our dataset, and further fine-tuned on ScanNet.

<div align="center">

| **Method**      | **V-DETR** | **SceneScript (finetuned)** | **SpatialLM1.1-Qwen-0.5B (finetuned)** |
| :-------------: | :--------: | :-------------------------: | :------------------------------------: |
| **F1 @.25 IoU** | 65.1       | 49.1                        | 65.6                                   |
| **F1 @.5 IoU**  | 56.8       | 36.8                        | 52.6                                   |

</div>
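The F1 @ IoU numbers in these tables follow the standard detection recipe: match each prediction to an unmatched ground-truth box at or above the IoU threshold, then take the harmonic mean of precision and recall. A minimal sketch with axis-aligned 3D boxes and greedy best-overlap matching (for illustration only; the official eval.py may differ in matching details and box representation):

```python
# Minimal illustration of F1 at an IoU threshold for axis-aligned 3D boxes,
# given as (xmin, ymin, zmin, xmax, ymax, zmax) tuples. The official eval.py
# may differ in matching details and box representation.
def iou_3d(a, b):
    ix = max(0.0, min(a[3], b[3]) - max(a[0], b[0]))
    iy = max(0.0, min(a[4], b[4]) - max(a[1], b[1]))
    iz = max(0.0, min(a[5], b[5]) - max(a[2], b[2]))
    inter = ix * iy * iz

    def vol(c):
        return (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])

    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0

def f1_at_iou(preds, gts, thresh=0.25):
    matched, tp = set(), 0
    for p in preds:
        # Greedily pick the unmatched ground-truth box with the best overlap.
        best_i, best_iou = -1, 0.0
        for i, g in enumerate(gts):
            if i in matched:
                continue
            iou = iou_3d(p, g)
            if iou > best_iou:
                best_i, best_iou = i, iou
        if best_i >= 0 and best_iou >= thresh:
            matched.add(best_i)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```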

### Zero-shot Detection on Videos

Zero-shot detection results on the challenging SpatialLM-Testset are reported in the following table:

<div align="center">

| **Method**      | **SpatialLM1.1-Llama-1B** | **SpatialLM1.1-Qwen-0.5B** |
| :-------------: | :-----------------------: | :------------------------: |
| **Layout**      | **F1 @.25 IoU (2D)**      | **F1 @.25 IoU (2D)**       |
| wall            | 68.9                      | 68.2                       |
| door            | 46.3                      | 43.1                       |
| window          | 43.8                      | 47.4                       |
|                 |                           |                            |
| **Objects**     | **F1 @.25 IoU (3D)**      | **F1 @.25 IoU (3D)**       |
| curtain         | 34.9                      | 37.0                       |
| nightstand      | 62.8                      | 67.0                       |
| chandelier      | 53.5                      | 36.8                       |
| wardrobe        | 29.4                      | 39.6                       |
| bed             | 96.8                      | 95.2                       |
| sofa            | 66.9                      | 69.1                       |
| chair           | 20.8                      | 32.3                       |
| cabinet         | 15.2                      | 11.2                       |
| dining table    | 40.7                      | 24.2                       |
| plants          | 29.5                      | 26.3                       |
| tv cabinet      | 34.4                      | 27.3                       |
| coffee table    | 56.4                      | 64.9                       |
| side table      | 14.6                      | 9.7                        |
| air conditioner | 16.7                      | 24.0                       |
| dresser         | 46.7                      | 46.7                       |
| stool           | 17.6                      | 30.8                       |
| refrigerator    | 0.0                       | 16.7                       |
| painting        | 34.9                      | 38.2                       |
| carpet          | 40.3                      | 24.1                       |
| tv              | 16.0                      | 18.0                       |

</div>
@@ -180,18 +233,23 @@ Benchmark results on the challenging SpatialLM-Testset are reported in the follo
SpatialLM-Llama-1B is derived from Llama3.2-1B-Instruct, which is licensed under the Llama3.2 license.
SpatialLM-Qwen-0.5B is derived from the Qwen-2.5 series, originally licensed under the Apache 2.0 License.

SpatialLM1.0 models are built upon the SceneScript point cloud encoder, which is licensed under the CC-BY-NC-4.0 License. TorchSparse, utilized in this project, is licensed under the MIT License.

SpatialLM1.1 models are built upon the Sonata point cloud encoder, whose model weights are licensed under the CC-BY-NC-4.0 License. Code built on Pointcept is licensed under the Apache 2.0 License.

## Citation

If you find this work useful, please consider citing:

```bibtex
@article{SpatialLM,
  title         = {SpatialLM: Training Large Language Models for Structured Indoor Modeling},
  author        = {Mao, Yongsen and Zhong, Junhao and Fang, Chuan and Zheng, Jia and Tang, Rui and Zhu, Hao and Tan, Ping and Zhou, Zihan},
  journal       = {arXiv preprint},
  year          = {2025},
  eprint        = {2506.07491},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
```
@@ -199,4 +257,4 @@ If you find this work useful, please consider citing:

We would like to thank the following projects that made this work possible:

[Llama3.2](https://github.com/meta-llama) | [Qwen2.5](https://github.com/QwenLM/Qwen2.5) | [Transformers](https://github.com/huggingface/transformers) | [SceneScript](https://github.com/facebookresearch/scenescript) | [TorchSparse](https://github.com/mit-han-lab/torchsparse) | [Sonata](https://xywu.me/sonata/) | [Pointcept](https://github.com/Pointcept/Pointcept)