---
license: cc-by-nc-4.0
library_name: warpconvnet
pipeline_tag: image-segmentation
tags:
- 3d
- point-cloud
- instance-segmentation
- open-vocabulary
- scannet
- scannet200
- scannetpp
- replica
- spaceformer
- warpconvnet
---

# SpaceFormer: Open-Vocabulary 3D Instance Segmentation

**SpaceFormer** performs **proposal-free, open-vocabulary 3D instance segmentation**.
A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs
on top of the WarpConvNet [`SpaCeFormer`](https://github.com/NVlabs/WarpConvNet) sparse
point backbone. A single forward pass over an RGB point cloud produces a fixed set of
query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP
feature against text embeddings of **arbitrary class names** (SigLIP2 text encoder, with
prompt ensembling). The vocabulary is chosen at inference time rather than baked into the
weights, so the model can be queried with any label set.

Project page: https://nvlabs.github.io/SpaCeFormer/

## Model details

- **Task:** open-vocabulary 3D instance segmentation on RGB point clouds.
- **Architecture:** WarpConvNet `SpaCeFormer` backbone (mixed space/curve sparse
  attention U-Net, `ssccc` encoder) → proposal-free query decoder (hidden dim 512,
  200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness +
  per-point mask + per-query CLIP heads. ~85.8M parameters.
- **CLIP/text embedding:** `google/siglip2-so400m-patch14-224` (1152-d), used only at
  inference to embed class names; not stored in this checkpoint.
- **Input:** point coordinates in meters + RGB; voxelized internally at 2 cm.
- **Naming:** `spaceformer_512_siglip2_ssccc` = hidden dim 512 · SigLIP2 embedding ·
  `ssccc` encoder attention (space, space, curve, curve, curve).
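
To make "voxelized internally at 2 cm" concrete, here is a minimal NumPy sketch of grid voxelization. It is an illustration only: the model performs this step itself, and the helper name `voxelize` and the choice to keep the first point found in each voxel are assumptions, not WarpConvNet code.

```python
import numpy as np

VOXEL_SIZE = 0.02  # 2 cm, as stated in the model details


def voxelize(coord: np.ndarray, voxel_size: float = VOXEL_SIZE):
    """Map points (in meters) to integer voxel indices, one entry per occupied voxel.

    Returns (voxel_idx, keep, inverse):
      voxel_idx [M,3] unique integer voxel coordinates,
      keep      [M]   index of the first input point falling in each voxel,
      inverse   [N]   voxel id (row of voxel_idx) for every input point.
    """
    grid = np.floor(coord / voxel_size).astype(np.int64)
    voxel_idx, keep, inverse = np.unique(
        grid, axis=0, return_index=True, return_inverse=True
    )
    # reshape guards against NumPy versions that return a 2-D inverse for axis=0
    return voxel_idx, keep, inverse.reshape(-1)


# The first two points fall into the same 2 cm cell, the third into a neighbor.
pts = np.array([[0.001, 0.001, 0.001], [0.015, 0.005, 0.019], [0.021, 0.0, 0.0]])
voxel_idx, keep, inverse = voxelize(pts)
print(len(pts), "points ->", len(voxel_idx), "voxels")
```

The `inverse` array is what lets per-voxel predictions be scattered back to the original points.
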

## Evaluation

Test-set mAP with the released recipe (**prompt ensembling on, TTA off, default
proposal-free post-processing**):

| Benchmark | mAP | mAP50 | recall (class-agnostic) |
|---|---:|---:|---:|
| ScanNet200 | **0.1265** | 0.210 | 0.756 |
| ScanNet++ | 0.2217 | - | - |
| Replica | 0.2644 | - | - |
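
The card does not spell out how the class-agnostic recall column is computed. As a hedged illustration of the general idea, the NumPy sketch below counts a ground-truth instance as recalled when some predicted mask overlaps it with IoU at or above a threshold, ignoring labels. The 0.5 threshold and the helper names `mask_iou` and `class_agnostic_recall` are assumptions, not the released evaluation code.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """IoU between boolean point masks: pred [N,P] and gt [N,G] -> [P,G]."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    inter = pred.T @ gt  # [P,G] number of shared points
    union = pred.sum(axis=0)[:, None] + gt.sum(axis=0)[None, :] - inter
    return inter / np.maximum(union, 1.0)


def class_agnostic_recall(pred: np.ndarray, gt: np.ndarray, iou_thresh: float = 0.5) -> float:
    """Fraction of GT instances matched by at least one prediction (labels ignored)."""
    if pred.shape[1] == 0 or gt.shape[1] == 0:
        return 0.0
    iou = mask_iou(pred, gt)
    return float((iou.max(axis=0) >= iou_thresh).mean())


# Toy scene: 6 points, two GT instances, one prediction covering points 0..3.
gt = np.zeros((6, 2), dtype=bool)
gt[:3, 0] = True
gt[3:, 1] = True
pred = np.zeros((6, 1), dtype=bool)
pred[:4, 0] = True
recall = class_agnostic_recall(pred, gt)
```

Because this measure ignores the predicted label, it isolates mask quality from the open-vocabulary classification step.
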

## How to use

The model lives in WarpConvNet as `warpconvnet.models.spaceformer` (the backbone needs
WarpConvNet's compiled CUDA extension: install a pre-built wheel or build from source).
It returns **raw** predictions; open-vocab labeling + mask post-processing live in the
demo repo / HuggingFace Space, not in WarpConvNet.

```python
import torch
from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint
from huggingface_hub import hf_hub_download

device = torch.device("cuda")
ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")

net = build_spaceformer(device=device)
load_spaceformer_checkpoint(net, ckpt)  # 487 tensors, strict=False

# Inputs you provide for your scene:
# coord [N,3] float meters; feat [N,3] RGB in [-1,1]; offset [0, N]
out = net({"coord": coord, "feat": feat, "offset": offset})
# raw outputs: {"logit":[B,Q,2], "mask":List[[N,Q]], "clip_feat":[B,Q,1152]}
```
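
As a rough illustration of what mask post-processing on these raw outputs involves, the NumPy sketch below scores each query with a softmax over its two `logit` channels, binarizes its mask column, and drops low-score or tiny masks. It rests on several assumptions that the demo repo may not share: channel 1 is treated as the "object" channel, mask entries are treated as logits thresholded at 0, the thresholds are arbitrary, and NMS is omitted. `select_instances` is a hypothetical helper, not part of WarpConvNet or the demo.

```python
import numpy as np


def select_instances(logit, mask, score_thresh=0.5, min_points=100):
    """logit [Q,2], mask [N,Q] -> (boolean masks [N,K], scores [K]) for kept queries."""
    # Numerically stable softmax over the two objectness channels.
    z = logit - logit.max(axis=1, keepdims=True)
    prob = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    score = prob[:, 1]  # ASSUMPTION: channel 1 means "is an object"
    binary = mask > 0.0  # ASSUMPTION: mask entries are logits
    keep = (score >= score_thresh) & (binary.sum(axis=0) >= min_points)
    return binary[:, keep], score[keep]


# Toy input: 3 queries over 4 points.
toy_logit = np.array([[0.0, 5.0], [5.0, 0.0], [0.0, 5.0]])
toy_mask = np.array([[1.0, 1.0, 1.0],
                     [1.0, 1.0, -1.0],
                     [1.0, 1.0, -1.0],
                     [1.0, 1.0, -1.0]])
kept_masks, kept_scores = select_instances(toy_logit, toy_mask, min_points=2)
```

With real outputs, one scene's tensors could be converted first, for example `out["logit"][0].detach().cpu().numpy()`.
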

To turn `clip_feat` into open-vocabulary labels (SigLIP2 text + prompt ensembling) and
clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space
(`pipeline.py`, `clip_eval.py`, `text_encoder.py`, `postprocessing.py`, `labels.py`),
e.g. its `inference.py` CLI or the Gradio `app.py`.
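
For intuition about the labeling step, here is a minimal NumPy sketch of prompt ensembling followed by nearest-text-embedding assignment. It assumes the common recipe of averaging normalized per-prompt text embeddings and picking the class with the highest cosine similarity; the demo's actual scoring may differ. The toy 4-d vectors stand in for real 1152-d SigLIP2 embeddings, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def ensemble_text_embeddings(text_emb):
    """text_emb [C,T,D], one embedding per class and prompt template -> [C,D]."""
    per_prompt = l2_normalize(text_emb)  # normalize each prompt embedding
    return l2_normalize(per_prompt.mean(axis=1))  # average over prompts, renormalize


def assign_labels(clip_feat, class_emb, class_names):
    """clip_feat [Q,D], class_emb [C,D] -> (label per query, best cosine similarity)."""
    sim = l2_normalize(clip_feat) @ class_emb.T  # [Q,C] cosine similarities
    best = sim.argmax(axis=1)
    return [class_names[i] for i in best], sim[np.arange(len(best)), best]


# Toy data: 2 classes x 2 prompt templates, 2 queries.
names = ["chair", "table"]
text_emb = np.array([[[1.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]],
                     [[0.0, 1.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]]])
query_feat = np.array([[0.9, 0.1, 0.0, 0.0], [0.2, 3.0, 0.0, 0.0]])
labels, scores = assign_labels(query_feat, ensemble_text_embeddings(text_emb), names)
```

Since the class embeddings are computed from text at inference time, changing the vocabulary only means re-running the text encoder, not touching the checkpoint.
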

## Demo (run locally)

A small local demo lives under [`demo/`](./demo). No GPU cloud or HF Space is needed; it
runs on your own machine (requires WarpConvNet with its compiled extension). It takes text
class names, runs segmentation, and shows the result in an interactive 3D
[**viser**](https://viser.studio) viewer:

```bash
pip install -r demo/requirements.txt          # + warpconvnet (compiled)
python demo/demo_viser.py --port 8080          # uses a bundled sample point cloud
# your own scene + vocabulary:
python demo/demo_viser.py --ply my_scene.ply --class-names "chair" "table" "lamp" "other"
```

Open the printed `http://localhost:8080`; each predicted instance is shown in a distinct color.
A headless CLI (`demo/inference.py`) and a Gradio app (`demo/app.py`) are also included.

## Intended use & limitations

- **Intended:** research on open-vocabulary 3D scene understanding; segmenting indoor RGB
  point clouds (ScanNet-like) against custom class vocabularies.
- **Open-vocab mAP is semantics-bottlenecked:** rare/fine-grained classes are weaker than
  head classes; class-agnostic mask recall is higher than the open-vocab mAP.
- **Domain:** trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D)
  and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very
  different sensor domains are out of distribution.
- **Large scenes:** very large clouds can exceed memory in the eval forward; the
  inference code skips such a scene (single-process) rather than crashing.
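
The skip-on-overflow behavior in the last bullet can be sketched as a small wrapper. This is a generic pattern, not the demo's actual code: `forward_or_skip` is a hypothetical name, and the check relies on PyTorch out-of-memory errors being `RuntimeError`s whose message contains "out of memory".

```python
from typing import Any, Callable, Optional


def forward_or_skip(forward: Callable[[Any], Any], scene: Any) -> Optional[Any]:
    """Run forward(scene); return None instead of raising on an out-of-memory error."""
    try:
        return forward(scene)
    except RuntimeError as err:
        if "out of memory" not in str(err).lower():
            raise  # unrelated failure: do not hide it
        return None


# Stand-ins for a model forward pass, used only to exercise the wrapper.
def small_scene_forward(scene):
    return {"num_points": len(scene)}


def huge_scene_forward(scene):
    raise RuntimeError("CUDA out of memory. Tried to allocate more than available.")


ok_result = forward_or_skip(small_scene_forward, [0, 1, 2])
skipped = forward_or_skip(huge_scene_forward, [0, 1, 2])
```

In a real PyTorch loop it is usually worth calling `torch.cuda.empty_cache()` after a skipped scene, and logging which scene was dropped so reported metrics are interpreted correctly.
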

## Files

- `spaceformer_512_siglip2_ssccc.ckpt`: weights-only Lightning `state_dict` (487
  tensors; `net.*` decoder/backbone + `caption_loss.logit_scale`). Load via
  `load_spaceformer_checkpoint` (strips the `net.` prefix, `strict=False`).
- `spaceformer_512_siglip2_ssccc.ckpt.provenance.json`: architecture, eval numbers, md5.
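
To illustrate what "strips the `net.` prefix" means for a Lightning-style `state_dict`, here is a small pure-Python sketch. It is not the implementation of `load_spaceformer_checkpoint`: the helper name `strip_prefix`, the example key names, and the choice to drop keys outside `net.*` are assumptions made for illustration.

```python
def strip_prefix(state_dict: dict, prefix: str = "net.") -> dict:
    """Keep only keys that start with `prefix` and remove the prefix from them."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Hypothetical keys; plain floats stand in for tensors.
lightning_state = {
    "net.backbone.conv.weight": 0.1,
    "net.decoder.query_embed": 0.2,
    "caption_loss.logit_scale": 4.6,
}
model_state = strip_prefix(lightning_state)
```

The resulting mapping is the kind of object one would pass to PyTorch's `module.load_state_dict(model_state, strict=False)`, where `strict=False` tolerates missing or unexpected keys.
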

## License & usage

**These weights are released for non-commercial research use only, under
[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/).** They are a derivative
of datasets governed by non-commercial research Terms of Use, so they are **not** released
under the permissive Apache-2.0 license that covers the *code*.

The model was trained on the following datasets, each of which restricts use to
**non-commercial research/education** under its own terms. By using these weights you
agree to comply with all of them:

- **ScanNet / ScanNet200**: [ScanNet Terms of Use](http://kaldir.vc.in.tum.de/scannet/ScanNet_TOS.pdf)
- **ScanNet++**: [ScanNet++ Terms of Use](https://kaldir.vc.in.tum.de/scannetpp/static/scannetpp-terms-of-use.pdf)
- **ARKitScenes**: [Apple ARKitScenes license](https://github.com/apple/ARKitScenes/blob/main/LICENSE) (non-commercial)
- **Matterport3D**: [Matterport3D Terms of Use](https://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf) (non-commercial academic)

Evaluation additionally used **Replica** ([Replica Research Terms](https://github.com/facebookresearch/Replica-Dataset/blob/main/LICENSE), non-commercial), zero-shot.

The accompanying **code** in [WarpConvNet](https://github.com/NVlabs/WarpConvNet) is
licensed separately under **Apache-2.0**.

> Note: this is not legal advice; for commercial use, consult the individual dataset
> licensors. Please also cite the datasets above and the SpaceFormer project.