---
license: cc-by-nc-4.0
library_name: warpconvnet
pipeline_tag: image-segmentation
tags:
- 3d
- point-cloud
- instance-segmentation
- open-vocabulary
- scannet
- scannet200
- scannetpp
- replica
- spaceformer
- warpconvnet
---
# SpaceFormer: Open-Vocabulary 3D Instance Segmentation
**SpaceFormer** performs **proposal-free, open-vocabulary 3D instance segmentation**.
A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs
on top of the WarpConvNet [`SpaCeFormer`](https://github.com/NVlabs/WarpConvNet) sparse
point backbone. A single forward pass over an RGB point cloud produces a fixed set of
query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP
feature against text embeddings of **arbitrary class names** (SigLIP2 text encoder, with
prompt ensembling). The vocabulary is chosen at inference time rather than baked into the
weights, so the model can be queried with any label set.
Project page: https://nvlabs.github.io/SpaCeFormer/
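The labeling step described above can be sketched in a few lines of NumPy. This is only an illustration, not the demo repo's implementation: the random arrays stand in for real SigLIP2 text embeddings and per-query CLIP features, the helper names are hypothetical, and picking the class by plain cosine-similarity argmax is an assumption.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def ensemble_text_embeddings(prompt_embeds):
    """[C, P, D] -> [C, D]: normalize each prompt embedding, average over the
    P prompt templates of a class, then renormalize (prompt ensembling)."""
    return l2_normalize(l2_normalize(prompt_embeds).mean(axis=1))

def label_queries(query_feats, class_embeds):
    """query_feats [Q, D], class_embeds [C, D] -> (class index [Q], cosine score [Q])."""
    sims = l2_normalize(query_feats) @ class_embeds.T
    return sims.argmax(axis=1), sims.max(axis=1)

rng = np.random.default_rng(0)
class_names = ["chair", "table", "lamp"]

# Placeholder for text-encoder output: 3 classes x 4 prompt templates x 1152 dims.
prompt_embeds = rng.normal(size=(3, 4, 1152))
class_embeds = ensemble_text_embeddings(prompt_embeds)

# Placeholder query features: noisy copies of the "lamp" and "chair" embeddings.
query_feats = class_embeds[[2, 0]] + 0.01 * rng.normal(size=(2, 1152))

labels, scores = label_queries(query_feats, class_embeds)
print([class_names[i] for i in labels])  # ['lamp', 'chair']
```

Because the class embeddings are computed from whatever names are passed in, changing the vocabulary only changes `class_embeds`; the query features stay the same.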
## Model details
- **Task:** open-vocabulary 3D instance segmentation on RGB point clouds.
- **Architecture:** WarpConvNet `SpaCeFormer` backbone (mixed space/curve sparse
attention U-Net, `ssccc` encoder) → proposal-free query decoder (hidden dim 512,
200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness +
per-point mask + per-query CLIP heads. ~85.8M parameters.
- **CLIP/text embedding:** `google/siglip2-so400m-patch14-224` (1152-d), used only at
inference to embed class names; not stored in this checkpoint.
- **Input:** point coordinates in meters + RGB; voxelized internally at 2 cm.
- **Naming:** `spaceformer_512_siglip2_ssccc` = hidden dim 512 · SigLIP2 embedding ·
`ssccc` encoder attention (space, space, curve, curve, curve).
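To make the 2 cm voxelization in the input description concrete, here is a minimal NumPy sketch. It assumes the simplest reduction (keep the first point that falls into each voxel); the model's internal voxelization may reduce points differently, and `voxelize` is a hypothetical helper, not a WarpConvNet function.

```python
import numpy as np

def voxelize(coord, feat, voxel_size=0.02):
    """Keep one point per occupied voxel.

    coord: [N, 3] positions in meters, feat: [N, 3] per-point features.
    Returns the kept coordinates, features, and integer voxel indices.
    """
    grid = np.floor(coord / voxel_size).astype(np.int64)   # voxel index per point
    _, keep = np.unique(grid, axis=0, return_index=True)   # first point of each voxel
    keep = np.sort(keep)                                   # preserve input order
    return coord[keep], feat[keep], grid[keep]

coord = np.array([[0.000, 0.000, 0.000],
                  [0.005, 0.005, 0.005],   # same 2 cm voxel as the first point
                  [0.030, 0.000, 0.000]])  # next voxel along x
feat = np.zeros((3, 3))

vox_coord, vox_feat, vox_grid = voxelize(coord, feat)
print(len(coord), "->", len(vox_coord), "points")  # 3 -> 2 points
```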
## Evaluation
Test-set mAP with the released recipe (**prompt ensembling on, TTA off, default
proposal-free post-processing**):
| Benchmark | mAP | mAP50 | recall (class-agnostic) |
|---|---:|---:|---:|
| ScanNet200 | **0.1265** | 0.210 | 0.756 |
| ScanNet++ | 0.2217 | n/a | n/a |
| Replica | 0.2644 | n/a | n/a |
## How to use
The model lives in WarpConvNet as `warpconvnet.models.spaceformer` (the backbone needs
WarpConvNet's compiled CUDA extension: install a pre-built wheel or build from source).
It returns **raw** predictions; open-vocab labeling + mask post-processing live in the
demo repo / HuggingFace Space, not in WarpConvNet.
```python
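import torch
from huggingface_hub import hf_hub_download
from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint

device = torch.device("cuda")
ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")
net = build_spaceformer(device=device)
load_spaceformer_checkpoint(net, ckpt)  # 487 tensors, strict=False

# Placeholder inputs (random points); replace with your own scene.
# coord [N,3] float meters; feat [N,3] RGB in [-1,1]; offset [0, N]
N = 100_000
coord = torch.rand(N, 3, device=device) * 5.0
feat = torch.rand(N, 3, device=device) * 2.0 - 1.0
offset = torch.tensor([0, N], device=device)  # dtype and device are assumptions

with torch.no_grad():  # inference only, no gradients needed
    out = net({"coord": coord, "feat": feat, "offset": offset})
# raw outputs: {"logit":[B,Q,2], "mask":List[[N,Q]], "clip_feat":[B,Q,1152]}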
```
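The input layout in the snippet above (`coord`, `feat`, `offset`) can be built from a raw scan roughly as follows. This is a sketch under assumptions: the card only states that RGB lies in [-1, 1], so the `x / 127.5 - 1` mapping from 8-bit colors, the `int64` offset dtype, and the `prepare_inputs` helper are all illustrative.

```python
import numpy as np

def prepare_inputs(xyz_m, rgb_uint8):
    """Build the three arrays for a single scene.

    xyz_m:     [N, 3] positions in meters
    rgb_uint8: [N, 3] colors in 0..255
    """
    coord = np.asarray(xyz_m, dtype=np.float32)
    feat = np.asarray(rgb_uint8, dtype=np.float32) / 127.5 - 1.0  # 0..255 -> [-1, 1]
    offset = np.array([0, len(coord)], dtype=np.int64)            # one scene: [0, N]
    return coord, feat, offset

xyz = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.5]]
rgb = [[0, 128, 255], [255, 0, 64]]

coord, feat, offset = prepare_inputs(xyz, rgb)
print(feat.min(), feat.max(), offset.tolist())  # -1.0 1.0 [0, 2]
```

The arrays can then be moved to the GPU with `torch.from_numpy(...).to(device)` before calling the model.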
To turn `clip_feat` into open-vocabulary labels (SigLIP2 text + prompt ensembling) and
clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space
(`pipeline.py`, `clip_eval.py`, `text_encoder.py`, `postprocessing.py`, `labels.py`),
for example through its `inference.py` CLI or the Gradio `app.py`.
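For readers who want a feel for the mask clean-up mentioned above (NMS/min-points), here is a small NumPy sketch. It is not the code in `postprocessing.py`: the thresholds, the greedy mask-IoU NMS, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def filter_masks(mask_logits, scores, min_points=100, iou_thresh=0.5):
    """Greedy mask NMS with a minimum-size filter.

    mask_logits: [N, Q] per-point mask logits, scores: [Q] per-query confidence.
    Returns the indices of kept queries (highest score first) and the binary masks.
    """
    masks = mask_logits > 0.0                  # same as sigmoid(logit) > 0.5
    sizes = masks.sum(axis=0)
    min_points = max(int(min_points), 1)       # also guards against empty masks
    keep = []
    for q in np.argsort(-scores):              # visit queries by descending score
        if sizes[q] < min_points:
            continue                           # too few points: drop
        overlaps = False
        for k in keep:
            inter = np.logical_and(masks[:, q], masks[:, k]).sum()
            union = sizes[q] + sizes[k] - inter
            if inter / union > iou_thresh:
                overlaps = True                # duplicate of a better mask
                break
        if not overlaps:
            keep.append(int(q))
    return keep, masks

# Toy example: 10 points, 3 queries.
logits = -np.ones((10, 3))
logits[0:6, 0] = 1.0    # query 0 covers points 0..5
logits[0:5, 1] = 1.0    # query 1 covers points 0..4 (IoU 5/6 with query 0)
logits[9, 2] = 1.0      # query 2 covers a single point
scores = np.array([0.90, 0.80, 0.95])

kept, masks = filter_masks(logits, scores, min_points=2)
print(kept)  # [0]
```

Query 2 is dropped for being too small and query 1 is suppressed as a near-duplicate of query 0.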
## Demo (run locally)
A small local demo lives under [`demo/`](./demo). It needs no cloud GPU or HF Space and
runs on your own machine (requires WarpConvNet with its compiled extension). It takes text
class names, runs segmentation, and shows the result in an interactive 3D
[**viser**](https://viser.studio) viewer:
```bash
pip install -r demo/requirements.txt # + warpconvnet (compiled)
python demo/demo_viser.py --port 8080 # uses a bundled sample point cloud
# your own scene + vocabulary:
python demo/demo_viser.py --ply my_scene.ply --class-names "chair" "table" "lamp" "other"
```
Open the printed URL (`http://localhost:8080`); each predicted instance is drawn in a distinct color.
A headless CLI (`demo/inference.py`) and a Gradio app (`demo/app.py`) are also included.
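If your scene is not yet a `.ply` file, a colored point cloud can be written in ASCII PLY with the standard library alone. This is a sketch: the property layout below (`x y z red green blue`) is the common convention, but whether `demo_viser.py` expects exactly this layout is an assumption, and `write_ascii_ply` is a hypothetical helper.

```python
import os
import tempfile

def write_ascii_ply(path, xyz, rgb):
    """Write points (meters) and 8-bit colors as an ASCII PLY point cloud."""
    if len(xyz) != len(rgb):
        raise ValueError("xyz and rgb must have the same length")
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(xyz)}",
        "property float x",
        "property float y",
        "property float z",
        "property uchar red",
        "property uchar green",
        "property uchar blue",
        "end_header",
    ]
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(header) + "\n")
        for (x, y, z), (r, g, b) in zip(xyz, rgb):
            f.write(f"{x:.6f} {y:.6f} {z:.6f} {int(r)} {int(g)} {int(b)}\n")

xyz = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.25)]
rgb = [(255, 0, 0), (0, 128, 255)]
path = os.path.join(tempfile.mkdtemp(), "my_scene.ply")
write_ascii_ply(path, xyz, rgb)

with open(path, encoding="ascii") as f:
    lines = f.read().splitlines()
print(lines[2], "|", lines[-1])  # element vertex 2 | 1.000000 0.500000 0.250000 0 128 255
```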
## Intended use & limitations
- **Intended:** research on open-vocabulary 3D scene understanding; segmenting indoor RGB
point clouds (ScanNet-like) against custom class vocabularies.
- **Open-vocab mAP is semantics-bottlenecked:** rare/fine-grained classes are weaker than
  head classes, and class-agnostic mask recall (0.756 on ScanNet200) is much higher than
  the open-vocab mAP (0.1265).
- **Domain:** trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D)
and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very
different sensor domains are out of distribution.
- **Large scenes:** very large clouds can exceed memory in the eval forward; the
inference code skips such a scene (single-process) rather than crashing.
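The skip-instead-of-crash behavior for oversized scenes can be sketched as a small wrapper. This is an illustration, not the repo's inference code: `run_scenes` and `fake_forward` are hypothetical, and the built-in `MemoryError` stands in for the GPU error so the example runs without a GPU.

```python
def run_scenes(scenes, forward, oom_errors=(MemoryError,)):
    """Run `forward` on each scene; skip scenes that run out of memory.

    scenes: dict mapping scene name -> input
    forward: callable that may raise one of `oom_errors`
    Returns (results dict, list of skipped scene names).
    """
    results, skipped = {}, []
    for name, scene in scenes.items():
        try:
            results[name] = forward(scene)
        except oom_errors:
            skipped.append(name)   # record the scene and continue with the next one
    return results, skipped

def fake_forward(points):
    """Stand-in for the model: pretend anything above 5 points does not fit."""
    if len(points) > 5:
        raise MemoryError("scene too large")
    return len(points)

scenes = {"small": [0] * 3, "huge": [0] * 50}
results, skipped = run_scenes(scenes, fake_forward)
print(results, skipped)  # {'small': 3} ['huge']
```

With PyTorch, one would typically pass `oom_errors=(torch.cuda.OutOfMemoryError,)` and call `torch.cuda.empty_cache()` after a skipped scene.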
## Files
- `spaceformer_512_siglip2_ssccc.ckpt`: weights-only Lightning `state_dict` (487
tensors; `net.*` decoder/backbone + `caption_loss.logit_scale`). Load via
`load_spaceformer_checkpoint` (strips the `net.` prefix, `strict=False`).
- `spaceformer_512_siglip2_ssccc.ckpt.provenance.json`: architecture, eval numbers, md5.
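The prefix handling described for the checkpoint can be illustrated with plain dictionaries. This is a sketch of the idea only, not the body of `load_spaceformer_checkpoint`; the key names other than `caption_loss.logit_scale` are made up for the example.

```python
def strip_prefix(state_dict, prefix="net."):
    """Split a Lightning-style state_dict into model weights and extras.

    Keys starting with `prefix` are kept with the prefix removed; all other
    keys (for example loss parameters) are returned separately.
    """
    model_sd = {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}
    extras = [k for k in state_dict if not k.startswith(prefix)]
    return model_sd, extras

# Toy checkpoint: the values would be tensors in the real file.
ckpt_sd = {
    "net.backbone.weight": 1,        # hypothetical key
    "net.decoder.query_embed": 2,    # hypothetical key
    "caption_loss.logit_scale": 3,   # non-model entry named in this card
}

model_sd, extras = strip_prefix(ckpt_sd)
print(sorted(model_sd), extras)
# ['backbone.weight', 'decoder.query_embed'] ['caption_loss.logit_scale']
```

The stripped dictionary is what would be passed to `net.load_state_dict(model_sd, strict=False)` in PyTorch.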
## License & usage
**These weights are released for non-commercial research use only, under
[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/).** They are a derivative
of datasets governed by non-commercial research Terms of Use, so they are **not** released
under the permissive Apache-2.0 license that covers the *code*.
The model was trained on the following datasets, each of which restricts use to
**non-commercial research/education** under its own terms. By using these weights you
agree to comply with all of them:
- **ScanNet / ScanNet200**: [ScanNet Terms of Use](http://kaldir.vc.in.tum.de/scannet/ScanNet_TOS.pdf)
- **ScanNet++**: [ScanNet++ Terms of Use](https://kaldir.vc.in.tum.de/scannetpp/static/scannetpp-terms-of-use.pdf)
- **ARKitScenes**: [Apple ARKitScenes license](https://github.com/apple/ARKitScenes/blob/main/LICENSE) (non-commercial)
- **Matterport3D**: [Matterport3D Terms of Use](https://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf) (non-commercial academic)
Evaluation additionally used **Replica** ([Replica Research Terms](https://github.com/facebookresearch/Replica-Dataset/blob/main/LICENSE), non-commercial), zero-shot.
The accompanying **code** in [WarpConvNet](https://github.com/NVlabs/WarpConvNet) is
licensed separately under **Apache-2.0**.
> Note: this is not legal advice; for commercial use, consult the individual dataset
> licensors. Please also cite the datasets above and the SpaceFormer project.