SpaceFormer: Open-Vocabulary 3D Instance Segmentation
SpaceFormer performs proposal-free, open-vocabulary 3D instance segmentation.
A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs
on top of the WarpConvNet SpaCeFormer sparse
point backbone. A single forward pass over an RGB point cloud produces a fixed set of
query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP
feature against text embeddings of arbitrary class names (SigLIP2 text encoder, with
prompt ensembling). The vocabulary is chosen at inference time (it is not baked into the
weights), so the model can be queried with any label set.
Project page: https://nvlabs.github.io/SpaCeFormer/
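The labeling step described above can be illustrated with a short sketch. This is not the demo repo's code: the helper names, the `[C, P, D]` prompt layout, and the random toy embeddings (standing in for real 1152-d SigLIP2 text embeddings) are assumptions made for illustration. It averages the embeddings of several prompts per class (prompt ensembling), then assigns each query the class with the highest cosine similarity.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def ensemble_text_embeddings(prompt_embeds):
    """Average per-prompt embeddings into one unit vector per class.

    prompt_embeds: [C, P, D] array with P prompt templates per class
    (hypothetical layout, chosen for this sketch).
    """
    per_prompt = l2_normalize(prompt_embeds)        # normalize every prompt embedding
    return l2_normalize(per_prompt.mean(axis=1))    # [C, D], one vector per class

def label_queries(query_feats, class_embeds):
    """Pick the class with the highest cosine similarity for each query.

    query_feats: [Q, D]; class_embeds: [C, D] unit vectors.
    Returns (labels [Q], scores [Q]).
    """
    sim = l2_normalize(query_feats) @ class_embeds.T    # [Q, C] cosine similarities
    labels = sim.argmax(axis=1)
    return labels, sim[np.arange(sim.shape[0]), labels]

# Toy stand-ins for text embeddings: 3 classes, 4 prompts each, 16 dims.
rng = np.random.default_rng(0)
class_names = ["chair", "table", "lamp"]
prompt_embeds = rng.normal(size=(3, 4, 16))
class_embeds = ensemble_text_embeddings(prompt_embeds)

# Two fake query features placed close to "lamp" and "chair" respectively.
query_feats = np.stack([class_embeds[2], class_embeds[0]])
query_feats = query_feats + 0.01 * rng.normal(size=query_feats.shape)
labels, scores = label_queries(query_feats, class_embeds)
print([class_names[i] for i in labels], scores.round(3))
```

Because the class embeddings are computed only at inference time, changing the vocabulary only changes `class_embeds`; the query features stay the same.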
Model details
- Task: open-vocabulary 3D instance segmentation on RGB point clouds.
- Architecture: WarpConvNet SpaCeFormer backbone (mixed space/curve sparse attention U-Net, ssccc encoder) → proposal-free query decoder (hidden dim 512, 200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness + per-point mask + per-query CLIP heads. ~85.8M parameters.
- CLIP/text embedding: google/siglip2-so400m-patch14-224 (1152-d), used only at inference to embed class names; not stored in this checkpoint.
- Input: point coordinates in meters + RGB; voxelized internally at 2 cm.
- Naming: spaceformer_512_siglip2_ssccc = hidden dim 512 · SigLIP2 embedding · ssccc encoder attention (space, space, curve, curve, curve).
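To make the "voxelized internally at 2 cm" note concrete, here is a minimal sketch of grid voxelization. It is an assumption about what such a step generally looks like, not WarpConvNet's actual voxelizer: the function name `voxelize` and the choice to average features per voxel are hypothetical.

```python
import numpy as np

def voxelize(coord, feat, voxel_size=0.02):
    """Quantize points to a regular grid and average the features per voxel.

    coord: [N, 3] float positions in meters; feat: [N, C] features (e.g. RGB).
    Returns (voxel_index [M, 3] int64, voxel_feat [M, C], point_to_voxel [N]).
    """
    grid = np.floor(coord / voxel_size).astype(np.int64)    # integer voxel coordinates
    uniq, inverse = np.unique(grid, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                           # shape differs across NumPy versions
    sums = np.zeros((len(uniq), feat.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, feat)                          # accumulate features per voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None], inverse

# Three points: the first two fall into the same 2 cm voxel.
coord = np.array([[0.001, 0.001, 0.001],
                  [0.015, 0.010, 0.005],
                  [0.025, 0.000, 0.000]])
feat = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
voxel_index, voxel_feat, point_to_voxel = voxelize(coord, feat)
print(voxel_index, voxel_feat, point_to_voxel)
```

The `point_to_voxel` map is what would let per-voxel predictions be scattered back to the original points.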
Evaluation
Test-set mAP with the released recipe (prompt ensembling on, TTA off, default proposal-free post-processing):
| Benchmark | mAP | mAP50 | recall (class-agnostic) |
|---|---|---|---|
| ScanNet200 | 0.1265 | 0.210 | 0.756 |
| ScanNet++ | 0.2217 | - | - |
| Replica | 0.2644 | - | - |
How to use
The model lives in WarpConvNet as warpconvnet.models.spaceformer (the backbone needs
WarpConvNet's compiled CUDA extension: install a pre-built wheel or build from source).
It returns raw predictions; open-vocab labeling + mask post-processing live in the
demo repo / HuggingFace Space, not in WarpConvNet.
import torch
from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint
from huggingface_hub import hf_hub_download
device = torch.device("cuda")
ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")
net = build_spaceformer(device=device)
load_spaceformer_checkpoint(net, ckpt) # 487 tensors, strict=False
# coord [N,3] float meters; feat [N,3] RGB in [-1,1]; offset [0, N]
out = net({"coord": coord, "feat": feat, "offset": offset})
# raw outputs: {"logit":[B,Q,2], "mask":List[[N,Q]], "clip_feat":[B,Q,1152]}
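The snippet above assumes `coord`, `feat`, and `offset` already exist. A minimal sketch of building them for a single scene follows; the helper name `make_inputs`, the 8-bit color input, and the `int64` offset dtype are assumptions, so check them against the demo repo's loader before relying on them.

```python
import numpy as np

def make_inputs(xyz, rgb_uint8):
    """Build coord / feat / offset arrays for one scene.

    xyz: [N, 3] point positions in meters.
    rgb_uint8: [N, 3] colors in 0..255 (assumed input format).
    """
    coord = np.asarray(xyz, dtype=np.float32)
    # Map 0..255 to [-1, 1], the feat range stated in the snippet above.
    feat = np.asarray(rgb_uint8, dtype=np.float32) / 127.5 - 1.0
    # Single scene: [0, N], as in the comment above (dtype is an assumption).
    offset = np.array([0, coord.shape[0]], dtype=np.int64)
    return {"coord": coord, "feat": feat, "offset": offset}

# Five dummy points with black, white, and mid-gray channels.
batch = make_inputs(np.zeros((5, 3)), np.tile([0, 255, 128], (5, 1)))
print({k: (v.shape, v.dtype) for k, v in batch.items()})
```

Each array can then be moved to the GPU with `torch.from_numpy(v).to(device)` before calling `net`.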
To turn clip_feat into open-vocabulary labels (SigLIP2 text + prompt ensembling) and
clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space
(pipeline.py, clip_eval.py, text_encoder.py, postprocessing.py, labels.py),
e.g. its inference.py CLI or the Gradio app.py.
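For orientation, the kind of mask clean-up mentioned above (score filtering, min-points, NMS) can be sketched as follows. This is not the demo repo's postprocessing.py. Several things here are assumptions: that channel 1 of `logit` means "object", that `mask` holds logits (so `> 0` corresponds to a sigmoid above 0.5), and every threshold value.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_instances(logit, mask_logit, score_thresh=0.5, min_points=100, nms_iou=0.5):
    """Filter raw query outputs for one scene.

    logit: [Q, 2] objectness logits; mask_logit: [N, Q] per-point mask logits.
    Returns a list of (query index, score, boolean mask [N]), best score first.
    """
    score = softmax(logit)[:, 1]        # ASSUMPTION: channel 1 is the "object" class
    binary = mask_logit.T > 0.0         # [Q, N]; ASSUMPTION: mask values are logits
    kept = []
    for q in np.argsort(-score):        # visit queries from highest to lowest score
        if score[q] < score_thresh or binary[q].sum() < min_points:
            continue
        # Greedy mask NMS: drop a query that overlaps an already kept mask too much.
        duplicate = False
        for _, _, other in kept:
            inter = np.logical_and(binary[q], other).sum()
            union = np.logical_or(binary[q], other).sum()
            if inter / max(union, 1) >= nms_iou:
                duplicate = True
                break
        if not duplicate:
            kept.append((int(q), float(score[q]), binary[q]))
    return kept

# Toy scene: 6 points, 4 queries. Query 1 duplicates query 0; query 2 has low objectness.
logit = np.array([[0.0, 3.0], [0.0, 2.0], [3.0, 0.0], [0.0, 1.0]])
mask_logit = np.full((6, 4), -5.0)
mask_logit[:3, 0] = 5.0
mask_logit[:3, 1] = 5.0
mask_logit[3:, 2] = 5.0
mask_logit[3:, 3] = 5.0
kept = select_instances(logit, mask_logit, min_points=2)
print([(q, round(s, 3)) for q, s, _ in kept])
```

The surviving query indices are the ones whose `clip_feat` rows would then be compared against the text embeddings.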
Demo (run locally)
A small local demo lives under demo/. No GPU cloud or HF Space is needed; it runs
on your own machine (requires WarpConvNet with its compiled extension). It takes text
class names, runs segmentation, and shows the result in an interactive 3D
viser viewer:
pip install -r demo/requirements.txt # + warpconvnet (compiled)
python demo/demo_viser.py --port 8080 # uses a bundled sample point cloud
# your own scene + vocabulary:
python demo/demo_viser.py --ply my_scene.ply --class-names "chair" "table" "lamp" "other"
Open the printed http://localhost:8080; each predicted instance is shown in a distinct color.
A headless CLI (demo/inference.py) and a Gradio app (demo/app.py) are also included.
Intended use & limitations
- Intended: research on open-vocabulary 3D scene understanding; segmenting indoor RGB point clouds (ScanNet-like) against custom class vocabularies.
- Open-vocab mAP is semantics-bottlenecked: rare/fine-grained classes are weaker than head classes, and class-agnostic mask recall (0.756 on ScanNet200) is much higher than the open-vocab mAP (0.1265).
- Domain: trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D) and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very different sensor domains are out of distribution.
- Large scenes: very large clouds can exceed memory in the eval forward; the inference code skips such a scene (single-process) rather than crashing.
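The skip-on-out-of-memory behavior described in the last bullet can be approximated in your own loop. The sketch below is an assumption about the general pattern, not the inference code itself; `run_scenes` and `fake_forward` are hypothetical names. It relies on PyTorch raising CUDA out-of-memory errors as a `RuntimeError` subclass whose message contains "out of memory".

```python
def run_scenes(scenes, forward):
    """Run `forward` on each scene and skip the ones that run out of memory.

    scenes: iterable of (name, batch) pairs; forward: callable taking a batch.
    Returns (results dict keyed by scene name, list of skipped scene names).
    """
    results, skipped = {}, []
    for name, batch in scenes:
        try:
            results[name] = forward(batch)
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err).lower():
                raise
            skipped.append(name)
            # With a real model, torch.cuda.empty_cache() here releases cached blocks.
    return results, skipped

def fake_forward(num_points):
    """Stand-in for the network: fails on scenes above a made-up size limit."""
    if num_points > 1_000_000:
        raise RuntimeError("CUDA out of memory (simulated)")
    return num_points * 2

scenes = [("small", 10), ("huge", 10**9), ("medium", 100)]
results, skipped = run_scenes(scenes, fake_forward)
print(results, skipped)
```

Subsampling or cropping a skipped scene and retrying is a common follow-up, at the cost of changing what is being evaluated.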
Files
- spaceformer_512_siglip2_ssccc.ckpt: weights-only Lightning state_dict (487 tensors; net.* decoder/backbone + caption_loss.logit_scale). Load via load_spaceformer_checkpoint (strips the net. prefix, strict=False).
- spaceformer_512_siglip2_ssccc.ckpt.provenance.json: architecture, eval numbers, md5.
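To show what "strips the net. prefix" means in practice, here is a minimal sketch on a plain dictionary. It is not the implementation of load_spaceformer_checkpoint: the helper `strip_prefix`, the toy key names, and the choice to set aside keys outside `net.` are assumptions for illustration.

```python
def strip_prefix(state_dict, prefix="net."):
    """Keep the keys under `prefix` and drop the prefix from their names.

    Returns (stripped dict, list of keys that did not match the prefix).
    """
    stripped, ignored = {}, []
    for key, value in state_dict.items():
        if key.startswith(prefix):
            stripped[key[len(prefix):]] = value
        else:
            ignored.append(key)
    return stripped, ignored

# Toy checkpoint with made-up tensor names; values stand in for tensors.
ckpt = {
    "net.backbone.weight": 1,
    "net.decoder.query": 2,
    "caption_loss.logit_scale": 3,
}
weights, ignored = strip_prefix(ckpt)
print(sorted(weights), ignored)
```

Loading with `strict=False` then tolerates keys that are present in the module but missing from the stripped dictionary, and vice versa.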
License & usage
These weights are released for non-commercial research use only, under CC-BY-NC-4.0. They are a derivative of datasets governed by non-commercial research Terms of Use, so they are not released under the permissive Apache-2.0 license that covers the code.
The model was trained on the following datasets, each of which restricts use to non-commercial research/education under its own terms; by using these weights you agree to comply with all of them:
- ScanNet / ScanNet200: ScanNet Terms of Use
- ScanNet++: ScanNet++ Terms of Use
- ARKitScenes: Apple ARKitScenes license (non-commercial)
- Matterport3D: Matterport3D Terms of Use (non-commercial academic)
Evaluation additionally used Replica (Replica Research Terms, non-commercial), zero-shot.
The accompanying code in WarpConvNet is licensed separately under Apache-2.0.
Note: this is not legal advice; for commercial use, consult the individual dataset licensors. Please also cite the datasets above and the SpaceFormer project.