SpaceFormer — Open-Vocabulary 3D Instance Segmentation

SpaceFormer performs proposal-free, open-vocabulary 3D instance segmentation. A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs on top of the WarpConvNet SpaCeFormer sparse point backbone. A single forward pass over an RGB point cloud produces a fixed set of query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP feature against text embeddings of arbitrary class names (SigLIP2 text encoder, with prompt ensembling). The vocabulary is chosen at inference time — it is not baked into the weights — so the model can be queried with any label set.

Project page: https://nvlabs.github.io/SpaCeFormer/

Model details

Task: open-vocabulary 3D instance segmentation on RGB point clouds.
Architecture: WarpConvNet SpaCeFormer backbone (mixed space/curve sparse attention U-Net, ssccc encoder) → proposal-free query decoder (hidden dim 512, 200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness + per-point mask + per-query CLIP heads. ~85.8M parameters.
CLIP/text embedding: google/siglip2-so400m-patch14-224 (1152-d), used only at inference to embed class names; not stored in this checkpoint.
Input: point coordinates in meters + RGB; voxelized internally at 2 cm.
Naming: spaceformer_512_siglip2_ssccc = hidden dim 512 · SigLIP2 embedding · ssccc encoder attention (space, space, curve, curve, curve).
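
The naming scheme above can be unpacked mechanically. The sketch below is illustrative only: `parse_checkpoint_name` and `ATTENTION_KINDS` are hypothetical helpers, not part of WarpConvNet, and the code assumes every name follows the same four-field pattern as the released checkpoint.

```python
# Hypothetical helper, not part of WarpConvNet: decodes the naming scheme above.
ATTENTION_KINDS = {"s": "space", "c": "curve"}


def parse_checkpoint_name(name: str) -> dict:
    """Split '<model>_<hidden dim>_<embedding>_<encoder pattern>' into fields."""
    model, hidden_dim, embedding, pattern = name.split("_")
    return {
        "model": model,
        "hidden_dim": int(hidden_dim),
        "embedding": embedding,
        # One letter per entry of the encoder pattern: s = space, c = curve.
        "encoder_attention": [ATTENTION_KINDS[ch] for ch in pattern],
    }


info = parse_checkpoint_name("spaceformer_512_siglip2_ssccc")
print(info["hidden_dim"], info["encoder_attention"])
```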

Evaluation

Test-set results with the released recipe (prompt ensembling on, TTA off, default proposal-free post-processing):

Benchmark	mAP	mAP50	recall (class-agnostic)
ScanNet200	0.1265	0.210	0.756
ScanNet++	0.2217	—	—
Replica	0.2644	—	—

How to use

The model lives in WarpConvNet as warpconvnet.models.spaceformer. The backbone needs WarpConvNet's compiled CUDA extension, so install a pre-built wheel or build from source. The model returns raw predictions; open-vocabulary labeling and mask post-processing live in the demo repo / HuggingFace Space, not in WarpConvNet.

import torch
from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint
from huggingface_hub import hf_hub_download

device = torch.device("cuda")
ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")

net = build_spaceformer(device=device)
load_spaceformer_checkpoint(net, ckpt)          # 487 tensors, strict=False

# Inputs (prepared by the caller), for N points:
#   coord  [N, 3] float, in meters
#   feat   [N, 3] RGB scaled to [-1, 1]
#   offset [0, N]
out = net({"coord": coord, "feat": feat, "offset": offset})
# Raw outputs: {"logit": [B, Q, 2], "mask": List[[N, Q]], "clip_feat": [B, Q, 1152]}
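
The snippet above leaves the construction of coord, feat, and offset to the caller. Here is a minimal sketch of one way to prepare them for a single scene with 8-bit colors. The helper name `prepare_inputs` is hypothetical, the `rgb / 127.5 - 1` scaling is only an assumed way to reach the [-1, 1] range, and the dtypes are guesses; the loader in the demo repo is the reference.

```python
import numpy as np


def prepare_inputs(xyz_m: np.ndarray, rgb_u8: np.ndarray) -> dict:
    """Build coord/feat/offset arrays for one scene (hypothetical helper)."""
    if xyz_m.ndim != 2 or xyz_m.shape[1] != 3 or rgb_u8.shape != xyz_m.shape:
        raise ValueError("expected xyz and rgb arrays of shape [N, 3]")
    n = xyz_m.shape[0]
    coord = xyz_m.astype(np.float32)                  # meters, [N, 3]
    # Assumed scaling: map 0..255 linearly onto [-1, 1].
    feat = rgb_u8.astype(np.float32) / 127.5 - 1.0    # [N, 3]
    # Single scene: offset holds the start and end point index.
    offset = np.array([0, n], dtype=np.int64)
    return {"coord": coord, "feat": feat, "offset": offset}


# Stand-in data: a fake 5 m room with random colors.
rng = np.random.default_rng(0)
xyz = rng.uniform(0.0, 5.0, size=(1000, 3))
rgb = rng.integers(0, 256, size=(1000, 3), dtype=np.uint8)
batch = prepare_inputs(xyz, rgb)
```

The arrays would still need to be converted to tensors before the forward call, for example with `torch.from_numpy(...).to(device)`; whether the model expects them on the GPU already is an assumption here.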

To turn clip_feat into open-vocabulary labels (SigLIP2 text + prompt ensembling) and clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space (pipeline.py, clip_eval.py, text_encoder.py, postprocessing.py, labels.py) — e.g. its inference.py CLI or the Gradio app.py.
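
For orientation, the labeling step can be pictured as cosine similarity between each query's feature and one averaged text embedding per class. The sketch below is not the demo repo's code: the function names are hypothetical, random vectors stand in for SigLIP2 text embeddings and for `clip_feat`, and averaging normalized prompt embeddings is one common form of prompt ensembling that is only assumed to resemble the released recipe.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def ensemble_text_embeddings(per_prompt: np.ndarray) -> np.ndarray:
    """Average normalized prompt embeddings [C, P, D] into one unit vector per class [C, D]."""
    return l2_normalize(l2_normalize(per_prompt).mean(axis=1))


def label_queries(clip_feat: np.ndarray, class_emb: np.ndarray):
    """Give each query [Q, D] the class with the highest cosine similarity."""
    sim = l2_normalize(clip_feat) @ class_emb.T       # [Q, C]
    return sim.argmax(axis=1), sim.max(axis=1)


# Stand-in data: a real run would take clip_feat from the model output and
# per_prompt from a text encoder applied to templated class names.
rng = np.random.default_rng(0)
num_classes, num_prompts, dim, num_queries = 4, 3, 1152, 200
per_prompt = rng.normal(size=(num_classes, num_prompts, dim))
class_emb = ensemble_text_embeddings(per_prompt)

# Queries are noisy copies of known classes so the result can be checked.
true_cls = rng.integers(0, num_classes, size=num_queries)
clip_feat = class_emb[true_cls] + 0.01 * rng.normal(size=(num_queries, dim))
labels, scores = label_queries(clip_feat, class_emb)
```

Because the class list enters only through `class_emb`, swapping the vocabulary means re-encoding text, not touching the model weights.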

Demo (run locally)

A small local demo lives under demo/. It needs no cloud GPU or HF Space and runs on your own machine (WarpConvNet with its compiled extension is required). It takes text class names, runs segmentation, and shows the result in an interactive 3D viser viewer:

pip install -r demo/requirements.txt          # + warpconvnet (compiled)
python demo/demo_viser.py --port 8080         # uses a bundled sample point cloud
# your own scene + vocabulary:
python demo/demo_viser.py --ply my_scene.ply --class-names "chair" "table" "lamp" "other"

Open the printed URL (http://localhost:8080); each predicted instance is shown in a distinct color. A headless CLI (demo/inference.py) and a Gradio app (demo/app.py) are also included.

Intended use & limitations

Intended: research on open-vocabulary 3D scene understanding; segmenting indoor RGB point clouds (ScanNet-like) against custom class vocabularies.
Open-vocab mAP is semantics-bottlenecked: rare/fine-grained classes are weaker than head classes; class-agnostic mask recall is higher than the open-vocab mAP.
Domain: trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D) and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very different sensor domains are out of distribution.
Large scenes: very large clouds can exceed memory during the evaluation forward pass; the inference code skips such a scene (single-process) rather than crashing.

Files

spaceformer_512_siglip2_ssccc.ckpt — weights-only Lightning state_dict (487 tensors; net.* decoder/backbone + caption_loss.logit_scale). Load via load_spaceformer_checkpoint (strips the net. prefix, strict=False).
spaceformer_512_siglip2_ssccc.ckpt.provenance.json — architecture, eval numbers, md5.
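
Since the provenance file records an md5, a downloaded checkpoint can be checked against it. The sketch below assumes the hash is stored under a top-level `"md5"` key, which is a guess about the JSON layout; inspect the file and pass the right key if it differs. Both helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large checkpoint is not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_provenance(ckpt_path: Path, provenance_path: Path, key: str = "md5") -> bool:
    """Compare the checkpoint hash with the value stored under `key` (assumed field name)."""
    expected = json.loads(Path(provenance_path).read_text())[key]
    return file_md5(ckpt_path) == expected.lower()
```

Both paths would typically come from `hf_hub_download`, called once for the checkpoint and once for the provenance file.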

License & usage

These weights are released for non-commercial research use only, under CC-BY-NC-4.0. They are derived from datasets governed by non-commercial research Terms of Use, so they are not released under the permissive Apache-2.0 license that covers the code.

The model was trained on the following datasets, each of which restricts use to non-commercial research/education under its own terms — by using these weights you agree to comply with all of them:

ScanNet / ScanNet200 — ScanNet Terms of Use
ScanNet++ — ScanNet++ Terms of Use
ARKitScenes — Apple ARKitScenes license (non-commercial)
Matterport3D — Matterport3D Terms of Use (non-commercial academic)

Evaluation additionally used Replica (Replica Research Terms, non-commercial), zero-shot.

The accompanying code in WarpConvNet is licensed separately under Apache-2.0.

Note: this is not legal advice; for commercial use, consult the individual dataset licensors. Please also cite the datasets above and the SpaceFormer project.
