	---
	license: cc-by-nc-4.0
	library_name: warpconvnet
	pipeline_tag: image-segmentation
	tags:
	- 3d
	- point-cloud
	- instance-segmentation
	- open-vocabulary
	- scannet
	- scannet200
	- scannetpp
	- replica
	- spaceformer
	- warpconvnet
	---

	# SpaceFormer — Open-Vocabulary 3D Instance Segmentation

	SpaceFormer performs proposal-free, open-vocabulary 3D instance segmentation.
	A Mask2Former-style query decoder (learned queries + rotary position embeddings) runs
	on top of the WarpConvNet [`SpaCeFormer`](https://github.com/NVlabs/WarpConvNet) sparse
	point backbone. A single forward pass over an RGB point cloud produces a fixed set of
	query masks plus a per-query CLIP feature; each mask is labeled by comparing its CLIP
	feature against text embeddings of arbitrary class names (SigLIP2 text encoder, with
	prompt ensembling). The vocabulary is chosen at inference time — it is not baked into the
	weights — so the model can be queried with any label set.
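The labeling step described above can be sketched as follows. This is a minimal illustration, not the demo's actual code: it assumes prompt ensembling means averaging normalized text embeddings over several prompt templates, and that each query takes the class with the highest cosine similarity. The function names are hypothetical, and random vectors stand in for real SigLIP2 embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def ensemble_text_embeddings(prompt_embeds):
    """prompt_embeds: [C, P, D], embeddings of P prompt templates per class.

    Returns [C, D]: normalize each prompt embedding, average over the
    templates, then renormalize (assumed form of prompt ensembling).
    """
    return l2_normalize(l2_normalize(prompt_embeds).mean(axis=1))

def label_queries(clip_feat, class_embeds):
    """clip_feat: [Q, D] per-query features; class_embeds: [C, D] unit-norm.

    Returns the best class index [Q] and its cosine similarity [Q].
    """
    sim = l2_normalize(clip_feat) @ class_embeds.T  # [Q, C] cosine similarities
    return sim.argmax(axis=1), sim.max(axis=1)

# Toy data: random stand-ins for text embeddings (the real D would be 1152).
rng = np.random.default_rng(0)
class_names = ["chair", "table", "lamp"]
prompt_embeds = rng.normal(size=(len(class_names), 4, 16))
class_embeds = ensemble_text_embeddings(prompt_embeds)

# Fake query features: scaled, slightly noisy copies of "lamp" and "chair".
clip_feat = class_embeds[[2, 0]] * 3.0 + 0.01 * rng.normal(size=(2, 16))
idx, score = label_queries(clip_feat, class_embeds)
print([class_names[i] for i in idx])  # ['lamp', 'chair']
```

Because the class embeddings are computed from whatever names are passed in, changing the vocabulary only changes `class_embeds`; the query features stay the same.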

	Project page: https://nvlabs.github.io/SpaCeFormer/

	## Model details

	- Task: open-vocabulary 3D instance segmentation on RGB point clouds.
	- Architecture: WarpConvNet `SpaCeFormer` backbone (mixed space/curve sparse
	attention U-Net, `ssccc` encoder) → proposal-free query decoder (hidden dim 512,
	200 learned queries, RoPE cross/self-attention, 3 decoder iterations) → objectness +
	per-point mask + per-query CLIP heads. ~85.8M parameters.
	- CLIP/text embedding: `google/siglip2-so400m-patch14-224` (1152-d), used only at
	inference to embed class names; not stored in this checkpoint.
	- Input: point coordinates in meters + RGB; voxelized internally at 2 cm.
	- Naming: `spaceformer_512_siglip2_ssccc` = hidden dim 512 · SigLIP2 embedding ·
	`ssccc` encoder attention (space, space, curve, curve, curve).
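To make the 2 cm voxelization concrete, here is a rough NumPy sketch of grid voxelization. It is an assumption about the general technique, not WarpConvNet's internal implementation: points are binned by `floor(coord / 0.02)`, and coordinates and colors are averaged per occupied voxel. The `voxelize` helper is hypothetical.

```python
import numpy as np

def voxelize(coord, feat, voxel_size=0.02):
    """Average all points that fall into the same voxel.

    coord: [N, 3] float meters; feat: [N, C].
    Returns (vox_coord [V, 3], vox_feat [V, C], inverse [N]), where
    inverse[i] is the voxel index of input point i.
    """
    grid = np.floor(coord / voxel_size).astype(np.int64)  # integer voxel indices
    _, inverse, counts = np.unique(
        grid, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # keep a flat [N] index across NumPy versions
    n_vox = counts.shape[0]
    vox_coord = np.zeros((n_vox, coord.shape[1]), dtype=np.float64)
    vox_feat = np.zeros((n_vox, feat.shape[1]), dtype=np.float64)
    np.add.at(vox_coord, inverse, coord)  # sum coordinates per voxel
    np.add.at(vox_feat, inverse, feat)    # sum features per voxel
    return vox_coord / counts[:, None], vox_feat / counts[:, None], inverse

coord = np.array([
    [0.001, 0.001, 0.001],  # voxel (0, 0, 0)
    [0.015, 0.005, 0.019],  # voxel (0, 0, 0)
    [0.021, 0.000, 0.000],  # voxel (1, 0, 0)
    [1.005, 1.005, 1.005],  # voxel (50, 50, 50)
])
feat = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
vox_coord, vox_feat, inverse = voxelize(coord, feat)
print(vox_coord.shape[0], inverse.tolist())  # 3 [0, 0, 1, 2]
```

The `inverse` index is what lets a per-voxel result be mapped back to the original points, for example `point_mask = vox_mask[inverse]`.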

	## Evaluation

	Test-set mAP with the released recipe (**prompt ensembling on, TTA off, default
	proposal-free post-processing**):

	| Benchmark | mAP | mAP50 | recall (class-agnostic) |
	|---|---:|---:|---:|
	| ScanNet200 | 0.1265 | 0.210 | 0.756 |
	| ScanNet++ | 0.2217 | - | - |
	| Replica | 0.2644 | - | - |

	## How to use

	The model lives in WarpConvNet as `warpconvnet.models.spaceformer`. The backbone needs
	WarpConvNet's compiled CUDA extension, so install a pre-built wheel or build from source.
	The model returns raw predictions only; open-vocabulary labeling and mask post-processing
	live in the demo repo / HuggingFace Space, not in WarpConvNet.

	```python
	import torch
	from warpconvnet.models.spaceformer import build_spaceformer, load_spaceformer_checkpoint
	from huggingface_hub import hf_hub_download

	device = torch.device("cuda")
	ckpt = hf_hub_download("chrischoy/SpaCeFormer", "spaceformer_512_siglip2_ssccc.ckpt")

	net = build_spaceformer(device=device)
	load_spaceformer_checkpoint(net, ckpt) # 487 tensors, strict=False

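	# Inputs: coord [N,3] float meters; feat [N,3] RGB in [-1,1]; offset [0, N].
	# Random placeholders so the snippet runs; replace them with a real scene.
	# (Assumption: float32 coord/feat and an integer offset tensor on the same device.)
	N = 100_000
	coord = torch.rand(N, 3, device=device) * 5.0
	feat = torch.rand(N, 3, device=device) * 2.0 - 1.0
	offset = torch.tensor([0, N], device=device)

	out = net({"coord": coord, "feat": feat, "offset": offset})
	# raw outputs: {"logit":[B,Q,2], "mask":List[[N,Q]], "clip_feat":[B,Q,1152]}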
	```
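The input layout in the snippet above (`coord`, `feat`, `offset`) can be built from raw arrays roughly as follows. This is a sketch under assumptions: `prepare_inputs` is a hypothetical helper, colors are assumed to arrive as 8-bit RGB, and for more than one scene `offset` is assumed to hold cumulative point counts starting at 0. The card only shows the single-scene form `[0, N]`.

```python
import numpy as np

def prepare_inputs(scenes):
    """scenes: list of (xyz [N_i, 3] float meters, rgb [N_i, 3] uint8 in 0..255).

    Returns coord [N, 3] float32, feat [N, 3] float32 in [-1, 1], and
    offset [B + 1] int64 (assumed cumulative point counts, starting at 0).
    """
    coord = np.concatenate([xyz for xyz, _ in scenes]).astype(np.float32)
    rgb = np.concatenate([c for _, c in scenes]).astype(np.float32)
    feat = rgb / 127.5 - 1.0  # map 0..255 to -1..1
    counts = [len(xyz) for xyz, _ in scenes]
    offset = np.cumsum([0] + counts).astype(np.int64)
    return coord, feat, offset

# Two tiny fake scenes with 3 and 2 points.
scene_a = (np.zeros((3, 3)), np.array([[0, 0, 0], [255, 255, 255], [0, 255, 0]], dtype=np.uint8))
scene_b = (np.ones((2, 3)), np.array([[255, 0, 0], [0, 0, 255]], dtype=np.uint8))
coord, feat, offset = prepare_inputs([scene_a, scene_b])
print(offset.tolist())  # [0, 3, 5]
```

The resulting arrays could then be moved to the GPU with `torch.from_numpy(...).to(device)` before being passed to the network.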

	To turn `clip_feat` into open-vocabulary labels (SigLIP2 text + prompt ensembling) and
	clean up masks (NMS/min-points), use the inference pipeline in the demo repo / Space
	(`pipeline.py`, `clip_eval.py`, `text_encoder.py`, `postprocessing.py`, `labels.py`) —
	e.g. its `inference.py` CLI or the Gradio `app.py`.
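As an illustration of the mask clean-up mentioned above (NMS and a minimum point count), here is a small NumPy sketch. It is not the demo's `postprocessing.py`. It assumes masks are binarized by thresholding logits at 0, that per-query scores are already available, and that duplicates are removed greedily by mask IoU. The function name and the thresholds are placeholders.

```python
import numpy as np

def clean_masks(mask_logits, scores, min_points=50, iou_thresh=0.5):
    """Filter query masks by size, then suppress near-duplicates.

    mask_logits: [N, Q] per-point logits; scores: [Q] per-query confidence.
    Returns (kept query indices, boolean masks [K, N]).
    """
    masks = (mask_logits > 0).T  # [Q, N]; logit > 0 is sigmoid > 0.5
    sizes = masks.sum(axis=1)
    keep = []
    for q in np.argsort(-scores):  # visit queries from highest to lowest score
        if sizes[q] < min_points:
            continue  # too few points to be a plausible instance
        duplicate = False
        for k in keep:
            inter = np.logical_and(masks[q], masks[k]).sum()
            union = sizes[q] + sizes[k] - inter
            if inter / max(union, 1) > iou_thresh:
                duplicate = True  # overlaps a higher-scoring mask too much
                break
        if not duplicate:
            keep.append(int(q))
    return keep, masks[np.asarray(keep, dtype=np.int64)]

# Toy example: 10 points, 4 queries.
logits = -np.ones((10, 4))
logits[0:5, 0] = 1.0   # query 0 covers points 0..4
logits[0:4, 1] = 1.0   # query 1 nearly duplicates query 0 (IoU 0.8)
logits[5:10, 2] = 1.0  # query 2 covers points 5..9
logits[9, 3] = 1.0     # query 3 covers a single point
scores = np.array([0.9, 0.8, 0.7, 0.95])
keep, kept_masks = clean_masks(logits, scores, min_points=2)
print(keep)  # [0, 2]
```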

	## Demo (run locally)

	A small local demo lives under [`demo/`](./demo). It runs on your own machine, with no
	cloud GPU or HF Space needed, but requires WarpConvNet with its compiled extension. It
	takes text class names, runs segmentation, and shows the result in an interactive 3D
	[viser](https://viser.studio) viewer:

	```bash
	pip install -r demo/requirements.txt # + warpconvnet (compiled)
	python demo/demo_viser.py --port 8080 # uses a bundled sample point cloud
	# your own scene + vocabulary:
	python demo/demo_viser.py --ply my_scene.ply --class-names "chair" "table" "lamp" "other"
	```

	Open the printed URL (`http://localhost:8080` with `--port 8080`); each predicted instance is drawn in a distinct color.
	A headless CLI (`demo/inference.py`) and a Gradio app (`demo/app.py`) are also included.
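One simple way to give each instance a distinct color is sketched below. This is not necessarily what `demo_viser.py` does. The helper is hypothetical and assumes a per-point instance id array in which `-1` marks unassigned points.

```python
import colorsys
import numpy as np

def instance_colors(instance_id, background=(128, 128, 128)):
    """instance_id: [N] int, -1 for unassigned points. Returns [N, 3] uint8 RGB."""
    ids = np.unique(instance_id[instance_id >= 0])
    colors = np.tile(np.array(background, dtype=np.uint8), (len(instance_id), 1))
    for rank, inst in enumerate(ids):
        # Golden-ratio hue steps keep consecutive instances visually far apart.
        hue = (rank * 0.618033988749895) % 1.0
        r, g, b = colorsys.hsv_to_rgb(hue, 0.75, 0.95)
        colors[instance_id == inst] = (round(r * 255), round(g * 255), round(b * 255))
    return colors

instance_id = np.array([0, 0, 1, -1, 2])
colors = instance_colors(instance_id)
print(colors.shape, colors.dtype)  # (5, 3) uint8
```

The returned array has one RGB row per point, so it can be handed to any point cloud viewer that accepts per-point colors.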

	## Intended use & limitations

	- Intended: research on open-vocabulary 3D scene understanding; segmenting indoor RGB
	point clouds (ScanNet-like) against custom class vocabularies.
	- Open-vocabulary mAP is bottlenecked by semantics rather than mask quality: on ScanNet200,
	class-agnostic mask recall is 0.756 while open-vocabulary mAP is 0.1265, and rare or
	fine-grained classes are weaker than head classes.
	- Domain: trained on indoor scenes (ScanNet, ScanNet++, ARKitScenes, Matterport3D)
	and evaluated on ScanNet200 / ScanNet++ / Replica (Replica zero-shot); outdoor or very
	different sensor domains are out of distribution.
	- Large scenes: very large point clouds can run out of memory during the evaluation forward
	pass; the inference code (single-process) skips such a scene instead of crashing.
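The skip-instead-of-crash behavior for large scenes can be pictured with a generic pattern like the one below. It is a sketch, not the repository's inference code. `run_all` and its arguments are hypothetical, and the built-in `MemoryError` stands in for whichever out-of-memory exception the real forward pass raises.

```python
def run_all(scenes, forward, oom_errors=(MemoryError,)):
    """Run `forward` on each (name, data) pair, skipping scenes that run out of memory.

    Returns (results by scene name, list of skipped scene names).
    """
    results, skipped = {}, []
    for name, data in scenes:
        try:
            results[name] = forward(data)
        except oom_errors:
            skipped.append(name)  # record the scene and continue with the next one
    return results, skipped

def fake_forward(points):
    """Stand-in for a model forward pass that fails on 'large' inputs."""
    if len(points) > 3:
        raise MemoryError("scene too large")
    return len(points)

scenes = [("small", [1, 2]), ("huge", [1, 2, 3, 4, 5]), ("medium", [1, 2, 3])]
results, skipped = run_all(scenes, fake_forward)
print(results, skipped)  # {'small': 2, 'medium': 3} ['huge']
```

With PyTorch on CUDA, one would likely pass `oom_errors=(torch.cuda.OutOfMemoryError,)` and call `torch.cuda.empty_cache()` after a skipped scene.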

	## Files

	- `spaceformer_512_siglip2_ssccc.ckpt` — weights-only Lightning `state_dict` (487
	tensors; `net.*` decoder/backbone + `caption_loss.logit_scale`). Load via
	`load_spaceformer_checkpoint` (strips the `net.` prefix, `strict=False`).
	- `spaceformer_512_siglip2_ssccc.ckpt.provenance.json` — architecture, eval numbers, md5.
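The prefix stripping described for `load_spaceformer_checkpoint` amounts to a small dictionary transform. The sketch below shows the idea on a toy dictionary. `strip_prefix` is a hypothetical helper, not the library function, and the key `net.backbone.weight` is made up for illustration.

```python
def strip_prefix(state_dict, prefix="net."):
    """Split a checkpoint dict by key prefix.

    Returns (kept, dropped): `kept` maps keys that started with `prefix`
    (with the prefix removed) to their values; `dropped` lists the other keys.
    """
    kept = {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}
    dropped = [k for k in state_dict if not k.startswith(prefix)]
    return kept, dropped

# Toy stand-in for a Lightning state_dict (values would be tensors).
toy = {"net.backbone.weight": 1, "caption_loss.logit_scale": 2}
kept, dropped = strip_prefix(toy)
print(kept, dropped)  # {'backbone.weight': 1} ['caption_loss.logit_scale']
```

Keys outside the `net.` prefix, such as `caption_loss.logit_scale`, have no counterpart in the bare network, which is one plausible reason for loading with `strict=False`.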

	## License & usage

	**These weights are released for non-commercial research use only, under
	[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/).** They are a derivative
	of datasets governed by non-commercial research Terms of Use, so they are not released
	under the permissive Apache-2.0 license that covers the code.

	The model was trained on the following datasets, each of which restricts use to
	non-commercial research/education under its own terms — by using these weights you
	agree to comply with all of them:

	- ScanNet / ScanNet200 — [ScanNet Terms of Use](http://kaldir.vc.in.tum.de/scannet/ScanNet_TOS.pdf)
	- ScanNet++ — [ScanNet++ Terms of Use](https://kaldir.vc.in.tum.de/scannetpp/static/scannetpp-terms-of-use.pdf)
	- ARKitScenes — [Apple ARKitScenes license](https://github.com/apple/ARKitScenes/blob/main/LICENSE) (non-commercial)
	- Matterport3D — [Matterport3D Terms of Use](https://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf) (non-commercial academic)

	Evaluation additionally used Replica ([Replica Research Terms](https://github.com/facebookresearch/Replica-Dataset/blob/main/LICENSE), non-commercial), zero-shot.

	The accompanying code in [WarpConvNet](https://github.com/NVlabs/WarpConvNet) is
	licensed separately under Apache-2.0.

	> Note: this is not legal advice; for commercial use, consult the individual dataset
	> licensors. Please also cite the datasets above and the SpaceFormer project.