--- |
|
|
pipeline_tag: graph-ml |
|
|
library_name: pytorch |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- 3d |
|
|
- point-cloud |
|
|
- self-supervised-learning |
|
|
--- |
|
|
|
|
|
# Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations |
|
|
|
|
|
This repository contains the model weights for **Concerto**, a novel approach for learning robust spatial representations presented in the paper [Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations](https://huggingface.co/papers/2510.23607). |
|
|
|
|
|
- **Paper:** [Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations](https://huggingface.co/papers/2510.23607) |
|
|
- **Project Page:** [https://pointcept.github.io/Concerto/](https://pointcept.github.io/Concerto/)
|
|
- **Codebase:** [https://github.com/Pointcept/Pointcept](https://github.com/Pointcept/Pointcept) |
|
|
- **Inference:** [https://github.com/Pointcept/Concerto](https://github.com/Pointcept/Concerto) |
|
|
|
|
|
## Models |
|
|
The default models (`concerto_large`, `concerto_base`, `concerto_small`, `concerto_tiny`) are a pre-release version of our next work and can handle input without color or normal attributes. We pre-release them for general public use because many tasks lack such information. The original Concerto model is `concerto_base_origin.pth`.
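
The checkpoints are standard PyTorch files, so they can be inspected before being wired into the Pointcept pipeline. The snippet below is a minimal sketch, assuming `concerto_base_origin.pth` has already been downloaded to the working directory; the wrapping keys (`state_dict`, `model`) are common conventions, not guaranteed by this release.

```python
import torch

# Load the checkpoint on CPU (assumes the file sits in the working directory).
ckpt = torch.load("concerto_base_origin.pth", map_location="cpu")

# Some checkpoints wrap the weights under a key such as "state_dict" or "model";
# fall back to the raw object if no wrapper is present.
if isinstance(ckpt, dict):
    state_dict = ckpt.get("state_dict", ckpt.get("model", ckpt))
else:
    state_dict = ckpt

# Print a few parameter names and shapes to verify the download.
for name, value in list(state_dict.items())[:10]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```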
|
|
|
|
|
## Abstract |
|
|
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency. |
|
|
|
|
|
## Usage |
|
|
For detailed installation, data preparation, training, and testing instructions, please refer to the [official codebase](https://github.com/Pointcept/Pointcept) and the [inference demo](https://github.com/Pointcept/Concerto).
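
For a quick start, the weights can also be fetched programmatically from the Hugging Face Hub before following those instructions. The sketch below uses `huggingface_hub`; the repo id is a placeholder and should be replaced with the actual id of this repository.

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo id: replace with the actual id of this model repository.
repo_id = "Pointcept/Concerto"
filename = "concerto_base_origin.pth"  # see the "Models" section above for alternatives

# Download the checkpoint into the local Hugging Face cache and load it on CPU.
ckpt_path = hf_hub_download(repo_id=repo_id, filename=filename)
weights = torch.load(ckpt_path, map_location="cpu")
print(f"Loaded {filename} from {repo_id} ({ckpt_path})")
```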
|
|
|
|
|
## Citation |
|
|
If you find Concerto or the Pointcept codebase useful in your research, please cite the following papers: |
|
|
|
|
|
```bibtex |
|
|
@misc{pointcept2023, |
|
|
title={Pointcept: A Codebase for Point Cloud Perception Research}, |
|
|
author={Pointcept Contributors}, |
|
|
howpublished = {\url{https://github.com/Pointcept/Pointcept}}, |
|
|
year={2023} |
|
|
} |
|
|
|
|
|
@article{zhang2025concerto, |
|
|
title={Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations}, |
|
|
author={Zhang, Yujia and Wu, Xiaoyang and Lao, Yixing and Wang, Chengyao and Tian, Zhuotao and Wang, Naiyan and Zhao, Hengshuang}, |
|
|
journal={Conference on Neural Information Processing Systems}, |
|
|
year={2025}, |
|
|
} |
|
|
``` |