This repository contains the model weights for **Concerto**, a novel approach fo
- **Inference:** [https://github.com/Pointcept/Concerto](https://github.com/Pointcept/Concerto)
## Models
The default models (concerto_large/base/small/tiny) are a pre-release version of our next work and can handle inputs without colors and normals; if colors and normals are not available, please set them to zeros. We pre-release these models for general public use because many tasks lack such information. The original Concerto model is `concerto_base_origin.pth`.
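The zero-filling convention above can be sketched as follows. This is a minimal illustration only: the helper name and the dict keys (`coord`, `color`, `normal`) are assumptions, not the actual Concerto API; see the inference repository linked above for the real input format.

```python
import numpy as np

def prepare_input(coords, colors=None, normals=None):
    """Assemble a point-cloud input dict (hypothetical helper, not the Concerto API).

    The default checkpoints accept inputs without colors/normals; per the note
    above, missing channels are simply filled with zeros.
    """
    n = coords.shape[0]
    if colors is None:
        colors = np.zeros((n, 3), dtype=np.float32)   # no RGB available
    if normals is None:
        normals = np.zeros((n, 3), dtype=np.float32)  # no normals available
    return {
        "coord": coords.astype(np.float32),
        "color": colors,
        "normal": normals,
    }
```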

| Model Size | Channels | Depths | Num Heads |
| --- | --- | --- | --- |
| Tiny | (16, 32, 64, 128, 256) | (1, 1, 1, 3, 1) | (1, 2, 4, 8, 16) |
| Small | (32, 64, 128, 256, 512) | (2, 2, 2, 6, 2) | (2, 4, 8, 16, 32) |
| Base | (48, 96, 192, 384, 512) | (3, 3, 3, 12, 3) | (3, 6, 12, 24, 32) |
| Large | (64, 128, 256, 512, 768) | (3, 3, 3, 12, 3) | (4, 8, 16, 32, 48) |
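For reference, the size table can be transcribed into a small config mapping. The keyword names below (`channels`, `depths`, `num_heads`) are illustrative only, not the exact constructor signature in the Concerto codebase. One property falls directly out of the numbers: every variant keeps a per-head width of 16 channels at every stage.

```python
# Transcription of the table above into plain Python (argument names are
# illustrative; check the Concerto repository for the real config format).
CONFIGS = {
    "tiny":  dict(channels=(16, 32, 64, 128, 256),  depths=(1, 1, 1, 3, 1),  num_heads=(1, 2, 4, 8, 16)),
    "small": dict(channels=(32, 64, 128, 256, 512), depths=(2, 2, 2, 6, 2),  num_heads=(2, 4, 8, 16, 32)),
    "base":  dict(channels=(48, 96, 192, 384, 512), depths=(3, 3, 3, 12, 3), num_heads=(3, 6, 12, 24, 32)),
    "large": dict(channels=(64, 128, 256, 512, 768), depths=(3, 3, 3, 12, 3), num_heads=(4, 8, 16, 32, 48)),
}

def head_dim(size):
    """Per-head channel width at each stage (channels // heads)."""
    cfg = CONFIGS[size]
    return tuple(c // h for c, h in zip(cfg["channels"], cfg["num_heads"]))
```

For example, `head_dim("large")` returns `(16, 16, 16, 16, 16)`; the same holds for all four variants, i.e. the channel count scales in lockstep with the head count.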
## Abstract
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition that combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto develops emergent spatial representations with superior fine-grained geometric and semantic consistency.