---
license: cc-by-nc-4.0
---

# Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Official model weights for the `Locate-3D` models and the `3D-JEPA` encoders.

## Locate 3D

Locate 3D is a model for localizing objects in 3D scenes from referring expressions like “the small coffee table between the sofa and the lamp.” Locate 3D sets a new state of the art on standard referential grounding benchmarks and generalizes robustly. Notably, Locate 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.

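To make "posed RGB-D frames" concrete, the sketch below lifts a single frame into a world-space point cloud using standard pinhole back-projection. It is an illustration only, not the preprocessing shipped with this project: the function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the 4x4 camera-to-world pose layout are assumptions, and plain Python lists are used to keep the example dependency-free.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, pose):
    """Apply a 4x4 camera-to-world pose (nested lists, row-major) to a point."""
    x, y, z = point
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in pose[:3])


def frame_to_points(depth_image, rgb_image, intrinsics, pose):
    """Lift one posed RGB-D frame to a list of (world_xyz, rgb) pairs."""
    fx, fy, cx, cy = intrinsics
    points = []
    for v, depth_row in enumerate(depth_image):
        for u, depth in enumerate(depth_row):
            if depth <= 0:  # treat non-positive depth as a missing reading
                continue
            cam_xyz = unproject_pixel(u, v, depth, fx, fy, cx, cy)
            points.append((camera_to_world(cam_xyz, pose), rgb_image[v][u]))
    return points


# Tiny 2x2 frame (hypothetical values): one pixel has no valid depth.
depth = [[1.0, 0.0], [2.0, 1.0]]
rgb = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
pose = [[1, 0, 0, 10.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1.0]]
cloud = frame_to_points(depth, rgb, intrinsics=(1.0, 1.0, 0.5, 0.5), pose=pose)
print(len(cloud), cloud[0])
```

Repeating this over a stream of frames and concatenating the results gives a scene-level point cloud. In a featurized pipeline, per-pixel features from a 2D model would be carried along in place of the RGB value.
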
## 3D-JEPA

3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds, is key to Locate 3D. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Masked prediction in latent space then serves as the pretext task for learning contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes.

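The idea of masked prediction in latent space can be sketched with a toy example: a context encoder sees the scene with some points masked out, a target encoder sees the full scene, and a predictor is scored on how well it recovers the target latents at the masked points. Everything below is a simplified stand-in and not the 3D-JEPA implementation. The linear "encoders", the zero mask token, the squared-error loss, and the EMA target update (common in JEPA-style methods) are all assumptions chosen for brevity.

```python
import random


def linear(vec, weight):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def encode(features, weight):
    """Toy encoder: per-point linear map plus the scene mean as shared context."""
    latents = [linear(f, weight) for f in features]
    dim = len(latents[0])
    mean = [sum(z[d] for z in latents) / len(latents) for d in range(dim)]
    return [[z[d] + mean[d] for d in range(dim)] for z in latents]


def masked_latent_loss(features, masked_idx, context_w, target_w, predictor_w):
    """Mean squared error between predicted and target latents at masked points."""
    masked = set(masked_idx)
    dim = len(features[0])
    # Context view: masked points are replaced by a zero "mask token".
    context_in = [[0.0] * dim if i in masked else f for i, f in enumerate(features)]
    context_latents = encode(context_in, context_w)
    # Target view: the target encoder sees the full, unmasked scene.
    target_latents = encode(features, target_w)
    total = 0.0
    for i in masked_idx:
        pred = linear(context_latents[i], predictor_w)
        total += sum((p - t) ** 2 for p, t in zip(pred, target_latents[i])) / dim
    return total / len(masked_idx)


def ema_update(target_w, context_w, momentum=0.99):
    """Move target weights toward context weights (assumed EMA schedule)."""
    return [[momentum * t + (1 - momentum) * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(target_w, context_w)]


# Hypothetical data: 8 points with 4-dim features standing in for lifted 2D features.
random.seed(0)
dim, num_points = 4, 8
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_points)]
masked_idx = random.sample(range(num_points), k=3)
loss = masked_latent_loss(features, masked_idx, identity, identity, identity)
print(f"masked points {sorted(masked_idx)}, loss {loss:.4f}")
```

In a real training loop the loss would be minimized with respect to the context encoder and predictor only, while the target encoder receives no gradient. The prediction target is a latent vector rather than raw input, which is what distinguishes this family of objectives from input-reconstruction approaches.
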
## Models

- **Locate-3D**: Locate-3D model trained on public referential grounding datasets
- **Locate-3D+**: Locate-3D model trained on public referential grounding datasets and the newly released Locate 3D Dataset
- **3D-JEPA**: Pre-trained SSL encoder for 3D understanding

## How to Use

For detailed instructions on how to load the encoder and integrate it into your downstream task, please refer to our [GitHub repository](https://github.com/facebookresearch/locate-3d).

## License

The majority of `locate-3d` is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: Pointcept is licensed under the MIT license.