---
license: apache-2.0
library_name: pytorch
pipeline_tag: mask-generation
tags:
- 3d
- mesh
- 3d-part-segmentation
- sam2
- segmentation
- point-cloud
- geosam2
base_model: facebook/sam2.1-hiera-base-plus
language:
- en
---
# GeoSAM2
> Unleashing the Power of SAM2 for 3D Part Segmentation · CVPR 2026
<div align="center">

[Project Page](https://detailgen3d.github.io/GeoSAM2/) ·
[arXiv](https://arxiv.org/abs/2508.14036) ·
[Code](https://github.com/VAST-AI-Research/GeoSAM2) ·
[License](https://www.apache.org/licenses/LICENSE-2.0)

</div>
GeoSAM2 lifts [SAM2](https://github.com/facebookresearch/sam2) from images to
3D meshes. Given a multi-view rendering of a mesh and an interactive prompt
(a single 2D click or a 2D mask) on one view, it propagates a consistent
segmentation across all views and back-projects the result to per-face 3D
part labels.
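The back-projection step described above can be sketched as a per-face majority vote: each mesh face collects the 2D labels it received across the views in which it is visible, and its 3D label is the most frequent one. This is a simplified illustration of the idea, not the repository's actual implementation:

```python
from collections import Counter

def majority_vote(face_votes):
    """Assign each face the most frequent 2D label it received.

    face_votes: one list of per-view labels per face, e.g. the labels
    hit by that face's pixels in each view where it is visible.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in face_votes]

# Face 0 was labelled 1 in two views and 2 in one; face 1 mostly 2.
print(majority_vote([[1, 1, 2], [0, 2, 2, 2]]))  # -> [1, 2]
```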
This repository hosts the **pretrained inference checkpoint** (`geosam2.pt`).
Code, configs, and a small multi-view demo dataset live in the companion
GitHub repo: <https://github.com/VAST-AI-Research/GeoSAM2>.
## Model summary
| | |
|---|---|
| Task | Interactive 3D part segmentation on meshes via multi-view 2D propagation |
| Base model | [`facebook/sam2.1-hiera-base-plus`](https://huggingface.co/facebook/sam2.1-hiera-base-plus) |
| Architecture | SAM2 (Hiera-B+ image encoder + memory attention + mask decoder), plus a dedicated **position-map encoder** for 3D geometry, **feature fusion**, and **LoRA adapters** on the image and position-map encoders |
| Parameters | ~154 M (fp32: ~588 MB · bf16: ~294 MB) |
| Input modalities | 12 rendered views per mesh: color (`.webp`), depth (`.exr`), normal (`.webp`), camera metadata (`meta.json`) |
| Prompts | 2D point clicks or a 2D mask on any view |
| Output | Per-view 2D label maps and per-face 3D labels for the input mesh |
| Render config | 12 azimuthally spaced views at 1024×1024 from a fixed elevation |
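The render convention in the table amounts to cameras evenly spaced in azimuth at one elevation. A minimal sketch of those azimuth angles (the exact angles and elevation are defined by `geosam2_render.py`; the values below are illustrative assumptions):

```python
# 12 views evenly spaced over 360 degrees of azimuth, fixed elevation.
n_views = 12
azimuths_deg = [i * 360.0 / n_views for i in range(n_views)]
print(azimuths_deg)  # -> [0.0, 30.0, 60.0, ..., 330.0]
```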
## Quickstart
```bash
# 1. Clone the code
git clone https://github.com/VAST-AI-Research/GeoSAM2.git
cd GeoSAM2
pip install -r requirements.txt
pip install -e . # builds the optional CUDA op; set GEOSAM2_BUILD_CUDA=0 to skip
# 2. Download the checkpoint into ./ckpt
hf download VAST-AI/GeoSAM2 geosam2.pt --local-dir ckpt
# 3. Run the bundled demo (single-view point prompt -> 3D segmentation)
bash scripts/run_example.sh
```
Direct inference from a 2D mask:
```bash
python inference.py \
--data-root example/sample_00 \
--mask-path outputs/sample_00/2d_seg/mask_view0000.npy \
--mask-view 0 \
--postprocess-pa 0.02 \
--output-dir outputs/sample_00/3d_seg
```
See the [README](https://github.com/VAST-AI-Research/GeoSAM2#readme) for the
full usage guide, including rendering your own meshes with Blender.
## Files
| File | Size | Description |
|---|---|---|
| `geosam2.pt` | ~588 MB | Pretrained weights in `float32` (`{"model": state_dict}`). Default choice. |
| `geosam2-bf16.pt` | ~294 MB | Same weights cast to `bfloat16` for faster downloads and lower memory. Works with the standard code path: `load_state_dict` casts values back to the model's parameter dtype, so no extra steps are required. Expect a small per-weight reconstruction error of ≤ 0.015 versus the fp32 file. |
Both checkpoints are loaded by
`sam2.build_sam.build_sam2_video_predictor_geosam2` with the Hydra config
`sam2/configs/geosam2.yaml`. Pass the chosen file via `--sam2-checkpoint`
(or use the default `ckpt/geosam2.pt` path expected by the scripts).
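The ≤ 0.015 per-weight error quoted for the bf16 file follows from the format itself: bfloat16 keeps float32's 8-bit exponent but only 8 bits of significand, so each weight loses roughly 16 low mantissa bits. A stdlib-only sketch that emulates this truncation (a simplified round-toward-zero model of the cast, not the exact rounding PyTorch uses):

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate a bfloat16 round-trip by zeroing the low 16 bits
    of the float32 bit pattern (truncation, for illustration)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Values whose mantissa fits in 8 bits survive exactly; others pick up
# a small relative error on the order of 2**-8.
print(to_bf16(1.0))                 # -> 1.0 (exact)
print(abs(to_bf16(0.3) - 0.3))      # small, well under 0.015
```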
## Intended use
- **Intended**: interactive 3D part segmentation of single-object meshes for
research and content-creation tooling.
- **Out of scope**: scene-level segmentation, dynamic scenes, semantic
category prediction (the model produces instance-level part masks, not
semantic class labels), and safety-critical applications.
## Limitations
- Expects the 12-view rendering convention produced by `geosam2_render.py`;
arbitrary view counts or camera trajectories may degrade quality.
- The mesh must fit within the normalised cube used at render time
(`geosam2_render.py` handles this for the bundled samples).
- Performance on thin/wire-like geometry and on highly transparent surfaces
is still an open problem.
- The post-processing `--postprocess-pa` value sometimes needs hand-tuning
per mesh (`0.01`, `0.02`, `0.035` are useful starting points).
## License
Released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
The checkpoint is a derivative of Meta's
[SAM2](https://github.com/facebookresearch/sam2) (Apache 2.0); see the
[`NOTICE`](https://github.com/VAST-AI-Research/GeoSAM2/blob/main/NOTICE)
file in the code repo for attribution.
## Citation
```bibtex
@article{deng2025geosam2,
title = {GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation},
author = {Deng, Ken and Yang, Yunhan and Sun, Jingxiang and
Liu, Xihui and Liu, Yebin and Liang, Ding and Cao, Yan-Pei},
journal = {arXiv preprint arXiv:2508.14036},
year = {2025}
}
```