---
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
---
# Model Card for SAM 2: Segment Anything in Images and Videos
SAM 2 (Segment Anything in Images and Videos) is a foundation model from Meta FAIR for promptable visual segmentation in images and videos. See the [SAM 2 paper](https://arxiv.org/abs/2408.00714) for more information.

## Model Details
### Model Description
SAM 2 (Segment Anything Model 2) is a foundation model developed by Meta FAIR for promptable visual segmentation across both images and videos. It extends the capabilities of the original SAM by introducing a memory-driven, streaming architecture that enables real-time, interactive segmentation and tracking of objects even as they change or temporarily disappear across video frames. SAM 2 achieves state-of-the-art segmentation accuracy with significantly improved speed and data efficiency, outperforming existing models for both images and videos.
This is the model card of a 🤗 Transformers model that has been pushed to the Hub.
- **Developed by:** Meta FAIR (Meta AI Research), Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.
- **Shared by:** [Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)
- **Model type:** Transformer-based promptable visual segmentation model with streaming memory module for videos.
- **License:** Apache-2.0, BSD 3-Clause
### Model Sources
- **Repository:** https://github.com/facebookresearch/sam2
- **Paper:** https://arxiv.org/abs/2408.00714
- **Demo:** https://ai.meta.com/sam2/
## Uses
### Direct Use
SAM 2 is designed for:
- **Promptable segmentation:** select any object in an image or video using points, boxes, or masks as prompts (a minimal image-prompting sketch follows this list).
- **Zero-shot segmentation:** performs strongly even on objects, image domains, or videos not seen during training.
- **Real-time, interactive applications:** track or segment objects across frames, applying corrections and refinements with new prompts as needed.
- **Research and industrial applications:** precise object segmentation for video editing, robotics, AR, medical imaging, and more.
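For single-image use, the snippet below is a minimal sketch of point prompting with the 🤗 Transformers classes from this card. It assumes a SAM-style processor call and the `danelcsb/sam2.1_hiera_tiny` checkpoint used in the video example further down; the exact post-processing helper may differ between library versions.

```python
import torch
from PIL import Image
from transformers import Sam2Model, Sam2Processor

processor = Sam2Processor.from_pretrained("danelcsb/sam2.1_hiera_tiny")
model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to("cuda")

image = Image.open("truck.jpg")  # hypothetical local image
# One positive click at (x, y) = (500, 375); label 1 marks a positive click.
inputs = processor(
    images=image,
    input_points=[[[500, 375]]],
    input_labels=[[1]],
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

# `pred_masks` holds low-resolution mask logits; post-processing rescales
# them to the original image size (helper name/signature may vary by version).
masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu())
```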
## Bias, Risks, and Limitations
- **Generalization limits:** while designed for zero-shot generalization, rare or unseen visual domains may challenge model reliability.
### Recommendations
- Human-in-the-loop review is advised for critical use cases.
- Users should evaluate, and possibly retrain or fine-tune, SAM 2 for highly specific domains.
- Ethical and privacy considerations must be taken into account, especially in surveillance or other sensitive settings.
## How to Get Started with the Model
```python
import os

import numpy as np
from PIL import Image
from transformers import (
    Sam2ImageProcessorFast,
    Sam2Model,
    Sam2Processor,
    Sam2VideoProcessor,
)

image_processor = Sam2ImageProcessorFast()
video_processor = Sam2VideoProcessor()
processor = Sam2Processor(image_processor=image_processor, video_processor=video_processor)
sam2model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to("cuda")

# `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`.
# Point it at your own video frames here.
video_dir = "./videos/bedroom"

# Scan all the JPEG frame names in this directory and sort them by frame index.
frame_names = [
    p for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1].lower() in [".jpg", ".jpeg"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
video = [Image.open(os.path.join(video_dir, frame_name)) for frame_name in frame_names]

# Initialize a streaming inference session over the loaded frames.
inference_state = processor.init_video_session(video=video, inference_device="cuda")
inference_state.reset_inference_session()

ann_frame_idx = 0  # the frame index we interact with
ann_obj_id = 1  # a unique id for each object we interact with (any integer works)

# Let's add a positive click at (x, y) = (210, 350) to get started.
# For labels, `1` means positive click and `0` means negative click.
points = np.array([[210, 350]], dtype=np.float32)
labels = np.array([1], dtype=np.int32)
inference_state = processor.process_new_points_or_box_for_video_frame(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    input_points=points,
    input_labels=labels,
)

# Run the model on the annotated frame with the new inputs.
any_res_masks, video_res_masks = sam2model.infer_on_video_frame_with_new_inputs(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    consolidate_at_video_res=False,
)
```
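To inspect the result, you can overlay the predicted mask on the annotated frame. This is an illustrative sketch, assuming `video_res_masks` holds per-object mask logits at video resolution (roughly `[num_objects, 1, H, W]`); thresholding the logits at 0 gives a binary mask.

```python
import matplotlib.pyplot as plt

# Binarize the mask logits for the first (and only) object.
mask = (video_res_masks[0, 0] > 0.0).cpu().numpy()

plt.imshow(video[ann_frame_idx])  # the annotated frame
plt.imshow(mask, alpha=0.5)       # semi-transparent mask overlay
plt.title(f"Object {ann_obj_id} on frame {ann_frame_idx}")
plt.axis("off")
plt.show()
```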
## Training Details
### Training Data
- Trained using a data engine that collected the largest known video segmentation dataset, SA-V (Segment Anything Video dataset), via interactive human-model collaboration.
- Focused on whole objects and parts, not restricted to semantic classes.
### Training Procedure
- **Preprocessing:** images and videos are processed into masklets (spatio-temporal masks); prompts are collected via human-model interaction loops.
- **Training regime:** standard transformer training routines with enhancements for real-time processing; likely mixed precision for scaling to large datasets (see the generic sketch below).
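Since the SAM 2 training code is not part of this repository, the following is only a generic PyTorch mixed-precision step illustrating the kind of regime mentioned above; the model, optimizer, and loss are stand-ins.

```python
import torch

model = torch.nn.Linear(256, 1).to("cuda")  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for float16 training

def train_step(batch: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in float16 where safe; reductions stay in float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch), targets)
    scaler.scale(loss).backward()  # scale loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```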
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Evaluated on SA-V and other standard video and image segmentation benchmarks.
#### Metrics
- Segmentation accuracy: mask-overlap scores such as IoU and Dice, plus the J&F score (region similarity J combined with boundary accuracy F) used in the tables below; toy implementations of IoU and Dice follow.
- Speed/throughput: frames per second (FPS).
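For reference, here is a minimal sketch of the two overlap metrics on binary masks. These are illustrative textbook definitions, not code from the SAM 2 release.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(pred, gt).sum() / union)

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient: 2*|A ∩ B| / (|A| + |B|)."""
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0  # both masks empty
    return float(2.0 * np.logical_and(pred, gt).sum() / total)

# 4 px predicted, 4 px ground truth, 2 px overlap:
pred = np.zeros((4, 4), dtype=bool); pred[:2, :2] = True
gt = np.zeros((4, 4), dtype=bool); gt[:2, 1:3] = True
print(iou(pred, gt))   # 2 / 6 ≈ 0.333
print(dice(pred, gt))  # 4 / 8 = 0.5
```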
#### SAM 2.1 checkpoints
The table below shows the improved SAM 2.1 checkpoints released on September 29, 2024.
| **Model** | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :------------------: | :----------: | :--------------------: | :-----------------: | :----------------: | :---------------: |
| sam2.1_hiera_tiny | 38.9 | 91.2 | 76.5 | 71.8 | 77.3 |
| sam2.1_hiera_small | 46 | 84.8 | 76.6 | 73.5 | 78.3 |
| sam2.1_hiera_base_plus| 80.8 | 64.1 | 78.2 | 73.7 | 78.2 |
| sam2.1_hiera_large | 224.4 | 39.5 | 79.5 | 74.6 | 80.6 |
#### SAM 2 checkpoints
The previous SAM 2 checkpoints, released on July 29, 2024, are listed below:
| **Model** | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :------------------: | :----------: | :--------------------: | :-----------------: | :----------------: | :---------------: |
| sam2_hiera_tiny | 38.9 | 91.5 | 75.0 | 70.9 | 75.3 |
| sam2_hiera_small | 46 | 85.6 | 74.9 | 71.5 | 76.4 |
| sam2_hiera_base_plus | 80.8 | 64.8 | 74.7 | 72.8 | 75.8 |
| sam2_hiera_large | 224.4 | 39.7 | 76.0 | 74.6 | 79.8 |
### Results
- Video segmentation: higher accuracy with 3x fewer user prompts than prior approaches.
- Image segmentation: 6x faster and more accurate than the original SAM.
## Citation
**BibTeX:**
```bibtex
@article{ravi2024sam2,
  title={SAM 2: Segment Anything in Images and Videos},
  author={Nikhila Ravi and Valentin Gabeur and Yuan-Ting Hu and Ronghang Hu and Chaitanya Ryali and Tengyu Ma and Haitham Khedr and Roman R{\"a}dle and Chloe Rolland and Laura Gustafson and Eric Mintun and Junting Pan and Kalyan Vasudev Alwala and Nicolas Carion and Chao-Yuan Wu and Ross Girshick and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv preprint arXiv:2408.00714},
  year={2024}
}
```
**APA:**
Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., & Feichtenhofer, C. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.
## Model Card Authors
[Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)
## Model Card Contact
Meta FAIR (contact via support@segment-anything.com)