---
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
---

# Model Card for SAM 2: Segment Anything in Images and Videos

Repository for SAM 2: Segment Anything in Images and Videos, a foundation model from FAIR for promptable visual segmentation in images and videos. See the [SAM 2 paper](https://arxiv.org/abs/2408.00714) for more information.

![SAM 2 model diagram](https://github.com/facebookresearch/sam2/blob/main/assets/model_diagram.png?raw=true)

## Model Details

### Model Description

SAM 2 (Segment Anything Model 2) is a foundation model developed by Meta FAIR for promptable visual segmentation across both images and videos. It extends the original SAM with a memory-driven, streaming architecture that enables real-time, interactive segmentation and tracking of objects, even as they change appearance or temporarily disappear across video frames. SAM 2 achieves state-of-the-art segmentation accuracy with significantly better speed and data efficiency than existing models for both images and videos.
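
To make the streaming design concrete, here is a heavily simplified, illustrative sketch (not the actual SAM 2 implementation; all names and shapes are hypothetical): each frame is encoded once, the prediction is conditioned on a rolling memory bank of past frames, and the new prediction is written back into memory.

```python
from collections import deque

import torch


def encode_frame(frame: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for SAM 2's image encoder."""
    return frame.mean(dim=0, keepdim=True)  # (3, H, W) -> (1, H, W)


def predict_mask(feats: torch.Tensor, memory: deque) -> torch.Tensor:
    """Hypothetical stand-in for memory attention + mask decoder:
    condition the current frame's features on memories of past frames."""
    if memory:
        feats = feats + torch.stack(list(memory)).mean(dim=0)
    return (feats > feats.mean()).float()


memory_bank = deque(maxlen=6)     # rolling memory of recent frames
video = torch.rand(8, 3, 64, 64)  # dummy 8-frame video

for frame in video:               # streaming: one frame at a time
    feats = encode_frame(frame)
    mask = predict_mask(feats, memory_bank)
    memory_bank.append(feats * mask)  # fold the new prediction back into memory
```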

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** Meta FAIR (Meta AI Research). Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
- **Shared by:** [Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)
- **Model type:** Transformer-based promptable visual segmentation model with a streaming memory module for videos
- **License:** Apache-2.0, BSD 3-Clause

### Model Sources

- **Repository:** https://github.com/facebookresearch/sam2
- **Paper:** https://arxiv.org/abs/2408.00714
- **Demo:** https://ai.meta.com/sam2/

## Uses

### Direct Use

SAM 2 is designed for:

- **Promptable segmentation:** select any object in an image or video using points, boxes, or masks as prompts (see the example below).
- **Zero-shot segmentation:** performs strongly even on objects, image domains, or videos not seen during training.
- **Real-time, interactive applications:** track or segment objects across frames, allowing corrections and refinements with new prompts as needed.
- **Research and industrial applications:** facilitates precise object segmentation in video editing, robotics, AR, medical imaging, and more.
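
For image prompts, here is a hedged quickstart sketch with the 🤗 transformers integration. It assumes the `Sam2Processor`/`Sam2Model` prompt-and-post-process flow mirrors the original SAM classes; the image URL and click coordinates are placeholders, and the exact `input_points` nesting and `post_process_masks` signature may differ across library versions.

```python
import requests
import torch
from PIL import Image
from transformers import Sam2Model, Sam2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to(device)
processor = Sam2Processor.from_pretrained("danelcsb/sam2.1_hiera_tiny")

# Placeholder image; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# One positive click at pixel (x, y) = (450, 600).
# Nesting assumed: one image -> one point; your version may expect an extra object dimension.
inputs = processor(images=image, input_points=[[[450, 600]]], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Map low-resolution mask logits back to the original image size
# (signature assumed; check the processor docs for your version).
masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu())
```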

## Bias, Risks, and Limitations

**Generalization limits:** While designed for zero-shot generalization, SAM 2 may be less reliable on rare or unseen visual domains.

### Recommendations

- Human-in-the-loop review is advised for critical use cases.
- Users should evaluate, and if necessary retrain or fine-tune, SAM 2 for highly specific domains.
- Ethical and privacy considerations must be taken into account, especially in surveillance or other sensitive settings.

## How to Get Started with the Model

The snippet below initializes a video session and runs point-prompted segmentation on one annotated frame:

```python
import os

import numpy as np
from PIL import Image
from transformers import (
    Sam2ImageProcessorFast,
    Sam2Model,
    Sam2Processor,
    Sam2VideoProcessor,
)

image_processor = Sam2ImageProcessorFast()
video_processor = Sam2VideoProcessor()
processor = Sam2Processor(image_processor=image_processor, video_processor=video_processor)

sam2model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to("cuda")

# `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`.
# Point it at your own video frames.
video_dir = "./videos/bedroom"

# Scan the JPEG frame names in this directory and sort them by frame index.
frame_names = [
    p for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))

frames = [Image.open(os.path.join(video_dir, name)) for name in frame_names]
inference_state = processor.init_video_session(video=frames, inference_device="cuda")
inference_state.reset_inference_session()

ann_frame_idx = 0  # the frame index we interact with
ann_obj_id = 1  # a unique id for each object we interact with (any integer works)
points = np.array([[210, 350]], dtype=np.float32)
# For labels, `1` means a positive click and `0` means a negative click.
labels = np.array([1], np.int32)

# Add a positive click at (x, y) = (210, 350) to get started.
inference_state = processor.process_new_points_or_box_for_video_frame(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    input_points=points,
    input_labels=labels,
)
any_res_masks, video_res_masks = sam2model.infer_on_video_frame_with_new_inputs(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    consolidate_at_video_res=False,
)
```
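
To sanity-check the click, the hedged snippet below (continuing the code above) overlays the returned mask on the annotated frame. It assumes `video_res_masks` holds mask logits shaped `(num_objects, 1, H, W)` at video resolution; adjust the indexing if the actual layout differs.

```python
import matplotlib.pyplot as plt

# Assumed layout: (num_objects, 1, H, W) mask logits; threshold at 0 for a binary mask.
mask = (video_res_masks[0, 0] > 0.0).cpu().numpy()

plt.imshow(frames[ann_frame_idx])
plt.imshow(mask, cmap="Reds", alpha=0.5 * mask)  # draw only where the mask is positive
plt.axis("off")
plt.show()
```

From here, further prompts on other frames can refine or correct the masklet interactively.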

## Training Details

### Training Data

SAM 2 was trained using a data engine that collected the largest known video segmentation dataset, SA-V (the Segment Anything Video dataset), through interactive human-model collaboration. The data focuses on whole objects and their parts and is not restricted to fixed semantic classes.

### Training Procedure

- **Preprocessing:** images and videos were processed into masklets (spatio-temporal masks); prompts were collected via human-model interaction loops.
- **Training regime:** standard transformer training routines with enhancements for real-time processing; likely mixed precision for scaling to large datasets.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluated on SA-V and other standard video and image segmentation benchmarks.

#### Metrics

- Segmentation accuracy: J&F (region similarity and contour accuracy) for videos, plus mask-overlap measures such as IoU and Dice.
- Speed/throughput: frames per second (FPS).
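
As a reference for the overlap metrics, a minimal NumPy sketch (illustrative only, not the official evaluation code):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    inter = np.logical_and(pred, gt).sum()
    return float(inter / union) if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient: 2*|A∩B| / (|A| + |B|)."""
    total = pred.sum() + gt.sum()
    inter = np.logical_and(pred, gt).sum()
    return float(2 * inter / total) if total else 1.0
```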

#### SAM 2.1 checkpoints

The table below lists the improved SAM 2.1 checkpoints, released on September 29, 2024.

| **Model**              | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :--------------------: | :----------: | :-------------: | :-----------------: | :----------------: | :---------------: |
| sam2.1_hiera_tiny      | 38.9         | 91.2            | 76.5                | 71.8               | 77.3              |
| sam2.1_hiera_small     | 46           | 84.8            | 76.6                | 73.5               | 78.3              |
| sam2.1_hiera_base_plus | 80.8         | 64.1            | 78.2                | 73.7               | 78.2              |
| sam2.1_hiera_large     | 224.4        | 39.5            | 79.5                | 74.6               | 80.6              |

#### SAM 2 checkpoints

The previous SAM 2 checkpoints, released on July 29, 2024, are listed below:

| **Model**             | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :-------------------: | :----------: | :-------------: | :-----------------: | :----------------: | :---------------: |
| sam2_hiera_tiny       | 38.9         | 91.5            | 75.0                | 70.9               | 75.3              |
| sam2_hiera_small      | 46           | 85.6            | 74.9                | 71.5               | 76.4              |
| sam2_hiera_base_plus  | 80.8         | 64.8            | 74.7                | 72.8               | 75.8              |
| sam2_hiera_large      | 224.4        | 39.7            | 76.0                | 74.6               | 79.8              |

### Results

- **Video segmentation:** higher accuracy with 3× fewer user prompts than prior approaches.
- **Image segmentation:** 6× faster and more accurate than the original SAM.

## Citation

**BibTeX:**

```bibtex
@article{ravi2024sam2,
  title={SAM 2: Segment Anything in Images and Videos},
  author={Nikhila Ravi and Valentin Gabeur and Yuan-Ting Hu and Ronghang Hu and Chaitanya Ryali and Tengyu Ma and Haitham Khedr and Roman R{\"a}dle and Chloe Rolland and Laura Gustafson and Eric Mintun and Junting Pan and Kalyan Vasudev Alwala and Nicolas Carion and Chao-Yuan Wu and Ross Girshick and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv preprint arXiv:2408.00714},
  year={2024}
}
```

**APA:**

Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., & Feichtenhofer, C. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.

## Model Card Authors

[Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)

## Model Card Contact

Meta FAIR (contact via support@segment-anything.com)