---
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
---

# Model Card for SAM 2: Segment Anything in Images and Videos

Repository for SAM 2: Segment Anything in Images and Videos, a foundation model from Meta FAIR for promptable visual segmentation in images and videos. See the [SAM 2 paper](https://arxiv.org/abs/2408.00714) for more information.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6579e0eaa9e58aec614e9d97/XzEgSzh7osnlG2QcMjWB5.png)

## Model Details

### Model Description

SAM 2 (Segment Anything Model 2) is a foundation model developed by Meta FAIR for promptable visual segmentation across both images and videos. It extends the capabilities of the original SAM by introducing a memory-driven, streaming architecture that enables real-time, interactive segmentation and tracking of objects even as they change or temporarily disappear across video frames. SAM 2 achieves state-of-the-art segmentation accuracy with significantly improved speed and data efficiency, outperforming existing models for both images and videos.
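
Conceptually, the video path processes frames one at a time: each frame is embedded by an image encoder, conditioned on a memory bank of features from prompted and recently processed frames via memory attention, decoded into a mask, and the result is written back into the memory bank. The sketch below is a rough illustration of that control flow only; the names and signatures are placeholders, not the actual implementation.

```python
# Conceptual sketch of SAM 2's streaming video loop; the components are
# placeholder callables supplied by the caller, not the real SAM 2 modules.
def segment_video(frames, prompts, image_encoder, memory_attention, mask_decoder, memory_encoder):
    memory_bank = []  # memories from prompted and recently processed frames
    masks = []
    for t, frame in enumerate(frames):
        features = image_encoder(frame)                        # per-frame image embedding
        conditioned = memory_attention(features, memory_bank)  # attend to past memories
        mask = mask_decoder(conditioned, prompts.get(t))       # optional prompt for this frame
        memory_bank.append(memory_encoder(conditioned, mask))  # update the streaming memory
        masks.append(mask)
    return masks
```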

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** Meta FAIR (Meta AI Research), Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.
- **Shared by [optional]:** [Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)
- **Model type:** Transformer-based promptable visual segmentation model with streaming memory module for videos.
- **License:** Apache-2.0, BSD 3-Clause

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/facebookresearch/sam2
- **Paper [optional]:** https://arxiv.org/abs/2408.00714
- **Demo [optional]:** https://ai.meta.com/sam2/

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

SAM 2 is designed for:

- **Promptable segmentation:** select any object in an image or video using points, boxes, or masks as prompts (see the image-level sketch below).
- **Zero-shot segmentation:** performs strongly even on objects, image domains, or videos not seen during training.
- **Real-time, interactive applications:** track or segment objects across frames, refining the result with new prompts as needed.
- **Research and industrial applications:** precise object segmentation for video editing, robotics, AR, medical imaging, and more.
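
For image-level prompting, a minimal sketch with the 🤗 transformers classes might look like the following. It assumes `Sam2Processor`/`Sam2Model` follow the same point-prompt call pattern as the original SAM integration (the nesting of `input_points`/`input_labels` and the mask post-processing helper may differ between versions), and it reuses the `danelcsb/sam2.1_hiera_tiny` checkpoint from the video example further down.

```python
import torch
from PIL import Image
from transformers import Sam2Model, Sam2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Sam2Processor.from_pretrained("danelcsb/sam2.1_hiera_tiny")
model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to(device)

image = Image.open("./images/truck.jpg").convert("RGB")  # any RGB image
input_points = [[[500, 375]]]  # one click at (x, y); nesting assumed to mirror the SAM processor
input_labels = [[1]]           # 1 = positive click, 0 = negative click

inputs = processor(images=image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Post-processing helper name assumed from the original SAM processor; check the Sam2 docs.
masks = processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
```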

## Bias, Risks, and Limitations

Generalization limits: although SAM 2 is designed for zero-shot generalization, rare or unseen visual domains may still challenge its reliability.

### Recommendations

- Human-in-the-loop review is advised for critical use cases.
- Users should evaluate, and possibly retrain or fine-tune, SAM 2 for highly specific domains.
- Ethical and privacy considerations must be taken into account, especially in surveillance or other sensitive settings.

## How to Get Started with the Model

```python
import os

import numpy as np
from PIL import Image
from transformers import (
    Sam2Config,
    Sam2ImageProcessorFast,
    Sam2MaskDecoderConfig,
    Sam2MemoryAttentionConfig,
    Sam2MemoryEncoderConfig,
    Sam2Model,
    Sam2Processor,
    Sam2PromptEncoderConfig,
    Sam2VideoProcessor,
    Sam2VisionConfig,
)

image_processor = Sam2ImageProcessorFast()
video_processor = Sam2VideoProcessor()
processor = Sam2Processor(image_processor=image_processor, video_processor=video_processor)

sam2model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to("cuda")

# `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`
# replace it with the path to your own video frames
video_dir = "./videos/bedroom"

# scan all the JPEG frame names in this directory
frame_names = [
    p for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))

videos = []
for frame_name in frame_names:
    videos.append(Image.open(os.path.join(video_dir, frame_name)))
inference_state = processor.init_video_session(video=videos, inference_device="cuda")
inference_state.reset_inference_session()

ann_frame_idx = 0  # the frame index we interact with
ann_obj_id = 1  # give a unique id to each object we interact with (it can be any integer)
points = np.array([[210, 350]], dtype=np.float32)
# for labels, `1` means positive click and `0` means negative click
labels = np.array([1], np.int32)

# Let's add a positive click at (x, y) = (210, 350) to get started
inference_state = processor.process_new_points_or_box_for_video_frame(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    input_points=points,
    input_labels=labels
)
any_res_masks, video_res_masks = sam2model.infer_on_video_frame_with_new_inputs(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    consolidate_at_video_res=False,
)
```
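
Once masks have been produced for the annotated frame, a common next step is to visualize them. Below is a minimal sketch, assuming `video_res_masks` holds a tensor of mask logits per object at video resolution (the exact shape and dtype may differ between versions).

```python
import numpy as np
from PIL import Image

# Threshold the logits of the first object into a binary mask
# (assumed layout: one [H, W] logit map per object).
mask = (video_res_masks[0, 0] > 0.0).cpu().numpy().astype(bool)

frame = np.array(videos[ann_frame_idx].convert("RGB"), dtype=np.float32)
overlay = frame.copy()
overlay[mask] = 0.5 * overlay[mask] + 0.5 * np.array([255.0, 0.0, 0.0])  # tint the object red

Image.fromarray(overlay.astype(np.uint8)).save("frame0_overlay.png")
```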

## Training Details

### Training Data

SAM 2 was trained using a data engine that collected the largest known video segmentation dataset, SA-V (the Segment Anything Video dataset), through interactive human-model collaboration.

The collected annotations cover whole objects and object parts and are not restricted to particular semantic classes.

### Training Procedure

- **Preprocessing:** images and videos are annotated with masklets (spatio-temporal masks; a sketch of the concept follows this list), with prompts collected via human-model interaction loops.
- **Training regime:** standard transformer training routines with enhancements for real-time streaming, likely using mixed precision to scale to large datasets.
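
To make the masklet notion concrete, here is a purely illustrative way to represent one as a per-frame collection of binary masks for a single object; this is a sketch of the concept, not the SA-V annotation format.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Masklet:
    """Illustrative spatio-temporal mask for one object: frame index -> HxW boolean mask."""
    object_id: int
    masks: dict[int, np.ndarray] = field(default_factory=dict)

    def add_frame(self, frame_idx: int, mask: np.ndarray) -> None:
        self.masks[frame_idx] = mask.astype(bool)

    def frames(self) -> list[int]:
        return sorted(self.masks)
```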


## Evaluation


### Testing Data, Factors & Metrics

#### Testing Data

Evaluated on SA-V and other standard video and image segmentation benchmarks.

#### Metrics

- Segmentation accuracy (region and boundary quality: J&F, IoU, Dice)
- Speed/throughput (frames per second)
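
For reference, minimal implementations of the region metrics for a pair of binary masks might look like this (the video benchmarks below report J&F, where J is region IoU and F is a boundary measure; only the region metrics are sketched here).

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union (Jaccard index) between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / union if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    total = pred.sum() + gt.sum()
    return 2.0 * float(np.logical_and(pred, gt).sum()) / total if total else 1.0
```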

#### SAM 2.1 checkpoints

The table below shows the improved SAM 2.1 checkpoints released on September 29, 2024.
|      **Model**       | **Size (M)** |    **Speed (FPS)**     | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :------------------: | :----------: | :--------------------: | :-----------------: | :----------------: | :---------------: |
|   sam2.1_hiera_tiny  |     38.9     |          91.2          |        76.5         |        71.8        |       77.3        |
|   sam2.1_hiera_small |      46      |          84.8          |        76.6         |        73.5        |       78.3        |
|   sam2.1_hiera_base_plus|     80.8     |        64.1          |        78.2         |        73.7        |       78.2        |
|   sam2.1_hiera_large |    224.4     |          39.5          |        79.5         |        74.6        |       80.6        |

#### SAM 2 checkpoints

The original SAM 2 checkpoints, released on July 29, 2024, are shown below.

|      **Model**       | **Size (M)** |    **Speed (FPS)**     | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :------------------: | :----------: | :--------------------: | :-----------------: | :----------------: | :---------------: |
|   sam2_hiera_tiny    |     38.9     |          91.5          |        75.0         |        70.9        |       75.3        |
|   sam2_hiera_small   |      46      |          85.6          |        74.9         |        71.5        |       76.4        |
| sam2_hiera_base_plus |     80.8     |     64.8    |        74.7         |        72.8        |       75.8        |
|   sam2_hiera_large   |    224.4     | 39.7 |        76.0         |        74.6        |       79.8        |


### Results

- **Video segmentation:** higher accuracy with 3x fewer user prompts than prior approaches.
- **Image segmentation:** 6x faster and more accurate than the original SAM.

## Citation [optional]

**BibTeX:**

```bibtex
@article{ravi2024sam2,
  title={SAM 2: Segment Anything in Images and Videos},
  author={Nikhila Ravi and Valentin Gabeur and Yuan-Ting Hu and Ronghang Hu and Chaitanya Ryali and Tengyu Ma and Haitham Khedr and Roman R{\"a}dle and Chloe Rolland and Laura Gustafson and Eric Mintun and Junting Pan and Kalyan Vasudev Alwala and Nicolas Carion and Chao-Yuan Wu and Ross Girshick and Piotr Doll\'ar and Christoph Feichtenhofer},
  journal={arXiv preprint arXiv:2408.00714},
  year={2024}
}
```

**APA:**

Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., & Feichtenhofer, C. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.

## Model Card Authors [optional]

[Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)

## Model Card Contact

Meta FAIR (contact via support@segment-anything.com)