danelcsb committed
Commit 2cb6e9f · verified · 1 Parent(s): 57901c8

Update README.md

Files changed (1): README.md (+81 -97)

README.md CHANGED
 
# Model Card for SAM 2: Segment Anything in Images and Videos

Repository for SAM 2: Segment Anything in Images and Videos, a foundation model from Meta FAIR for promptable visual segmentation in images and videos. See the [SAM 2 paper](https://arxiv.org/abs/2408.00714) for more information.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6579e0eaa9e58aec614e9d97/XzEgSzh7osnlG2QcMjWB5.png)

## Model Details

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** Meta FAIR (Meta AI Research). Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.
- **Shared by:** [Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)
- **Model type:** Transformer-based promptable visual segmentation model with a streaming memory module for videos.
- **License:** Apache-2.0, BSD 3-Clause

### Model Sources
 
SAM 2 is designed for:

- Promptable segmentation: select any object in a video or image using points, boxes, or masks as prompts.
- Zero-shot segmentation: performs strongly even on objects, image domains, or videos not seen during training.
- Research and industrial applications: facilitates precise object segmentation in video editing, robotics, AR, medical imaging, and more.

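Each of these prompt types reduces to a small array. The sketch below is only an illustration of typical shapes: the point format matches the getting-started example further down, while the box and mask layouts are common SAM conventions, not a guaranteed API contract.

```python
import numpy as np

# Illustrative prompt encodings; shapes follow common SAM conventions, not a fixed API.
point_prompt = np.array([[210, 350]], dtype=np.float32)        # (num_points, 2) as (x, y) pixels
point_labels = np.array([1], dtype=np.int32)                   # 1 = positive click, 0 = negative click
box_prompt = np.array([100, 150, 400, 500], dtype=np.float32)  # (x_min, y_min, x_max, y_max)
mask_prompt = np.zeros((720, 1280), dtype=bool)                # a coarse binary mask used as a prompt
```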
 
 
 
 
 
 
 
 
 
 
 
 
 
## Bias, Risks, and Limitations

Generalization Limits: While designed for zero-shot generalization, rare or unseen visual domains may challenge model reliability.

## How to Get Started with the Model

```python
import os

import numpy as np
from PIL import Image
from transformers import (
    Sam2Config,
    Sam2ImageProcessorFast,
    Sam2MaskDecoderConfig,
    Sam2MemoryAttentionConfig,
    Sam2MemoryEncoderConfig,
    Sam2Model,
    Sam2Processor,
    Sam2PromptEncoderConfig,
    Sam2VideoProcessor,
    Sam2VisionConfig,
)

image_processor = Sam2ImageProcessorFast()
video_processor = Sam2VideoProcessor()
processor = Sam2Processor(image_processor=image_processor, video_processor=video_processor)

sam2model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to("cuda")

# `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`
# Replace it with your own video directory to try a different clip
video_dir = "./videos/bedroom"

# scan all the JPEG frame names in this directory and sort them by frame index
frame_names = [
    p for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))

videos = []
for frame_name in frame_names:
    videos.append(Image.open(os.path.join(video_dir, frame_name)))
inference_state = processor.init_video_session(video=videos, inference_device="cuda")
inference_state.reset_inference_session()

ann_frame_idx = 0  # the frame index we interact with
ann_obj_id = 1  # give a unique id to each object we interact with (it can be any integer)
points = np.array([[210, 350]], dtype=np.float32)
# for labels, `1` means positive click and `0` means negative click
labels = np.array([1], dtype=np.int32)

# Let's add a positive click at (x, y) = (210, 350) to get started
inference_state = processor.process_new_points_or_box_for_video_frame(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    input_points=points,
    input_labels=labels,
)
any_res_masks, video_res_masks = sam2model.infer_on_video_frame_with_new_inputs(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    consolidate_at_video_res=False,
)
```

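The returned masks can be thresholded and overlaid on the annotated frame to inspect the result. A minimal sketch, assuming `video_res_masks` is a tensor of per-object mask logits with shape `(num_objects, 1, height, width)`:

```python
import matplotlib.pyplot as plt

# Assumption: one mask logit map per object at video resolution.
mask = (video_res_masks[0, 0] > 0.0).cpu().numpy()  # binarize the first object's mask

plt.imshow(videos[ann_frame_idx])                # the frame we clicked on
plt.imshow(mask, alpha=0.5)                      # overlay the predicted mask
plt.scatter([210], [350], c="lime", marker="*")  # the positive click
plt.axis("off")
plt.show()
```
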
## Training Details

Training regime: standard transformer training routines with enhancements for real-time processing; likely mixed precision for scaling to large datasets.

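For readers who want to reproduce a comparable setup, a mixed-precision training step in PyTorch generally follows the pattern below. This is an illustrative sketch only; `model`, `mask_loss`, and `train_loader` are placeholders, not components shipped with SAM 2.

```python
import torch

# Illustrative mixed-precision loop; `model`, `mask_loss`, `train_loader` are placeholders.
scaler = torch.cuda.amp.GradScaler()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for images, target_masks in train_loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        pred_masks = model(images.to("cuda"))
        loss = mask_loss(pred_masks, target_masks.to("cuda"))
    scaler.scale(loss).backward()  # scale the loss so fp16 gradients do not underflow
    scaler.step(optimizer)         # unscales gradients, then applies the update
    scaler.update()
```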
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
## Evaluation

### Testing Data, Factors & Metrics

Evaluated on SA-V and other standard video and image segmentation benchmarks.

#### Metrics

Segmentation accuracy (IoU, Dice) and speed/throughput (frames per second).

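For reference, both mask metrics can be computed from binary masks in a few lines of NumPy (hypothetical helper functions, not part of the release):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    total = pred.sum() + gt.sum()
    return float(2 * np.logical_and(pred, gt).sum() / total) if total else 1.0
```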
 
#### SAM 2.1 checkpoints

The table below shows the improved SAM 2.1 checkpoints released on September 29, 2024.

| **Model** | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :--------------------: | :----------: | :-------------: | :-----------------: | :----------------: | :---------------: |
| sam2.1_hiera_tiny | 38.9 | 91.2 | 76.5 | 71.8 | 77.3 |
| sam2.1_hiera_small | 46 | 84.8 | 76.6 | 73.5 | 78.3 |
| sam2.1_hiera_base_plus | 80.8 | 64.1 | 78.2 | 73.7 | 78.2 |
| sam2.1_hiera_large | 224.4 | 39.5 | 79.5 | 74.6 | 80.6 |

#### SAM 2 checkpoints

The previous SAM 2 checkpoints released on July 29, 2024 can be found as follows:

| **Model** | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :--------------------: | :----------: | :-------------: | :-----------------: | :----------------: | :---------------: |
| sam2_hiera_tiny | 38.9 | 91.5 | 75.0 | 70.9 | 75.3 |
| sam2_hiera_small | 46 | 85.6 | 74.9 | 71.5 | 76.4 |
| sam2_hiera_base_plus | 80.8 | 64.8 | 74.7 | 72.8 | 75.8 |
| sam2_hiera_large | 224.4 | 39.7 | 76.0 | 74.6 | 79.8 |

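The FPS figures above come from the authors' benchmark setup; throughput on your own hardware can be estimated with a simple timing loop such as the hypothetical sketch below, where `segment_frame` is assumed to wrap per-frame inference:

```python
import time

def measure_fps(segment_frame, frames, warmup: int = 5) -> float:
    """Rough frames-per-second estimate for a per-frame inference callable."""
    for frame in frames[:warmup]:  # warm-up iterations (CUDA kernel compilation, caches)
        segment_frame(frame)
    start = time.perf_counter()    # for GPU inference, synchronize the device before timing
    for frame in frames[warmup:]:
        segment_frame(frame)
    return (len(frames) - warmup) / (time.perf_counter() - start)
```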
 
### Results

Video segmentation: higher accuracy with 3x fewer user prompts versus prior approaches.

Image segmentation: 6x faster and more accurate than the original SAM.

## Citation

**BibTeX:**

@article{ravi2024sam2,
  title={SAM 2: Segment Anything in Images and Videos},
  author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and Rädle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Dollár, Piotr and Feichtenhofer, Christoph},
  journal={arXiv preprint arXiv:2408.00714},
  year={2024}
}

**APA:**

Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., & Feichtenhofer, C. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.

## Model Card Authors

[Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)