Update README.md

# Model Card for SAM 2: Segment Anything in Images and Videos

Repository for SAM 2: Segment Anything in Images and Videos, a foundation model towards solving promptable visual segmentation in images and videos from FAIR. See the SAM 2 paper for more information.



## Model Details

SAM 2 (Segment Anything Model 2) is a foundation model developed by Meta FAIR for promptable visual segmentation in images and videos.

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** Meta FAIR (Meta AI Research). Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.
- **Shared by:** [Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)
- **Model type:** Transformer-based promptable visual segmentation model with a streaming memory module for videos.
- **License:** Apache-2.0, BSD 3-Clause

### Model Sources

- **Paper:** [SAM 2: Segment Anything in Images and Videos](https://arxiv.org/abs/2408.00714)

## Uses

SAM 2 is designed for:

- Promptable segmentation—select any object in a video or image using points, boxes, or masks as prompts (see the sketch after this list).
- Zero-shot segmentation—performs strongly even on objects, image domains, or videos not seen during training.
- Real-time, interactive applications—track or segment objects across frames.
- Research and industrial applications—facilitates precise object segmentation in video editing, robotics, AR, medical imaging, and more.
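
For single images, the same idea applies: encode the image once and pass point or box prompts. The snippet below is only a minimal sketch; it assumes `Sam2Processor`/`Sam2Model` accept `input_points`/`input_boxes` the way the original SAM integration in 🤗 Transformers does, so argument names and post-processing may differ in practice.

```python
# Hedged sketch of image-level prompting (not verified against the final Sam2 API);
# assumes the processor/model mirror the original SAM classes in transformers.
import torch
from PIL import Image
from transformers import Sam2ImageProcessorFast, Sam2Model, Sam2Processor, Sam2VideoProcessor

processor = Sam2Processor(
    image_processor=Sam2ImageProcessorFast(), video_processor=Sam2VideoProcessor()
)
model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to("cuda")

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
input_points = [[[210, 350]]]                     # one positive click (x, y)
input_boxes = [[[75, 275, 1725, 850]]]            # optional box prompt (x0, y0, x1, y1)

inputs = processor(
    images=image, input_points=input_points, input_boxes=input_boxes, return_tensors="pt"
).to("cuda")
with torch.no_grad():
    outputs = model(**inputs)  # mask logits for the prompted object
```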
## Bias, Risks, and Limitations
Generalization Limits: While designed for zero-shot generalization, rare or unseen visual domains may challenge model reliability.

Ethical and privacy considerations must be taken into account, especially in surveillance applications.

## How to Get Started with the Model

```python
import os

import numpy as np
from PIL import Image
from transformers import (
    Sam2Config,
    Sam2ImageProcessorFast,
    Sam2MaskDecoderConfig,
    Sam2MemoryAttentionConfig,
    Sam2MemoryEncoderConfig,
    Sam2Model,
    Sam2Processor,
    Sam2PromptEncoderConfig,
    Sam2VideoProcessor,
    Sam2VisionConfig,
)

image_processor = Sam2ImageProcessorFast()
video_processor = Sam2VideoProcessor()
processor = Sam2Processor(image_processor=image_processor, video_processor=video_processor)

sam2model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to("cuda")

# `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`;
# point it at your own video frames here.
video_dir = "./videos/bedroom"

# scan all the JPEG frame names in this directory
frame_names = [
    p for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))

videos = []
for frame_name in frame_names:
    videos.append(Image.open(os.path.join(video_dir, frame_name)))
inference_state = processor.init_video_session(video=videos, inference_device="cuda")
inference_state.reset_inference_session()

ann_frame_idx = 0  # the frame index we interact with
ann_obj_id = 1  # give a unique id to each object we interact with (it can be any integer)
points = np.array([[210, 350]], dtype=np.float32)
# for labels, `1` means positive click and `0` means negative click
labels = np.array([1], np.int32)

# Let's add a positive click at (x, y) = (210, 350) to get started
inference_state = processor.process_new_points_or_box_for_video_frame(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    input_points=points,
    input_labels=labels,
)
any_res_masks, video_res_masks = sam2model.infer_on_video_frame_with_new_inputs(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    consolidate_at_video_res=False,
)
```
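
The returned masks can be thresholded for quick inspection. The following is a rough sketch only and assumes `video_res_masks` is a tensor of mask logits shaped `(num_objects, 1, height, width)`; check the actual shape and dtype in your transformers version.

```python
# Sketch: turn the predicted mask logits into a binary mask and overlay it on the
# annotated frame. Assumes `video_res_masks` has shape (num_objects, 1, H, W).
import matplotlib.pyplot as plt

mask = (video_res_masks[0, 0] > 0.0).cpu().numpy()  # threshold logits at zero

plt.imshow(videos[ann_frame_idx])         # the frame we clicked on
plt.imshow(mask, alpha=0.5, cmap="Reds")  # semi-transparent predicted mask
plt.axis("off")
plt.show()
```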
## Training Details

Preprocessing: Images and videos processed into masklets (spatio-temporal masks).

Training regime: Used standard transformer training routines with enhancements for real-time processing; likely mixed precision for scaling to large datasets.
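
As a purely illustrative aside (this is not the SAM 2 training code), a mixed-precision training step in PyTorch usually looks like the sketch below, with forward computation in fp16/bf16 and a loss scaler protecting against underflow:

```python
# Generic PyTorch mixed-precision step, shown only to illustrate the technique
# mentioned above; it is NOT the actual SAM 2 training recipe.
import torch

model = torch.nn.Linear(256, 256).cuda()  # stand-in module, not a real segmentation model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

def train_step(batch: torch.Tensor, target: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscale gradients, then take the optimizer step
    scaler.update()
    return loss.item()
```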
## Evaluation
### Testing Data, Factors & Metrics
Evaluated on SA-V and other standard video and image segmentation benchmarks.

#### Metrics

Segmentation accuracy (IoU, Dice) and speed/throughput in frames per second. Video results below are reported as J&F, the mean of region similarity (J) and contour accuracy (F).
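
For reference, the overlap metrics reduce to a few lines of NumPy (this is an illustration, not the official evaluation code):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def mask_dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient: 2*|A∩B| / (|A| + |B|)."""
    total = pred.sum() + gt.sum()
    return float(2 * np.logical_and(pred, gt).sum() / total) if total else 1.0
```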

#### SAM 2.1 checkpoints

The table below shows the improved SAM 2.1 checkpoints released on September 29, 2024.

| **Model** | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :-------------------: | :----------: | :-------------: | :-----------------: | :----------------: | :---------------: |
| sam2.1_hiera_tiny | 38.9 | 91.2 | 76.5 | 71.8 | 77.3 |
| sam2.1_hiera_small | 46 | 84.8 | 76.6 | 73.5 | 78.3 |
| sam2.1_hiera_base_plus | 80.8 | 64.1 | 78.2 | 73.7 | 78.2 |
| sam2.1_hiera_large | 224.4 | 39.5 | 79.5 | 74.6 | 80.6 |

#### SAM 2 checkpoints

The previous SAM 2 checkpoints, released on July 29, 2024, can be found as follows:

| **Model** | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :-------------------: | :----------: | :-------------: | :-----------------: | :----------------: | :---------------: |
| sam2_hiera_tiny | 38.9 | 91.5 | 75.0 | 70.9 | 75.3 |
| sam2_hiera_small | 46 | 85.6 | 74.9 | 71.5 | 76.4 |
| sam2_hiera_base_plus | 80.8 | 64.8 | 74.7 | 72.8 | 75.8 |
| sam2_hiera_large | 224.4 | 39.7 | 76.0 | 74.6 | 79.8 |

### Results

Video segmentation: Higher accuracy with 3x fewer user prompts versus prior approaches.

Image segmentation: 6x faster and more accurate than the original SAM.
## Citation

**BibTeX:**

    @article{ravi2024sam2,
      title={SAM 2: Segment Anything in Images and Videos},
      author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
      journal={arXiv preprint arXiv:2408.00714},
      year={2024}
    }

**APA:**

Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., & Feichtenhofer, C. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.

## Model Card Authors

[Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)