---
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
---

# Model Card for SAM 2: Segment Anything in Images and Videos

Repository for SAM 2: Segment Anything in Images and Videos, a foundation model from FAIR for promptable visual segmentation in images and videos. See the [SAM 2 paper](https://arxiv.org/abs/2408.00714) for more information.

![SAM 2 model diagram](https://github.com/facebookresearch/sam2/blob/main/assets/model_diagram.png?raw=true)

## Model Details

### Model Description

SAM 2 (Segment Anything Model 2) is a foundation model developed by Meta FAIR for promptable visual segmentation across both images and videos. It extends the original SAM with a memory-driven, streaming architecture that enables real-time, interactive segmentation and tracking of objects, even as they change appearance or temporarily disappear across video frames. SAM 2 achieves state-of-the-art segmentation accuracy with significantly better speed and data efficiency than existing models for both images and videos.
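
To make the streaming design concrete, here is a heavily simplified, illustrative sketch (not the actual SAM 2 implementation; all names and shapes are hypothetical): each frame is encoded once, the prediction is conditioned on a rolling memory bank of past frames, and the new prediction is written back into memory.

```python
from collections import deque

import torch


def encode_frame(frame: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for SAM 2's image encoder."""
    return frame.mean(dim=0, keepdim=True)  # (3, H, W) -> (1, H, W)


def predict_mask(feats: torch.Tensor, memory: deque) -> torch.Tensor:
    """Hypothetical stand-in for memory attention + mask decoder:
    condition the current frame's features on memories of past frames."""
    if memory:
        feats = feats + torch.stack(list(memory)).mean(dim=0)
    return (feats > feats.mean()).float()


memory_bank = deque(maxlen=6)     # rolling memory of recent frames
video = torch.rand(8, 3, 64, 64)  # dummy 8-frame video

for frame in video:               # streaming: one frame at a time
    feats = encode_frame(frame)
    mask = predict_mask(feats, memory_bank)
    memory_bank.append(feats * mask)  # fold the new prediction back into memory
```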

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** Meta FAIR (Meta AI Research). Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
- **Shared by:** [Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)
- **Model type:** Transformer-based promptable visual segmentation model with a streaming memory module for videos
- **License:** Apache-2.0, BSD 3-Clause

### Model Sources

- **Repository:** https://github.com/facebookresearch/sam2
- **Paper:** https://arxiv.org/abs/2408.00714
- **Demo:** https://ai.meta.com/sam2/

## Uses

### Direct Use

SAM 2 is designed for:

- **Promptable segmentation:** select any object in an image or video using points, boxes, or masks as prompts (see the example below).
- **Zero-shot segmentation:** performs strongly even on objects, image domains, or videos not seen during training.
- **Real-time, interactive applications:** track or segment objects across frames, allowing corrections and refinements with new prompts as needed.
- **Research and industrial applications:** facilitates precise object segmentation in video editing, robotics, AR, medical imaging, and more.
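
For image prompts, here is a hedged quickstart sketch with the 🤗 transformers integration. It assumes the `Sam2Processor`/`Sam2Model` prompt-and-post-process flow mirrors the original SAM classes; the image URL and click coordinates are placeholders, and the exact `input_points` nesting and `post_process_masks` signature may differ across library versions.

```python
import requests
import torch
from PIL import Image
from transformers import Sam2Model, Sam2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to(device)
processor = Sam2Processor.from_pretrained("danelcsb/sam2.1_hiera_tiny")

# Placeholder image; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# One positive click at pixel (x, y) = (450, 600).
# Nesting assumed: one image -> one point; your version may expect an extra object dimension.
inputs = processor(images=image, input_points=[[[450, 600]]], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Map low-resolution mask logits back to the original image size
# (signature assumed; check the processor docs for your version).
masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu())
```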

## Bias, Risks, and Limitations

**Generalization limits:** While designed for zero-shot generalization, SAM 2 may be less reliable on rare or unseen visual domains.

### Recommendations

- Human-in-the-loop review is advised for critical use cases.
- Users should evaluate, and if necessary retrain or fine-tune, SAM 2 for highly specific domains.
- Ethical and privacy considerations must be taken into account, especially in surveillance or other sensitive settings.

## How to Get Started with the Model

The snippet below initializes a video session and runs point-prompted segmentation on one annotated frame:

```python
import os

import numpy as np
from PIL import Image
from transformers import (
    Sam2ImageProcessorFast,
    Sam2Model,
    Sam2Processor,
    Sam2VideoProcessor,
)

image_processor = Sam2ImageProcessorFast()
video_processor = Sam2VideoProcessor()
processor = Sam2Processor(image_processor=image_processor, video_processor=video_processor)

sam2model = Sam2Model.from_pretrained("danelcsb/sam2.1_hiera_tiny").to("cuda")

# `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`.
# Point it at your own video frames.
video_dir = "./videos/bedroom"

# Scan the JPEG frame names in this directory and sort them by frame index.
frame_names = [
    p for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))

frames = [Image.open(os.path.join(video_dir, name)) for name in frame_names]
inference_state = processor.init_video_session(video=frames, inference_device="cuda")
inference_state.reset_inference_session()

ann_frame_idx = 0  # the frame index we interact with
ann_obj_id = 1  # a unique id for each object we interact with (any integer works)
points = np.array([[210, 350]], dtype=np.float32)
# For labels, `1` means a positive click and `0` means a negative click.
labels = np.array([1], np.int32)

# Add a positive click at (x, y) = (210, 350) to get started.
inference_state = processor.process_new_points_or_box_for_video_frame(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    input_points=points,
    input_labels=labels,
)
any_res_masks, video_res_masks = sam2model.infer_on_video_frame_with_new_inputs(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_ids=ann_obj_id,
    consolidate_at_video_res=False,
)
```
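
To sanity-check the click, the hedged snippet below (continuing the code above) overlays the returned mask on the annotated frame. It assumes `video_res_masks` holds mask logits shaped `(num_objects, 1, H, W)` at video resolution; adjust the indexing if the actual layout differs.

```python
import matplotlib.pyplot as plt

# Assumed layout: (num_objects, 1, H, W) mask logits; threshold at 0 for a binary mask.
mask = (video_res_masks[0, 0] > 0.0).cpu().numpy()

plt.imshow(frames[ann_frame_idx])
plt.imshow(mask, cmap="Reds", alpha=0.5 * mask)  # draw only where the mask is positive
plt.axis("off")
plt.show()
```

From here, further prompts on other frames can refine or correct the masklet interactively.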

## Training Details

### Training Data

SAM 2 was trained using a data engine that collected the largest known video segmentation dataset, SA-V (the Segment Anything Video dataset), through interactive human-model collaboration. The data focuses on whole objects and their parts and is not restricted to fixed semantic classes.

### Training Procedure

- **Preprocessing:** images and videos were processed into masklets (spatio-temporal masks); prompts were collected via human-model interaction loops.
- **Training regime:** standard transformer training routines with enhancements for real-time processing; likely mixed precision for scaling to large datasets.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluated on SA-V and other standard video and image segmentation benchmarks.

#### Metrics

- Segmentation accuracy: J&F (region similarity and contour accuracy) for videos, plus mask-overlap measures such as IoU and Dice.
- Speed/throughput: frames per second (FPS).
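
As a reference for the overlap metrics, a minimal NumPy sketch (illustrative only, not the official evaluation code):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    inter = np.logical_and(pred, gt).sum()
    return float(inter / union) if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient: 2*|A∩B| / (|A| + |B|)."""
    total = pred.sum() + gt.sum()
    inter = np.logical_and(pred, gt).sum()
    return float(2 * inter / total) if total else 1.0
```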

#### SAM 2.1 checkpoints

The table below lists the improved SAM 2.1 checkpoints, released on September 29, 2024.

| **Model**              | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :--------------------: | :----------: | :-------------: | :-----------------: | :----------------: | :---------------: |
| sam2.1_hiera_tiny      | 38.9         | 91.2            | 76.5                | 71.8               | 77.3              |
| sam2.1_hiera_small     | 46           | 84.8            | 76.6                | 73.5               | 78.3              |
| sam2.1_hiera_base_plus | 80.8         | 64.1            | 78.2                | 73.7               | 78.2              |
| sam2.1_hiera_large     | 224.4        | 39.5            | 79.5                | 74.6               | 80.6              |

#### SAM 2 checkpoints

The previous SAM 2 checkpoints, released on July 29, 2024, are listed below:

| **Model**             | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :-------------------: | :----------: | :-------------: | :-----------------: | :----------------: | :---------------: |
| sam2_hiera_tiny       | 38.9         | 91.5            | 75.0                | 70.9               | 75.3              |
| sam2_hiera_small      | 46           | 85.6            | 74.9                | 71.5               | 76.4              |
| sam2_hiera_base_plus  | 80.8         | 64.8            | 74.7                | 72.8               | 75.8              |
| sam2_hiera_large      | 224.4        | 39.7            | 76.0                | 74.6               | 79.8              |

### Results

- **Video segmentation:** higher accuracy with 3× fewer user prompts than prior approaches.
- **Image segmentation:** 6× faster and more accurate than the original SAM.

## Citation

**BibTeX:**

```bibtex
@article{ravi2024sam2,
  title={SAM 2: Segment Anything in Images and Videos},
  author={Nikhila Ravi and Valentin Gabeur and Yuan-Ting Hu and Ronghang Hu and Chaitanya Ryali and Tengyu Ma and Haitham Khedr and Roman R{\"a}dle and Chloe Rolland and Laura Gustafson and Eric Mintun and Junting Pan and Kalyan Vasudev Alwala and Nicolas Carion and Chao-Yuan Wu and Ross Girshick and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv preprint arXiv:2408.00714},
  year={2024}
}
```

**APA:**

Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., & Feichtenhofer, C. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.

## Model Card Authors

[Sangbum Choi](https://www.linkedin.com/in/daniel-choi-86648216b/) and [Yoni Gozlan](https://huggingface.co/yonigozlan)

## Model Card Contact

Meta FAIR (contact via support@segment-anything.com)