Update README with SAM3-LiteText evaluation results
### EfficientSAM3: Progressive Hierarchical Knowledge Distillation (PhD) from SAM1, 2 and 3

[Chengxi Simon Zeng](https://simonzeng7108.github.io/about/)<sup>1,∗</sup>, [Yuxuan Jiang](https://YuxuanJJ.github.io/)<sup>1</sup>, [Ge Gao](https://scholar.google.com/citations?user=j2_80ewAAAAJ&hl=en)<sup>1</sup>, [Shuai Wang](https://shuaiwang97.github.io/)<sup>2</sup>, [Duolikun Danier](https://danier97.github.io/)<sup>3</sup>, [Bin Zhu](https://binzhubz.github.io/)<sup>4</sup>, [Stevan Rudinac](https://stevanrudinac.com/)<sup>2</sup>, [David Bull](https://david-bull.github.io/)<sup>1</sup>, [Fan Aaron Zhang](https://fan-aaron-zhang.github.io/)<sup>1,†</sup>

<sup>1</sup>Visual Information Lab, University of Bristol; <sup>2</sup>University of Amsterdam; <sup>3</sup>School of Informatics, University of Edinburgh; <sup>4</sup>Singapore Management University

<sup>∗</sup>Primary Contributor
<sup>†</sup>Corresponding Author

[📄 Paper](https://arxiv.org/abs/2511.15833) | [🌐 Project Page](https://simonzeng7108.github.io/efficientsam3/) | [🤗 Hugging Face](https://huggingface.co/Simon7108528/EfficientSAM3) | [💬 Discord](https://discord.gg/FMyaQca7xT)

---

## Updates

- **[2026/02/18]** **SAM3-LiteText** released! SAM3-LiteText reduces text encoder parameters by up to 88% with performance comparable to the original text encoder. [Paper](https://arxiv.org/abs/2602.12173) available on arXiv.
- **[2026/01/11]** Stage 1 geometry-prompt fine-tuned (**ft**) weights released/updated (image encoders on 1% SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
- **[2025/12/08]** Stage 1 text encoder weights released for all 3 variants (MobileCLIP S0, MobileCLIP S1, and MobileCLIP2-L), distilled on 1% of the Recap-DataComp-1B dataset.
- **[2025/12/02]** Stage 1 image encoder weights released for all 9 variants (RepViT, TinyViT, EfficientViT), distilled without supervision on 1% of the SA-1B dataset.
- **[2025/11/25]** Teaser model released. See above. More models are baking in the oven 🔥.
- **[2025/10/18]** Project announced. Code and weights will be published once the SAM3 code is publicly available.

---

## Table of Contents

- [Updates](#updates)
- [Table of Contents](#table-of-contents)
- [Installation](#installation)
- [Inference](#inference)
- [Training and Evaluation](#training-and-evaluation)
- [Datasets](#datasets)
- [EfficientSAM3 Model Zoo \& Weight Release](#efficientsam3-model-zoo--weight-release)
- [Preliminary Evaluation](#preliminary-evaluation)
- [CoreML / ONNX Export](#coreml--onnx-export)
- [Web Demo](#web-demo)
- [Development To-Do List](#development-to-do-list)
- [Call for Pull Requests](#call-for-pull-requests)
- [Citation](#citation)
- [License](#license)
- [Acknowledgments](#acknowledgments)
- [Users](#users)

---

[SAM3](https://github.com/facebookresearch/sam3) (Segment Anything Model 3) has introduced powerful **Promptable Concept Segmentation (PCS)** capabilities, enabling semantic understanding and temporal object tracking beyond traditional mask generation. However, SAM3's massive vision backbone and dense memory bank make it impractical for real-time, on-device applications where computational resources and latency are critical.

**EfficientSAM3** addresses this challenge by distilling SAM3's capabilities into lightweight architectures suitable for edge devices, enabling high-quality concept segmentation on mobile phones, embedded systems, and other resource-constrained platforms.

<p align="center">
  <img src="assets/efficientsam3_full.svg" alt="EfficientSAM3 Architecture" width="100%">
</p>

---

<details>
<summary>Supported Models and Architecture</summary>

| Component | Model/Backbone | Purpose |
|-----------|----------------|---------|
| **Teacher Models** | [SAM](https://github.com/facebookresearch/segment-anything) (Segment Anything Model) | Foundation for image-level encoder distillation |
| | [SAM2](https://github.com/facebookresearch/sam2) | Temporal memory and video tracking distillation |
| | [SAM3](https://github.com/facebookresearch/sam3) | Promptable Concept Segmentation (PCS) capabilities |
| **Datasets** | [SA-1B](https://ai.meta.com/datasets/segment-anything/) | Image segmentation dataset |
| | [SA-V](https://ai.meta.com/datasets/segment-anything-video/) | Video object segmentation dataset |
| | [SA-Co/Gold](https://huggingface.co/datasets/facebook/SACo-Gold) | Promptable concept segmentation benchmark |
| | [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) | Large-scale image-text dataset for text encoder distillation |
| **Student Backbones (Image)** | [RepViT](https://github.com/THU-MIG/RepViT) (M0.9, M1.1, M2.3) | Mobile-optimized Vision Transformer for highest throughput |
| | [TinyViT](https://github.com/wkcn/TinyViT) (5M, 11M, 21M) | Balanced efficiency and performance |
| | [EfficientViT](https://github.com/mit-han-lab/efficientvit) (B0, B1, B2) | Ultra-lightweight architectures for minimal latency |
| **Student Backbones (Text)** | [MobileCLIP](https://github.com/apple/ml-mobileclip) S0 | Lightweight text encoder (42.57M params) |
| | [MobileCLIP](https://github.com/apple/ml-mobileclip) S1 | Balanced text encoder (63.56M params) |
| | [MobileCLIP2](https://github.com/apple/ml-mobileclip) L | Larger text encoder (123.6M params) |

</details>

---

<details>
<summary>Three-Stage Progressive Training Curriculum</summary>

EfficientSAM3 is trained through a three-stage progressive distillation:

### Stage 1: Encoder Distillation (Image-Level Segmentation)

- Distill the SAM3 image encoder into nine student backbones (3 RepViT, 3 TinyViT, and 3 EfficientViT variants)
- Distill the SAM3 text encoder into three student text encoders (MobileCLIP S0, MobileCLIP S1, and MobileCLIP2-L)
- Use the [SA-1B](https://ai.meta.com/datasets/segment-anything/) dataset with Prompt-in-the-Loop Distillation for image encoder distillation
- Use the [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) dataset for text encoder distillation
- Align student backbone features with teacher encoder outputs

### Stage 2: Temporal Memory Distillation (Video Tracking)

- Replace SAM3's dense memory bank with a compact Perceiver-based memory module (adapted from EdgeTAM)
- Distill memory-conditioned mask predictions using the [SA-V](https://ai.meta.com/datasets/segment-anything-video/) dataset
- Train the Perceiver module to compress and retrieve spatiotemporal features efficiently

### Stage 3: End-to-End Fine-Tuning (Concept Segmentation)

- Refine the complete EfficientSAM3 pipeline using the official SAM3 dataset
- Jointly optimize the distilled encoder, compressed memory, and mask decoder
- Preserve Promptable Concept Segmentation capabilities while maintaining efficiency

### tl;dr
Stage 1: We distill the SAM3 encoder using SAM1 data. <br>
Stage 2: We align the distilled encoder with a Perceiver and an efficient memory bank using SAM2 data. <br>
Stage 3: We fine-tune the complete pipeline using SAM3 data. <br>

</details>
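
The feature-alignment step in Stage 1 can be sketched as a loss that pulls student features toward frozen teacher features. The following is a minimal NumPy illustration; the combination of MSE and cosine terms, the weights, and the `align_loss` name are assumptions, not the repository's actual training objective:

```python
import numpy as np

def align_loss(student_feats, teacher_feats, w_mse=1.0, w_cos=1.0):
    """Toy feature-alignment loss: MSE plus (1 - cosine similarity),
    averaged over all spatial positions / tokens."""
    s = student_feats.reshape(-1, student_feats.shape[-1])
    t = teacher_feats.reshape(-1, teacher_feats.shape[-1])
    mse = np.mean((s - t) ** 2)
    cos = np.sum(s * t, axis=-1) / (
        np.linalg.norm(s, axis=-1) * np.linalg.norm(t, axis=-1) + 1e-8
    )
    return w_mse * mse + w_cos * np.mean(1.0 - cos)

# Identical student and teacher features give a (near-)zero loss
feats = np.random.rand(4, 16, 256)
loss = align_loss(feats, feats)  # ≈ 0.0
```

In practice the teacher features come from the frozen SAM3 encoder and only the student backbone receives gradients.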

---

## Installation

EfficientSAM3 intentionally shares the same software requirements as upstream SAM3:

- **Python** ≥ 3.12
- **PyTorch** 2.7.0 (CUDA 12.6 build recommended)
- **CUDA**-capable GPUs with drivers that support CUDA ≥ 12.6

Follow the exact environment setup from the [official SAM3 README](sam3/README.md) or use the condensed steps below:

```bash
git clone https://github.com/SimonZeng7108/efficientsam3.git
cd efficientsam3

conda create -n efficientsam3 python=3.12 -y
conda activate efficientsam3

pip install --upgrade pip
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install repo dependencies via the root pyproject (brings in SAM3 + Stage-1 extras)
pip install -e ".[stage1]"

# Note: the Stage-1 extra includes the SAM1 package dependency
# (PyPI name: segment-anything, import name: segment_anything).
# If your environment cannot resolve it from PyPI, install the vendored repo instead:
# pip install -e ./segment-anything
```

---

## Inference

Download checkpoints from the [Model Zoo](#efficientsam3-model-zoo--weight-release) section. All Stage 1 image encoder weights are available via the Google Drive and Hugging Face links in the tables below.

**Quick Start (Image Segmentation):**

#### 🔥 Teaser Image Model
<p align="center">
  <img src="assets/es-ev-s-teaser.jpg" width="30%">
</p>

**EfficientViT-S (0.68M params)** distilled from **SAM3 Encoder (461.84M)** — **99.85% smaller**, trained on **1% SA-1B**.

```python
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_efficientvit_s.pt",
    backbone_type="efficientvit",
    model_name="b0",
    enable_inst_interactivity=True,
)

# Process image and predict
image = Image.open("your_image.jpg")  # any RGB image (placeholder path)
processor = Sam3Processor(model)
inference_state = processor.set_image(image)

# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
masks, scores, _ = model.predict_inst(
    inference_state,
    point_coords=points,
    point_labels=labels,
)
```
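
To inspect the predicted masks visually, a small overlay helper can be sketched as below. This is a generic illustration: it assumes each mask can be converted to a boolean `(H, W)` NumPy array, which may require converting the tensors the model returns; `overlay_mask` is not part of the repository API.

```python
import numpy as np

def overlay_mask(image_rgb, mask, color=(255, 0, 0), alpha=0.5):
    """Alpha-blend a boolean mask onto an (H, W, 3) uint8 RGB image."""
    out = image_rgb.astype(np.float32).copy()
    out[mask] = (1 - alpha) * out[mask] + alpha * np.asarray(color, np.float32)
    return out.astype(np.uint8)

# Toy example: highlight the top-left quadrant of a grey image
img = np.full((64, 64, 3), 128, np.uint8)
m = np.zeros((64, 64), bool)
m[:32, :32] = True
vis = overlay_mask(img, m)  # masked pixels are blended toward red
```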

#### 🔥 Teaser Text Prompt Model
<p align="center">
  <img src="assets/es-tv-mc-m-teaser.png" width="30%">
</p>

**MobileCLIP-S1 (63.56M)** distilled from **SAM3 Text Encoder (353.72M)** — trained on **1% Recap-DataComp-1B**.

```python
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model with text encoder
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
    backbone_type="tinyvit",
    # ... further arguments omitted ...
)

# Process image and predict with a text prompt
image = Image.open("your_image.jpg")  # any RGB image (placeholder path)
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)

masks = inference_state["masks"]
scores = inference_state["scores"]
print(len(scores), scores)
```

#### 🔥 SAM3-LiteText Model
Build a SAM3-LiteText model with a single call — the builder handles text encoder creation, checkpoint loading, and context length truncation internally.

```python
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Build SAM3-LiteText model
# Supported text_encoder_type: "MobileCLIP-S0", "MobileCLIP-S1", "MobileCLIP2-L"
# Supported text_encoder_context_length: 16, 32, or 77
model = build_sam3_image_model(
    checkpoint_path="efficient_sam3_image_encoder_mobileclip_s1_ctx32.pt",
    load_from_HF=False,
    text_encoder_type="MobileCLIP-S1",
    text_encoder_context_length=16,
    device="cuda",
)

# Run inference
processor = Sam3Processor(model, device="cuda", confidence_threshold=0.4)
state = processor.set_image(image)
state = processor.set_text_prompt("shoe", state)
masks = state["masks"]
scores = state["scores"]
```
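
The context length truncation the builder performs internally amounts to fitting the tokenized prompt into a fixed-length buffer. A hypothetical sketch (the token ids, pad value, and `fit_context` helper are illustrative, not the repository's API):

```python
def fit_context(token_ids, context_length=16, pad_id=0):
    """Pad or truncate a list of token ids to a fixed context length."""
    ids = list(token_ids)[:context_length]
    return ids + [pad_id] * (context_length - len(ids))

# Short prompts are padded, long prompts are cut to the context length
short = fit_context([49406, 1125, 49407], context_length=16)
long = fit_context(list(range(40)), context_length=16)
```

Shorter contexts mean fewer positional embeddings and less attention compute per prompt, which is why ctx-16 models are cheaper to run than the original ctx-77 encoder.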

For detailed examples including point/box prompts, batched inference, and more, see [sam3/efficientsam3_examples/efficientsam3_for_sam1_task_example.py](sam3/efficientsam3_examples/efficientsam3_for_sam1_task_example.py). For text prompt inference, see [sam3/efficientsam3_examples/efficientsam3_image_predictor_example.ipynb](sam3/efficientsam3_examples/efficientsam3_image_predictor_example.ipynb). For SAM3-LiteText inference examples, see [sam3/efficientsam3_examples/efficientsam3_litetext_image_inference_example.py](sam3/efficientsam3_examples/efficientsam3_litetext_image_inference_example.py) (image) and [sam3/efficientsam3_examples/efficientsam3_litetext_video_predictor_example.ipynb](sam3/efficientsam3_examples/efficientsam3_litetext_video_predictor_example.ipynb) (video).

---

## Training and Evaluation

**Training:**
- For Stage 1 encoder distillation training details, see [README_stage1.md](README_stage1.md). For Stage 1 geometry fine-tuning, check the `stage1_geometry_finetune` branch.
- Stage 2 and Stage 3 training details coming soon.

**Evaluation:**
- To evaluate models on the COCO dataset:
```bash
python eval/eval_coco.py --coco_root data/coco --output_dir output
```
- To evaluate text encoder quality (token-level cosine similarity vs. the SAM3 teacher):
```bash
python eval/eval_text_encoder_similarity.py \
    --student-ckpt /path/to/student_text_encoder_1.pth /path/to/student_text_encoder_2.pth \
    --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
    --device cuda
# Optional: override teacher checkpoint
# --teacher-ckpt /path/to/sam3_teacher_checkpoint.pt
```

---

## Datasets

For dataset setup and download scripts (`data/download_*.sh`) covering COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS, see:

- [README_dataset.md](README_dataset.md)

---

## EfficientSAM3 Model Zoo & Weight Release

### SAM3 Text Encoder + EfficientSAM3 Image Encoder Models

| Model Name | Backbone | Parameters | Stage 1 Weights<br/>(Encoder Distilled) | Stage 2 Weights<br/>(Memory Module Trained) | Stage 3 Weights<br/>(End-to-End Fine-Tuned) |
|------------|----------|------------|----------------------------------------|---------------------------------------------|---------------------------------------------|
| **ES-RV-S** | RepViT-M0.9 | 4.72M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_s.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-RV-M** | RepViT-M1.1 | 7.77M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_m.pt) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_m_geo_ft.pt)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-RV-L** | RepViT-M2.3 | 22.40M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_l.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-TV-S** | TinyViT-5M | 5.07M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_s.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-TV-M** | TinyViT-11M | 10.55M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_m.pt) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_m_geo_ft.pt)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-TV-L** | TinyViT-21M | 20.62M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_l.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-EV-S** | EfficientViT-B0 | 0.68M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_s.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-EV-M** | EfficientViT-B1 | 4.64M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_m.pt) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_m_geo_ft.pt)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-EV-L** | EfficientViT-B2 | 14.98M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_l.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |

> **Note (2025/12/02):** The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.

> **Note (2026/01/11):** The fine-tuned (**ft**) models use geometry-prompt fine-tuning on the same 1% subset of SA-1B; see training details in the `stage1_geometry_finetune` branch.

### EfficientSAM3 Text Encoder + EfficientSAM3 Image Encoder Models

| Model Name | Backbone | Parameters | Stage 1 Weights<br/>(Encoder Distilled) | Stage 2 Weights<br/>(Memory Module Trained) | Stage 3 Weights<br/>(End-to-End Fine-Tuned) |
|------------|----------|------------|----------------------------------------|---------------------------------------------|---------------------------------------------|
| **ES-RV-S-MC-S1** | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit-m0_9_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-RV-M-MC-S1** | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit-m1_1_mobileclip_s1.pth) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_m1.1_mobileclip_s1_ft.pth)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-RV-L-MC-S1** | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit-m2_3_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-TV-S-MC-S1** | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_5m_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-TV-M-MC-S1** | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_11m_mobileclip_s1.pth) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tiny_vit_11m_mobileclip_s1_ft.pth)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-TV-L-MC-S1** | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_21m_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-EV-S-MC-S1** | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit-b0_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-EV-M-MC-S1** | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit-b1_mobileclip_s1.pth) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_b1_mobileclip_s1_ft.pth)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| **ES-EV-L-MC-S1** | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit-b2_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |

> **Note (2025/12/08):** The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset and combined with all 9 image encoder variants. We observe some performance degradation; this is expected, since the text encoders are not yet aligned with the lightweight image encoders in Stage 1. We will release the Stage 1+ fine-tuned weights in the future.

> **Note (2025/12/08):** We have also uploaded standalone text encoder weights trained on 1% of the Recap-DataComp-1B dataset: MobileCLIP-S1 and MobileCLIP2-L. You can merge them with Stage 1 trained image encoder weights to get the full model.

> **Note (2026/01/11):** The fine-tuned (**ft**) text encoder models are fine-tuned on SA-Co Gold+Silver text annotations. Standalone fine-tuned text encoder weights: MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L.

### SAM3-LiteText Models

SAM3-LiteText replaces the SAM3 text encoder with a lightweight distilled text encoder, reducing text encoder parameters by up to **88%** with comparable performance. See the [SAM3-LiteText paper](https://arxiv.org/abs/2602.12173) for details.

| Model | Text Encoder | Ctx | Text Params | Weights |
|-------|--------------|-----|-------------|---------|
| **SAM3-LiteText-S0-16** | MobileCLIP-S0 | 16 | 42.54M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/sam3_litetext/efficient_sam3_image_encoder_mobileclip_s0_ctx16.pt) |
| **SAM3-LiteText-S1-16** | MobileCLIP-S1 | 16 | 63.53M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/sam3_litetext/efficient_sam3_image_encoder_mobileclip_s1_ctx16.pt) |
| **SAM3-LiteText-L-16** | MobileCLIP2-L | 16 | 123.80M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/sam3_litetext/efficient_sam3_image_encoder_mobileclip2_l_ctx16.pt) |

> All models use the **SAM3 ViT-H image encoder** (353.72M vision params). The text encoder parameters shown are those of the distilled student replacing the original 353.72M text encoder, achieving up to **88% parameter reduction**.
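
As a quick sanity check, the headline reduction follows directly from the parameter counts in the table (using the MobileCLIP-S0 student as the example):

```python
teacher_params = 353.72e6  # original SAM3 text encoder
student_params = 42.54e6   # MobileCLIP-S0 student (SAM3-LiteText-S0-16)
reduction = 1 - student_params / teacher_params
print(f"{reduction:.1%}")  # prints "88.0%"
```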

---

## Preliminary Evaluation

<details>
<summary>Stage 1 Image Model Evaluation Results (COCO val2017)</summary>

| Model Name | Backbone | Parameters | COCO mIoU | Test Time (s) |
|------------|----------|------------|-----------|---------------|
| **ES-RV-S** | RepViT-M0.9 | 4.72M | 64.80% | 407.23 |
| **ES-RV-M** | RepViT-M1.1 | 7.77M | 65.28% (ft 65.60%) | 413.38 |
| **ES-RV-L** | RepViT-M2.3 | 22.40M | 65.53% | 466.66 |
| **ES-TV-S** | TinyViT-5M | 5.07M | 65.51% | 430.52 |
| **ES-TV-M** | TinyViT-11M | 10.55M | 65.45% (ft 65.69%) | 443.45 |
| **ES-TV-L** | TinyViT-21M | 20.62M | 66.29% | 452.14 |
| **ES-EV-S** | EfficientViT-B0 | 0.68M | 61.62% | 419.57 |
| **ES-EV-M** | EfficientViT-B1 | 4.64M | 64.82% (ft 64.94%) | 434.45 |
| **ES-EV-L** | EfficientViT-B2 | 14.98M | 66.30% | 450.36 |

> **Note:** The evaluation is done with a single NVIDIA 4070 Ti.

</details>

<details>
<summary>Stage 1 Text Encoder Evaluation Results (SA-Co/VEval Noun Phrases)</summary>

Metric: average token-level cosine similarity between student text features and SAM3 text encoder features.
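
In code, the metric's definition can be sketched as follows (an illustration of the metric only, not the evaluation script; `avg_token_cosine` is a made-up name):

```python
import numpy as np

def avg_token_cosine(student, teacher, eps=1e-8):
    """Average cosine similarity between matching token embeddings.

    student, teacher: (num_tokens, dim) arrays of text features."""
    dots = np.sum(student * teacher, axis=-1)
    norms = np.linalg.norm(student, axis=-1) * np.linalg.norm(teacher, axis=-1)
    return float(np.mean(dots / (norms + eps)))

x = np.random.rand(16, 512)
sim = avg_token_cosine(x, x)  # ≈ 1.0 for identical features
```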
|
| 364 |
+
|
| 365 |
+
**Pretrained on 1% Recap-DataComp-1B**
|
| 366 |
+
|
| 367 |
+
| Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
|
| 368 |
+
|------------|--------------|-------------------|----------|
|
| 369 |
+
| **ES-MC-S0 (Recap-DC1B 1% pt)** | MobileCLIP-S0 | 0.864846 | 5184 noun phrases |
|
| 370 |
+
| **ES-MC-S1 (Recap-DC1B 1% pt)** | MobileCLIP-S1 | 0.854405 | 5184 noun phrases |
|
| 371 |
+
| **ES-MC2-L (Recap-DC1B 1% pt)** | MobileCLIP2-L | 0.850976 | 5184 noun phrases |
|
| 372 |
+
|
| 373 |
+
**Fine-tuned on SA-Co Gold+Silver text annotations**
|
| 374 |
+
|
| 375 |
+
| Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
|
| 376 |
+
|------------|--------------|-------------------|----------|
|
| 377 |
+
| **ES-MC-S0 (SA-Co ft)** | MobileCLIP-S0 | 0.938915 | 5184 noun phrases |
|
| 378 |
+
| **ES-MC-S1 (SA-Co ft)** | MobileCLIP-S1 | 0.947152 | 5184 noun phrases |
|
| 379 |
+
| **ES-MC2-L (SA-Co ft)** | MobileCLIP2-L | 0.952901 | 5184 noun phrases |
|
| 380 |
+
|
| 381 |
+
> **Note:** Evaluation is done with [eval_text_encoder_similarity.py](eval/eval_text_encoder_similarity.py) using `data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json`. Pretrained models are trained on Recap-DataComp-1B (1%), and fine-tuned models are trained on SA-Co Gold+Silver text annotations.
|
| 382 |
+
|
| 383 |
+
</details>

<details>
<summary>SAM3-LiteText Evaluation Results (SA-Co/Gold, Metric: CG_F1)</summary>

| Model | Ctx | MetaClip | SA1B | Crowd | Food | SptEq | Attr | Wiki | **Avg F1** | **MCC** | **pmF1** |
|-------|-----|----------|------|-------|------|-------|------|------|------------|---------|----------|
| **gDino-T** | - | 2.9 | 3.1 | 0.28 | 0.96 | 1.1 | 13.8 | 0.70 | 3.3 | 0.15 | 16.2 |
| **OWLv2** | - | 12.2 | 9.8 | 8.9 | 24.4 | 24.4 | 25.9 | 15.4 | 17.3 | 0.46 | 36.8 |
| **LLMDet-L** | - | 4.5 | 5.3 | 2.4 | 5.5 | 4.4 | 22.2 | 1.2 | 6.5 | 0.21 | 27.3 |
| **APE-D** | - | 12.6 | 2.2 | 7.2 | 22.7 | 31.8 | 26.7 | 11.6 | 16.4 | 0.40 | 36.9 |
| **DINO-X** | - | 17.2 | 19.7 | 12.9 | 30.1 | 28.4 | 31.0 | 9.7 | 21.3 | 0.38 | 55.2 |
| **Gemini 2.5** | - | 9.9 | 13.1 | 8.2 | 19.6 | 15.1 | 18.8 | 6.5 | 13.0 | 0.29 | 46.1 |
| **SAM3** | 77 | 47.3 | 53.7 | 61.1 | 53.4 | 65.5 | 54.9 | 42.5 | 54.1 | 0.82 | 66.1 |
| **SAM3-LiteText-S0** | 16 | 47.06 | 53.42 | 60.58 | 52.18 | 65.05 | 54.86 | 42.12 | 53.61 | 0.81 | 65.54 |
| **SAM3-LiteText-S1** | 16 | 47.18 | 53.58 | 60.76 | 52.43 | 65.28 | 55.02 | 42.35 | 53.80 | 0.81 | 65.72 |
| **SAM3-LiteText-L** | 16 | 47.24 | 53.66 | 60.88 | 52.65 | 65.49 | 55.19 | 42.54 | 53.95 | 0.81 | 65.87 |

> **Note:** This table reports the released ctx-16 models, which were trained on a more extensive dataset mixture than the models reported in the paper, so performance may differ slightly from the published values.

</details>
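For reference, the F1 and MCC columns follow the standard definitions from binary confusion counts, sketched below. The exact CG_F1 / pmF1 protocol is defined by SAM3's SA-Co evaluation and is not reproduced here; this is only the generic form of the underlying metrics.

```python
import math

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion counts, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return ((tp * tn - fp * fn) / denom) if denom else 0.0
```

MCC is reported alongside F1 because it also rewards correct rejections (true negatives), which matters when many prompted concepts are absent from an image.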

---

## CoreML / ONNX Export

Coming soon: export pipelines to ONNX and CoreML for cross-platform deployment.

---

## Web Demo

Coming soon: an interactive web demo for real-time concept segmentation and tracking.

---

## Development To-Do List

- [x] **Release Stage 1 Image Encoder Weights**: Distilled image encoder weights from the SAM3 image encoder for all 9 variants (RepViT, TinyViT, EfficientViT)
- [x] **Release Stage 1 Text Encoder Weights**: SAM3 text encoder weights distilled into MobileCLIP-S1, combined with all 9 image encoder variants
- [x] **Release Stage 1+ Fine-Tuned Encoder Weights**: Prompt-in-the-loop supervised fine-tuning for improved encoder performance
- [x] **Release SAM3-LiteText Weights**: A distilled lightweight MobileCLIP text encoder competitive with the SAM3 text encoder for efficient vision-language segmentation
- [ ] **Release Stage 2 Memory Bank Aligned Model Weights**: Models with Perceiver-based memory compression trained on the SA-V dataset
- [ ] **Release Stage 3 Fine-Tuned Model Weights**: End-to-end fine-tuned models on the SAM3 dataset with full PCS capabilities
- [ ] **ONNX/CoreML Export**: Export models to ONNX and CoreML formats for cross-platform deployment
- [ ] **Web Demo**: Interactive web demonstration for real-time concept segmentation and tracking

---

## Call for Pull Requests

The idea for this repository originated from my work on SAM2 at Amazon, particularly the research described in [this paper](https://ieeexplore.ieee.org/abstract/document/11084428). Due to company policy, I cannot share that codebase. This year I am excited to work on making SAM3 more efficient and accessible to the community.

We welcome contributions to EfficientSAM3! Please feel free to submit pull requests to improve the codebase, add new features, or fix bugs. In particular, we are looking for:

- Efficient MedSAM3 integration (see [MedSAM2 by Bo Wang Lab](https://github.com/bowang-lab/MedSAM2))
- A Gradio demo (e.g. [EfficientTAM on Hugging Face Spaces](https://huggingface.co/spaces/yunyangx/EfficientTAM))
- A web demo deployed with Vercel (e.g. [Segment Anything Web UI](https://segment-anything-webui.vercel.app/))
- Annotation tools, such as [X-AnyLabeling](https://github.com/CVHub520/X-AnyLabeling) and [AnyLabeling](https://github.com/vietanhdev/anylabeling)
- An iOS or Android app (e.g. [Cutcha Photo on the App Store](https://apps.apple.com/us/app/cutcha-photo/id6478521132))
- An NVCC-based desktop application
- Anything else that you think is cool!

---

All meaningful contributions will be acknowledged and integrated into both the repository and the associated paper. We warmly welcome all contributors and happily offer co-authorship to those whose work merits inclusion in the paper.

## Citation

If you use EfficientSAM3 in your research, please cite:

```bibtex
@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
  title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
  author={Chengxi Zeng and Yuxuan Jiang and Aaron Zhang},
  year={2025},
  eprint={2511.15833},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.15833},
}
```

```bibtex
@misc{zeng2026sam3litetextanatomicalstudysam3,
  title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation},
  author={Chengxi Zeng and Yuxuan Jiang and Ge Gao and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
  year={2026},
  eprint={2602.12173},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.12173},
}
```

## License

This repository is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

This project builds upon [SAM](https://github.com/facebookresearch/segment-anything), [SAM2](https://github.com/facebookresearch/sam2), [SAM3](https://github.com/facebookresearch/sam3), [EdgeSAM](https://github.com/chongzhou96/EdgeSAM), [EdgeTAM](https://github.com/facebookresearch/EdgeTAM), [EfficientTAM](https://github.com/yformer/EfficientTAM), [RepViT](https://github.com/THU-MIG/RepViT), [TinyViT](https://github.com/wkcn/TinyViT), [EfficientViT](https://github.com/mit-han-lab/efficientvit), and [MobileCLIP](https://github.com/apple/ml-mobileclip). Please refer to their respective licenses for usage terms.

## Acknowledgments

We gratefully acknowledge the [University of Bristol Isambard-AI supercomputer cluster](https://www.bristol.ac.uk/research/centres/bristol-supercomputing/articles/2025/isambard-ai-is-11th-fastest-supercomputer-in-the-world.html) for providing computational resources for this project. Special thanks to [Dr. Fan Aaron Zhang](https://fan-aaron-zhang.github.io/) for allocating resources and supporting this research.

---

## Users

Organizations and projects using EfficientSAM3:

<table>
  <tr>
    <td align="center" width="20%">
      <img src="https://github.com/SimonZeng7108/simonzeng7108.github.io/blob/main/efficientsam3/static/images/esa.png" alt="European Space Agency" height="80"><br>
      <a href="https://www.esa.int/">European Space Agency</a>
    </td>
  </tr>
</table>

> **Note:** If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.