Simon7108528 committed on
Commit 47b2e1f · verified · 1 Parent(s): 8e2fca2

Update README with SAM3-LiteText evaluation results

Files changed (1):
  1. README.md +458 -22
README.md CHANGED
@@ -8,29 +8,201 @@ tags:
8
  - efficient-sam
9
  ---
10
 
11
- # SAM3-LiteText
 
12
 
13
- SAM3-LiteText is a lightweight text encoding framework for vision-language segmentation, introduced in the paper [SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation](https://huggingface.co/papers/2602.12173).
14
 
15
  - ## Introduction
16
 
17
- Vision-language segmentation models like SAM3 enable flexible, prompt-driven visual grounding but often rely on large text encoders designed for open-ended language understanding. SAM3-LiteText addresses this by replacing the original SAM3 text encoder with a compact **MobileCLIP** student optimized via knowledge distillation. This approach reduces text encoder parameters by up to 88% and significantly lowers the memory footprint while maintaining segmentation performance comparable to the original model.
18
 
19
  - ## Resources
20
 
21
- - **Paper:** [SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation](https://huggingface.co/papers/2602.12173)
22
- - **Code:** [GitHub Repository (sam3_litetext branch)](https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext)
23
- - **Project Page:** [EfficientSAM3](https://simonzeng7108.github.io/efficientsam3/)
24
 
25
  - ## Sample Usage
26
 
27
  - The following example demonstrates how to perform inference with a text prompt using an EfficientSAM3 model variant with a distilled MobileCLIP text encoder:
28
 
29
  ```python
30
  from sam3.model_builder import build_efficientsam3_image_model
31
  from sam3.model.sam3_image_processor import Sam3Processor
32
 
33
- # Load model with distilled text encoder
34
  model = build_efficientsam3_image_model(
35
  checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
36
  backbone_type="tinyvit",
@@ -42,24 +214,288 @@ model = build_efficientsam3_image_model(
42
  processor = Sam3Processor(model)
43
  inference_state = processor.set_image(image)
44
  inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)
45
-
46
  masks = inference_state["masks"]
47
  scores = inference_state["scores"]
48
- print(f"Found {len(scores)} masks. Scores: {scores}")
49
  ```
50
 
51
  ## Citation
52
 
53
- If you use SAM3-LiteText or the EfficientSAM3 framework in your research, please cite:
54
 
55
  ```bibtex
56
  @misc{zeng2025efficientsam3progressivehierarchicaldistillation,
57
- title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
58
- author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Fan Aaron Zhang},
59
- year={2025},
60
- eprint={2511.15833},
61
- archivePrefix={arXiv},
62
- primaryClass={cs.CV},
63
  - url={https://arxiv.org/abs/2511.15833},
64
  }
65
  - ```
8
  - efficient-sam
9
  ---
10
 
11
+ ### EfficientSAM3: Progressive Hierarchical Knowledge Distillation (PhD) from SAM1, 2 and 3
12
+ [Chengxi Simon Zeng](https://simonzeng7108.github.io/about/)<sup>1,∗</sup>, [Yuxuan Jiang](https://YuxuanJJ.github.io/)<sup>1</sup>, [Ge Gao](https://scholar.google.com/citations?user=j2_80ewAAAAJ&hl=en)<sup>1</sup>, [Shuai Wang](https://shuaiwang97.github.io/)<sup>2</sup>, [Duolikun Danier](https://danier97.github.io/)<sup>3</sup>, [Bin Zhu](https://binzhubz.github.io/)<sup>4</sup>, [Stevan Rudinac](https://stevanrudinac.com/)<sup>2</sup>, [David Bull](https://david-bull.github.io/)<sup>1</sup>, [Fan Aaron Zhang](https://fan-aaron-zhang.github.io/)<sup>1,†</sup>
13
 
14
+ <sup>1</sup>Visual Information Lab, University of Bristol; <sup>2</sup>University of Amsterdam; <sup>3</sup>School of Informatics, University of Edinburgh; <sup>4</sup>Singapore Management University
15
 
16
+ <sup>∗</sup>Primary Contributor
17
+ <sup>†</sup>Corresponding Author
18
+
19
+ [📄 Paper](https://arxiv.org/abs/2511.15833) | [🌐 Project Page](https://simonzeng7108.github.io/efficientsam3/) | [🤗 Hugging Face](https://huggingface.co/Simon7108528/EfficientSAM3) | [💬 Discord](https://discord.gg/FMyaQca7xT)
20
+ ---
21
+ ## Updates
22
+ - **[2026/02/18]** **SAM3-LiteText** released! SAM3-LiteText reduces text encoder parameters by up to 88% with similar performance to the original text encoder. [Paper](https://arxiv.org/abs/2602.12173) available on arXiv.
23
+ - **[2026/01/11]** Stage 1 geometry-prompt fine-tuned (**ft**) weights released/updated (image encoders on 1% SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
24
+ - **[2025/12/08]** Stage 1 text encoder weights released for all 3 variants (MobileCLIP S0, S1, and MobileCLIP2 L) - distilled on 1% Recap-DataComp-1B dataset.
25
+ - **[2025/12/02]** Stage 1 image encoder weights released for all 9 variants (RepViT, TinyViT, EfficientViT) - unsupervised distilled on 1% of SA-1B dataset.
26
+ - **[2025/11/25]** Teaser model released. See above. More models are baking in the oven🔥.
27
+ - **[2025/10/18]** Project announced. Code and weights are not released yet; they will be published once SAM3 code is publicly available.
28
+ ---
29
+
30
+
31
+ ## Table of Contents
32
+
33
+ - [Table of Contents](#table-of-contents)
34
+ - [Updates](#updates)
35
+ - [Installation](#installation)
36
+ - [Inference](#inference)
37
+ - [Training and Evaluation](#training-and-evaluation)
38
+ - [Datasets](#datasets)
39
+ - [EfficientSAM3 Model Zoo \& Weight Release](#efficientsam3-model-zoo--weight-release)
40
+ - [Preliminary Evaluation](#preliminary-evaluation)
41
+ - [CoreML / ONNX Export](#coreml--onnx-export)
42
+ - [Web Demo](#web-demo)
43
+ - [Development To-Do List](#development-to-do-list)
44
+ - [Call for Pull Requests](#call-for-pull-requests)
45
+ - [Citation](#citation)
46
+ - [License](#license)
47
+ - [Acknowledgments](#acknowledgments)
48
+ - [Users](#users)
49
+
50
+ ---
51
+
52
+ [SAM3](https://github.com/facebookresearch/sam3) (Segment Anything Model 3) has introduced powerful **Promptable Concept Segmentation (PCS)** capabilities, enabling semantic understanding and temporal object tracking beyond traditional mask generation. However, SAM3's massive vision backbone and dense memory bank make it impractical for real-time, on-device applications where computational resources and latency constraints are critical.
53
+
54
+ **EfficientSAM3** addresses this challenge by distilling SAM3's capabilities into lightweight architectures suitable for edge devices, enabling high-quality concept segmentation on mobile phones, embedded systems, and resource-constrained platforms.
55
+
56
+ <p align="center">
57
+ <img src="assets/efficientsam3_full.svg" alt="EfficientSAM3 Architecture" width="100%">
58
+ </p>
59
+
60
+
61
+ ---
62
+
63
+
64
+
65
+ <details>
66
+ <summary>Supported Models and Architecture</summary>
67
+
68
+ | Component | Model/Backbone | Purpose |
69
+ |-----------|----------------|---------|
70
+ | **Teacher Models** | [SAM](https://github.com/facebookresearch/segment-anything) (Segment Anything Model) | Foundation for image-level encoder distillation |
71
+ | | [SAM2](https://github.com/facebookresearch/sam2) | Temporal memory and video tracking distillation |
72
+ | | [SAM3](https://github.com/facebookresearch/sam3) | Promptable Concept Segmentation (PCS) capabilities |
73
+ | **Datasets** | [SA-1B](https://ai.meta.com/datasets/segment-anything/) | Image segmentation dataset |
74
+ | | [SA-V](https://ai.meta.com/datasets/segment-anything-video/) | Video object segmentation dataset |
75
+ | | [SA-Co/Gold](https://huggingface.co/datasets/facebook/SACo-Gold) | Promptable concept segmentation benchmark |
76
+ | | [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) | Large-scale image-text dataset for text encoder distillation |
77
+ | **Student Backbones (Image)** | [RepViT](https://github.com/THU-MIG/RepViT) (M0.9, M1.1, M2.3) | Mobile-optimized Vision Transformer for highest throughput |
78
+ | | [TinyViT](https://github.com/wkcn/TinyViT) (5M, 11M, 21M) | Balanced efficiency and performance |
79
+ | | [EfficientViT](https://github.com/mit-han-lab/efficientvit) (B0, B1, B2) | Ultra-lightweight architectures for minimal latency |
80
+ | **Student Backbones (Text)** | [MobileCLIP](https://github.com/apple/ml-mobileclip) S0 | Lightweight text encoder (42.57M params) |
81
+ | | [MobileCLIP](https://github.com/apple/ml-mobileclip) S1 | Balanced text encoder (63.56M params) |
82
+ | | [MobileCLIP2](https://github.com/apple/ml-mobileclip) L | Larger text encoder (123.6M params) |
83
+
84
+
85
+ </details>
86
+
87
+ ---
88
+
89
+ <details>
90
+ <summary>Three-Stage Progressive Training Curriculum</summary>
91
+
92
+ EfficientSAM3 is trained through a three-stage progressive distillation:
93
+
94
+ ### Stage 1: Encoder Distillation (Image-Level Segmentation)
95
+
96
+ - Distill the SAM3 image encoder to nine student backbones (3 RepViT, 3 TinyViT, and 3 EfficientViT variants)
97
+ - Distill the SAM3 text encoder to three student text encoders (MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L)
98
+ - Use [SA-1B](https://ai.meta.com/datasets/segment-anything/) dataset with Prompt-in-the-Loop Distillation for image encoder distillation
99
+ - Use [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) dataset for text encoder distillation
100
+ - Align student backbone features with teacher encoder outputs.
101
+
102
+ ### Stage 2: Temporal Memory Distillation (Video Tracking)
103
+
104
+ - Replace SAM3's dense memory bank with a compact Perceiver-based memory module (adapted from EdgeTAM)
105
+ - Distill memory-conditioned mask predictions using [SA-V](https://ai.meta.com/datasets/segment-anything-video/) dataset
106
+ - Train the Perceiver module to compress and retrieve spatiotemporal features efficiently
107
+
108
+ ### Stage 3: End-to-End Fine-Tuning (Concept Segmentation)
109
+
110
+ - Refine the complete EfficientSAM3 pipeline using the official SAM3 dataset
111
+ - Joint optimization of distilled encoder + compressed memory + mask decoder
112
+ - Preserve Promptable Concept Segmentation capabilities while maintaining efficiency
113
+
114
+ ### tl;dr
115
+ Stage 1: We distill the SAM3 encoder using SAM1 data. <br>
116
+ Stage 2: We align the distilled encoder to a perceiver and an efficient memory bank using SAM2 data. <br>
117
+ Stage 3: We fine-tune the complete pipeline using SAM3 data. <br>
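At their core, the "align features" steps above come down to minimizing a feature-matching loss between student and teacher outputs. A toy sketch of that objective (plain Python over 1-D feature lists; the actual training operates on torch feature maps and adds prompt-in-the-loop supervision):

```python
# Toy feature-alignment (distillation) objective: mean squared error between
# student and teacher features. Real training uses tensors, not Python lists.
def align_loss(student_feat, teacher_feat):
    assert len(student_feat) == len(teacher_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)

teacher = [1.0, 1.0, 0.0]
student = [0.5, 1.0, 0.0]
print(align_loss(student, teacher))  # 0.25 / 3 ≈ 0.0833
```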
118
+
119
+ </details>
120
+
121
+
122
+ ---
123
+
124
+ ## Installation
125
+
126
+ EfficientSAM3 purposely shares the same software contract as upstream SAM3:
127
+
128
+ - **Python** ≥ 3.12
129
+ - **PyTorch** 2.7.0 (CUDA 12.6 build recommended)
130
+ - **CUDA**-capable GPUs with drivers that support CUDA ≥ 12.6
131
+
132
+ Follow the exact environment setup from the [official SAM3 README](sam3/README.md) or use the condensed steps below:
133
 
 
134
 
135
+ ```bash
136
+ git clone https://github.com/SimonZeng7108/efficientsam3.git
137
+ cd efficientsam3
138
 
139
+ conda create -n efficientsam3 python=3.12 -y
140
+ conda activate efficientsam3
 
141
 
142
+ pip install --upgrade pip
143
+ pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
144
 
145
+ # Install repo dependencies via the root pyproject (brings in SAM3 + Stage-1 extras)
146
+ pip install -e ".[stage1]"
147
+
148
+ # Note: the Stage-1 extra includes the SAM1 package dependency
149
+ # (PyPI name: segment-anything, import name: segment_anything).
150
+ # If your environment cannot resolve it from PyPI, install the vendored repo instead:
151
+ # pip install -e ./segment-anything
152
+ ```
153
+
154
+ ---
155
+
156
+ ## Inference
157
+
158
+ Download checkpoints from the [Model Zoo](#efficientsam3-model-zoo--weight-release) section. All Stage 1 image encoder weights are available via Google Drive and Hugging Face links in the table below.
159
+
160
+ **Quick Start (Image Segmentation):**
161
+ #### 🔥 Teaser Image Model
162
+ <p align="center">
163
+ <img src="assets/es-ev-s-teaser.jpg" width="30%">
164
+ </p>
165
+
166
+ **EfficientViT-S (0.68M params)** distilled from **SAM3 Encoder (461.84M)** — **99.85% smaller**, trained on **1% SA-1B**.
167
+
168
+ ```python
169
+ from sam3.model_builder import build_efficientsam3_image_model
170
+ from sam3.model.sam3_image_processor import Sam3Processor
171
+
172
+ # Load model
173
+ model = build_efficientsam3_image_model(
174
+ checkpoint_path="efficient_sam3_efficientvit_s.pt",
175
+ backbone_type="efficientvit",
176
+ model_name="b0",
177
+ enable_inst_interactivity=True,
178
+ )
179
+
180
+ # Process image and predict
181
+ processor = Sam3Processor(model)
182
+ inference_state = processor.set_image(image)
183
+
184
+ # Single positive point prompt (x, y) in pixels
185
+ points = [[image.size[0] / 2, image.size[1] / 2]]
186
+ labels = [1]
187
+ masks, scores, _ = model.predict_inst(
188
+ inference_state,
189
+ point_coords=points,
190
+ point_labels=labels
191
+ )
192
+ ```
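The snippet above assumes an `image` object is already in scope; the `image.size[0]` usage suggests a PIL image. One way to provide it (the use of Pillow here is an assumption, not shown in the original):

```python
from PIL import Image

# Stand-in for Image.open("your_photo.jpg").convert("RGB"): any RGB PIL image
# works, since the example above only relies on image.size and pixel content.
image = Image.new("RGB", (640, 480), color="gray")
print(image.size)  # (640, 480)
```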
193
+
194
+ #### 🔥 Teaser Text Prompt Model
195
+ <p align="center">
196
+ <img src="assets/es-tv-mc-m-teaser.png" width="30%">
197
+ </p>
198
+
199
+ **MobileCLIP-S1 (63.56M)** distilled from **SAM3 Text Encoder (353.72M)** — trained on **1% Recap-DataComp-1B**.
200
 
201
  ```python
202
  from sam3.model_builder import build_efficientsam3_image_model
203
  from sam3.model.sam3_image_processor import Sam3Processor
204
 
205
+ # Load model with text encoder
206
  model = build_efficientsam3_image_model(
207
  checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
208
  backbone_type="tinyvit",
 
214
  processor = Sam3Processor(model)
215
  inference_state = processor.set_image(image)
216
  inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)
 
217
  masks = inference_state["masks"]
218
  scores = inference_state["scores"]
219
+ print(len(scores), scores)
220
  ```
221
 
222
+ #### 🔥 SAM3-LiteText Model
223
+ Build a SAM3-LiteText model with a single call — the builder handles text encoder creation, checkpoint loading, and context length truncation internally.
224
+
225
+ ```python
226
+ from sam3.model_builder import build_sam3_image_model
227
+ from sam3.model.sam3_image_processor import Sam3Processor
228
+
229
+ # Build SAM3-LiteText model
230
+ # Supported text_encoder_type: "MobileCLIP-S0", "MobileCLIP-S1", "MobileCLIP2-L"
231
+ # Supported text_encoder_context_length: 16, 32, or 77
232
+ model = build_sam3_image_model(
233
+ checkpoint_path="efficient_sam3_image_encoder_mobileclip_s1_ctx32.pt",
234
+ load_from_HF=False,
235
+ text_encoder_type="MobileCLIP-S1",
236
+ text_encoder_context_length=16,
237
+ device='cuda',
238
+ )
239
+
240
+ # Run inference
241
+ processor = Sam3Processor(model, device='cuda', confidence_threshold=0.4)
242
+ state = processor.set_image(image)
243
+ state = processor.set_text_prompt("shoe", state)
244
+ masks = state["masks"]
245
+ scores = state["scores"]
246
+ ```
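`text_encoder_context_length` controls how many tokens of the prompt the text encoder sees. A simplified sketch of the truncation the builder applies (the EOT id 49407 is CLIP's end-of-text token; treat the exact token ids as an assumption):

```python
def truncate_tokens(token_ids, context_length=16, eot_id=49407):
    """Cut a tokenized prompt to a fixed context length, preserving the EOT
    token, then zero-pad to exactly context_length entries."""
    if len(token_ids) > context_length:
        token_ids = token_ids[: context_length - 1] + [eot_id]
    return token_ids + [0] * (context_length - len(token_ids))

out = truncate_tokens(list(range(1, 25)), context_length=16)  # 24 dummy ids
print(len(out), out[-1])  # 16 49407
```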
247
+
248
+ For detailed examples including point/box prompts, batched inference, and more, see [sam3/efficientsam3_examples/efficientsam3_for_sam1_task_example.py](sam3/efficientsam3_examples/efficientsam3_for_sam1_task_example.py). For text prompt inference, see [sam3/efficientsam3_examples/efficientsam3_image_predictor_example.ipynb](sam3/efficientsam3_examples/efficientsam3_image_predictor_example.ipynb). For SAM3-LiteText inference examples, see [sam3/efficientsam3_examples/efficientsam3_litetext_image_inference_example.py](sam3/efficientsam3_examples/efficientsam3_litetext_image_inference_example.py) (image) and [sam3/efficientsam3_examples/efficientsam3_litetext_video_predictor_example.ipynb](sam3/efficientsam3_examples/efficientsam3_litetext_video_predictor_example.ipynb) (video).
249
+
250
+ ---
251
+
252
+ ## Training and Evaluation
253
+
254
+ **Training:**
255
+ - For Stage 1 encoder distillation training details, see [README_stage1.md](README_stage1.md). For Stage 1 geometry fine-tuning, check the `stage1_geometry_finetune` branch.
256
+ - Stage 2 and Stage 3 training details coming soon.
257
+
258
+ **Evaluation:**
259
+ - To evaluate models on the COCO dataset:
260
+ ```bash
261
+ python eval/eval_coco.py --coco_root data/coco --output_dir output
262
+ ```
263
+
264
+ - To evaluate text encoder quality (token-level cosine similarity vs SAM3 teacher):
265
+ ```bash
266
+ python eval/eval_text_encoder_similarity.py \
267
+ --student-ckpt /path/to/student_text_encoder_1.pth /path/to/student_text_encoder_2.pth \
268
+ --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
269
+ --device cuda
270
+ # Optional: override teacher checkpoint
271
+ # --teacher-ckpt /path/to/sam3_teacher_checkpoint.pt
272
+ ```
273
+
274
+ ---
275
+
276
+ ## Datasets
277
+
278
+ For dataset setup and download scripts (`data/download_*.sh`) covering COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS, see:
279
+
280
+ - [README_dataset.md](README_dataset.md)
281
+
282
+ ---
283
+
284
+
285
+ ## EfficientSAM3 Model Zoo & Weight Release
286
+
287
+ ### SAM3 Text Encoder + EfficientSAM3 Image Encoder Models
288
+
289
+ | Model Name | Backbone | Parameters | Stage 1 Weights<br/>(Encoder Distilled) | Stage 2 Weights<br/>(Memory Module Trained) | Stage 3 Weights<br/>(End-to-End Fine-Tuned) |
290
+ |------------|----------|------------|----------------------------------------|---------------------------------------------|---------------------------------------------|
291
+ | **ES-RV-S** | RepViT-M0.9 | 4.72M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_s.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
292
+ | **ES-RV-M** | RepViT-M1.1 | 7.77M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_m.pt) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_m_geo_ft.pt)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
293
+ | **ES-RV-L** | RepViT-M2.3 | 22.40M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_l.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
294
+ | **ES-TV-S** | TinyViT-5M | 5.07M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_s.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
295
+ | **ES-TV-M** | TinyViT-11M | 10.55M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_m.pt) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_m_geo_ft.pt)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
296
+ | **ES-TV-L** | TinyViT-21M | 20.62M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_l.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
297
+ | **ES-EV-S** | EfficientViT-B0 | 0.68M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_s.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
298
+ | **ES-EV-M** | EfficientViT-B1 | 4.64M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_m.pt) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_m_geo_ft.pt)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
299
+ | **ES-EV-L** | EfficientViT-B2 | 14.98M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_l.pt) | $$\text{Planned}$$ | $$\text{Planned}$$ |
300
+
301
+ > **Note (2025/12/02):** The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.
302
+
303
+ > **Note (2026/01/11):** The fine-tuned (**ft**) models use geometry-prompt fine-tuning on the same 1% subset of SA-1B; see training details in the `stage1_geometry_finetune` branch.
304
+
305
+ ### EfficientSAM3 Text Encoder + EfficientSAM3 Image Encoder Models
306
+
307
+ | Model Name | Backbone | Parameters | Stage 1 Weights<br/>(Encoder Distilled) | Stage 2 Weights<br/>(Memory Module Trained) | Stage 3 Weights<br/>(End-to-End Fine-Tuned) |
308
+ |------------|----------|------------|----------------------------------------|---------------------------------------------|---------------------------------------------|
309
+ | **ES-RV-S-MC-S1** | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit-m0_9_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
310
+ | **ES-RV-M-MC-S1** | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit-m1_1_mobileclip_s1.pth) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit_m1.1_mobileclip_s1_ft.pth)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
311
+ | **ES-RV-L-MC-S1** | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_repvit-m2_3_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
312
+ | **ES-TV-S-MC-S1** | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_5m_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
313
+ | **ES-TV-M-MC-S1** | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_11m_mobileclip_s1.pth) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tiny_vit_11m_mobileclip_s1_ft.pth)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
314
+ | **ES-TV-L-MC-S1** | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_tinyvit_21m_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
315
+ | **ES-EV-S-MC-S1** | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit-b0_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
316
+ | **ES-EV-M-MC-S1** | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit-b1_mobileclip_s1.pth) (ft: [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit_b1_mobileclip_s1_ft.pth)) | $$\text{Planned}$$ | $$\text{Planned}$$ |
317
+ | **ES-EV-L-MC-S1** | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/stage1_all_converted/efficient_sam3_efficientvit-b2_mobileclip_s1.pth) | $$\text{Planned}$$ | $$\text{Planned}$$ |
318
+ > **Note (2025/12/08):** The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset and paired with all 9 image encoder variants. We observe some performance degradation; this is expected, as the text encoders are not yet aligned with the lightweight image encoders in Stage 1. We will release Stage 1+ fine-tuned weights in the future.
319
+
320
+ > **Note (2025/12/08):** We have also uploaded standalone text encoder weights trained on 1% of the Recap-DataComp-1B dataset: MobileCLIP-S1 and MobileCLIP2-L. These can be merged with the Stage 1 image encoder weights to obtain a full model.
321
+
322
+ > **Note (2026/01/11):** The fine-tuned (**ft**) text encoder models are fine-tuned on SA-Co Gold+Silver text annotations. Standalone fine-tuned text encoder weights: MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L.
323
+
324
+ ### SAM3-LiteText Models
325
+
326
+ SAM3-LiteText replaces the SAM3 text encoder with a lightweight distilled text encoder, reducing text encoder parameters by up to **88%** with comparable performance. See the [SAM3-LiteText paper](https://arxiv.org/abs/2602.12173) for details.
327
+
328
+ | Model | Text Encoder | Ctx | Text Params | Weights |
329
+ |-------|--------------|-----|-------------|---------|
330
+ | **SAM3-LiteText-S0-16** | MobileCLIP-S0 | 16 | 42.54M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/sam3_litetext/efficient_sam3_image_encoder_mobileclip_s0_ctx16.pt) |
331
+ | **SAM3-LiteText-S1-16** | MobileCLIP-S1 | 16 | 63.53M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/sam3_litetext/efficient_sam3_image_encoder_mobileclip_s1_ctx16.pt) |
332
+ | **SAM3-LiteText-L-16** | MobileCLIP2-L | 16 | 123.80M | [HF](https://huggingface.co/Simon7108528/EfficientSAM3/resolve/main/sam3_litetext/efficient_sam3_image_encoder_mobileclip2_l_ctx16.pt) |
333
+
334
+ > All SAM3-LiteText models keep the original **SAM3 image encoder** unchanged; only the text encoder is replaced. The text encoder parameters shown are for the distilled student replacing the original 353.72M-parameter SAM3 text encoder, achieving up to **88%** parameter reduction.
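The reduction figures can be checked directly from the parameter counts in the table (a quick arithmetic sketch):

```python
# Text-encoder parameter counts in millions, from the table and note above.
teacher = 353.72  # original SAM3 text encoder
students = {"MobileCLIP-S0": 42.54, "MobileCLIP-S1": 63.53, "MobileCLIP2-L": 123.80}

for name, params in students.items():
    print(f"{name}: {(1 - params / teacher) * 100:.1f}% fewer parameters")
# MobileCLIP-S0: 88.0% fewer parameters
# MobileCLIP-S1: 82.0% fewer parameters
# MobileCLIP2-L: 65.0% fewer parameters
```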
335
+
336
+ ---
337
+
338
+ ## Preliminary Evaluation
339
+
340
+ <details>
341
+ <summary>Stage 1 Image Model Evaluation Results (COCO val2017)</summary>
342
+
343
+ | Model Name | Backbone | Parameters | COCO mIoU | Test Time (s) |
344
+ |------------|----------|------------|-----------|---------------|
345
+ | **ES-RV-S** | RepViT-M0.9 | 4.72M | 64.80% | 407.23 |
346
+ | **ES-RV-M** | RepViT-M1.1 | 7.77M | 65.28% (ft 65.60%) | 413.38 |
347
+ | **ES-RV-L** | RepViT-M2.3 | 22.40M | 65.53% | 466.66 |
348
+ | **ES-TV-S** | TinyViT-5M | 5.07M | 65.51% | 430.52 |
349
+ | **ES-TV-M** | TinyViT-11M | 10.55M | 65.45% (ft 65.69%) | 443.45 |
350
+ | **ES-TV-L** | TinyViT-21M | 20.62M | 66.29% | 452.14 |
351
+ | **ES-EV-S** | EfficientViT-B0 | 0.68M | 61.62% | 419.57 |
352
+ | **ES-EV-M** | EfficientViT-B1 | 4.64M | 64.82% (ft 64.94%) | 434.45 |
353
+ | **ES-EV-L** | EfficientViT-B2 | 14.98M | 66.30% | 450.36 |
354
+
355
+ > **Note:** The evaluation is done on a single NVIDIA RTX 4070 Ti GPU.
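The mIoU column averages per-mask intersection-over-union; for reference, IoU on binary masks (toy flat 0/1 lists here, the real evaluation uses 2-D mask arrays):

```python
def iou(pred, gt):
    """Intersection-over-union of two binary masks given as flat 0/1 lists."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0  # two empty masks count as a match

print(iou([1, 1, 0, 0], [1, 0, 1, 0]))  # 1 overlap / 3 union ≈ 0.333
```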
356
+
357
+ </details>
358
+
359
+
360
+ <details>
361
+ <summary>Stage 1 Text Encoder Evaluation Results (SA-Co/VEval Noun Phrases)</summary>
362
+
363
+ Metric: average token-level cosine similarity between student text features and SAM3 text encoder features.
364
+
365
+ **Pretrained on 1% Recap-DataComp-1B**
366
+
367
+ | Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
368
+ |------------|--------------|-------------------|----------|
369
+ | **ES-MC-S0 (Recap-DC1B 1% pt)** | MobileCLIP-S0 | 0.864846 | 5184 noun phrases |
370
+ | **ES-MC-S1 (Recap-DC1B 1% pt)** | MobileCLIP-S1 | 0.854405 | 5184 noun phrases |
371
+ | **ES-MC2-L (Recap-DC1B 1% pt)** | MobileCLIP2-L | 0.850976 | 5184 noun phrases |
372
+
373
+ **Fine-tuned on SA-Co Gold+Silver text annotations**
374
+
375
+ | Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
376
+ |------------|--------------|-------------------|----------|
377
+ | **ES-MC-S0 (SA-Co ft)** | MobileCLIP-S0 | 0.938915 | 5184 noun phrases |
378
+ | **ES-MC-S1 (SA-Co ft)** | MobileCLIP-S1 | 0.947152 | 5184 noun phrases |
379
+ | **ES-MC2-L (SA-Co ft)** | MobileCLIP2-L | 0.952901 | 5184 noun phrases |
380
+
381
+ > **Note:** Evaluation is done with [eval_text_encoder_similarity.py](eval/eval_text_encoder_similarity.py) using `data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json`. Pretrained models are trained on Recap-DataComp-1B (1%), and fine-tuned models are trained on SA-Co Gold+Silver text annotations.
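A minimal illustration of the metric reported above, average token-level cosine similarity between student and teacher features (toy 2-token, 3-dim features; the real script compares full CLIP embedding sequences):

```python
import math

def avg_token_cosine(student_feats, teacher_feats):
    """Average per-token cosine similarity over [tokens x dim] feature lists."""
    sims = []
    for s, t in zip(student_feats, teacher_feats):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        sims.append(dot / norm)
    return sum(sims) / len(sims)

student = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(avg_token_cosine(student, teacher))  # ≈ (1.0 + 0.8) / 2 = 0.9
```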
382
+
383
+ </details>
384
+
385
+ <details>
386
+ <summary>SAM3-LiteText Evaluation Results (SA-Co/Gold, Metric: CG_F1)</summary>
387
+
388
+ | Model | Ctx | MetaClip | SA1B | Crowd | Food | SptEq | Attr | Wiki | **Avg F1** | **MCC** | **pmF1** |
389
+ |-------|-----|----------|------|-------|------|-------|------|------|------------|---------|----------|
390
+ | **gDino-T** | - | 2.9 | 3.1 | 0.28 | 0.96 | 1.1 | 13.8 | 0.70 | 3.3 | 0.15 | 16.2 |
391
+ | **OWLv2** | - | 12.2 | 9.8 | 8.9 | 24.4 | 24.4 | 25.9 | 15.4 | 17.3 | 0.46 | 36.8 |
392
+ | **LLMDet-L** | - | 4.5 | 5.3 | 2.4 | 5.5 | 4.4 | 22.2 | 1.2 | 6.5 | 0.21 | 27.3 |
393
+ | **APE-D** | - | 12.6 | 2.2 | 7.2 | 22.7 | 31.8 | 26.7 | 11.6 | 16.4 | 0.40 | 36.9 |
394
+ | **DINO-X** | - | 17.2 | 19.7 | 12.9 | 30.1 | 28.4 | 31.0 | 9.7 | 21.3 | 0.38 | 55.2 |
395
+ | **Gemini 2.5** | - | 9.9 | 13.1 | 8.2 | 19.6 | 15.1 | 18.8 | 6.5 | 13.0 | 0.29 | 46.1 |
396
+ | **SAM3** | 77 | 47.3 | 53.7 | 61.1 | 53.4 | 65.5 | 54.9 | 42.5 | 54.1 | 0.82 | 66.1 |
397
+ | **SAM3-LiteText-S0** | 16 | 47.06 | 53.42 | 60.58 | 52.18 | 65.05 | 54.86 | 42.12 | 53.61 | 0.81 | 65.54 |
398
+ | **SAM3-LiteText-S1** | 16 | 47.18 | 53.58 | 60.76 | 52.43 | 65.28 | 55.02 | 42.35 | 53.80 | 0.81 | 65.72 |
399
+ | **SAM3-LiteText-L** | 16 | 47.24 | 53.66 | 60.88 | 52.65 | 65.49 | 55.19 | 42.54 | 53.95 | 0.81 | 65.87 |
400
+
401
+ > **Note:** This table shows performance of the released ctx-16 models, which were trained with a more extensive dataset mixture compared to the models reported in the paper. As a result, performance may differ slightly from the values in the associated publication.
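The per-domain F1 values above combine precision and recall in the usual way; as a reminder (the benchmark-specific CG_F1 aggregation is defined in the SAM3 paper, not reproduced here):

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(8, 2, 2), 3))  # precision = recall = 0.8, so F1 = 0.8
```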
402
+
403
+ </details>
404
+
405
+ ---
406
+
407
+
408
+ ## CoreML / ONNX Export
409
+
410
+ Coming soon: export pipelines to ONNX and CoreML for cross-platform deployment.
411
+
412
+ ---
413
+
414
+ ## Web Demo
415
+
416
+ Coming soon: an interactive web demo for real-time concept segmentation and tracking.
417
+
418
+ ---
419
+ ## Development To-Do List
420
+
421
+ - [x] **Release Stage 1 Image Encoder Weights**: Distilled image encoder weights from SAM3 image encoder for all 9 variants (RepViT, TinyViT, EfficientViT)
422
+ - [x] **Release Stage 1 Text Encoder Weights**: Distill SAM3 text encoder weights to MobileCLIP-S1 combined with all 9 image encoder variants
423
+ - [x] **Release Stage 1+ Fine-Tuned Encoder Weights**: Prompt-in-the-loop supervised fine-tuning for improved encoder performance
424
+ - [x] **Release SAM3-LiteText Weights**: Distilled a lightweight MobileCLIP text encoder competitive with the SAM3 text encoder for efficient vision-language segmentation
425
- [ ] **Release Stage 2 Memory Bank Aligned Model Weights**: Models with Perceiver-based memory compression trained on the SA-V dataset
- [ ] **Release Stage 3 Fine-Tuned Model Weights**: End-to-end fine-tuned models on the SAM3 dataset with full PCS capabilities
- [ ] **ONNX/CoreML Export**: Export models to ONNX and CoreML formats for cross-platform deployment
- [ ] **Web Demo**: Interactive web demonstration for real-time concept segmentation and tracking

---

## Call for Pull Requests
The idea for this repository originated from my work on SAM2 at Amazon, particularly the research described in [this paper](https://ieeexplore.ieee.org/abstract/document/11084428). Due to company policy, I cannot share that codebase. This year I am super excited to work on making SAM3 more efficient and accessible to the community.

We welcome contributions to EfficientSAM3! Please feel free to submit pull requests that improve the codebase, add new features, or fix bugs. In particular, we are looking for:
- Efficient MedSAM3 integration (see [MedSAM2 by Bo Wang Lab](https://github.com/bowang-lab/MedSAM2))
- A Gradio demo (e.g. [EfficientTAM on Hugging Face Spaces](https://huggingface.co/spaces/yunyangx/EfficientTAM))
- A web demo deployed with Vercel (e.g. [Segment Anything Web UI](https://segment-anything-webui.vercel.app/))
- Annotation tools, such as [X-AnyLabeling](https://github.com/CVHub520/X-AnyLabeling) and [AnyLabeling](https://github.com/vietanhdev/anylabeling)
- An iOS or Android app (e.g. [Cutcha Photo on the App Store](https://apps.apple.com/us/app/cutcha-photo/id6478521132))
- An NVCC-based desktop application
- Anything else that you think is cool!

---

All meaningful contributions will be acknowledged and integrated into both the repository and the associated paper. We warmly welcome all contributors and happily offer co-authorship to those whose work merits inclusion in the paper.

## Citation

If you use EfficientSAM3 in your research, please cite:

```bibtex
@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
      title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
      author={Chengxi Zeng and Yuxuan Jiang and Aaron Zhang},
      year={2025},
      eprint={2511.15833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15833},
}
```

```bibtex
@misc{zeng2026sam3litetextanatomicalstudysam3,
      title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation},
      author={Chengxi Zeng and Yuxuan Jiang and Ge Gao and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2026},
      eprint={2602.12173},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.12173},
}
```

## License

This repository is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

This project builds upon [SAM](https://github.com/facebookresearch/segment-anything), [SAM2](https://github.com/facebookresearch/sam2), [SAM3](https://github.com/facebookresearch/sam3), [EdgeSAM](https://github.com/chongzhou96/EdgeSAM), [EdgeTAM](https://github.com/facebookresearch/EdgeTAM), [EfficientTAM](https://github.com/yformer/EfficientTAM), [RepViT](https://github.com/THU-MIG/RepViT), [TinyViT](https://github.com/wkcn/TinyViT), [EfficientViT](https://github.com/mit-han-lab/efficientvit), and [MobileCLIP](https://github.com/apple/ml-mobileclip). Please refer to their respective licenses for usage terms.

## Acknowledgments

We gratefully acknowledge the [University of Bristol Isambard-AI supercomputer cluster](https://www.bristol.ac.uk/research/centres/bristol-supercomputing/articles/2025/isambard-ai-is-11th-fastest-supercomputer-in-the-world.html) for providing computational resources for this project. Special thanks to [Dr. Fan Aaron Zhang](https://fan-aaron-zhang.github.io/) for allocating resources and supporting this research.

---

## Users

Organizations and projects using EfficientSAM3:

<table>
  <tr>
    <td align="center" width="20%">
      <img src="https://github.com/SimonZeng7108/simonzeng7108.github.io/blob/main/efficientsam3/static/images/esa.png" alt="European Space Agency" height="80"><br>
      <a href="https://www.esa.int/">European Space Agency</a>
    </td>
  </tr>
</table>

> **Note:** If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.