---
license: apache-2.0
tags:
- image-segmentation
- segment-anything
- segment-anything-3
- open-vocabulary
- text-to-segmentation
- onnx
- onnxruntime
library_name: onnxruntime
base_model:
- facebook/sam3
---
# Segment Anything 3 (SAM 3) – ONNX Models
ONNX-exported version of Meta's **Segment Anything Model 3 (SAM 3)**, an open-vocabulary segmentation model that accepts **text prompts** in addition to points and rectangles.
SAM 3 uses a CLIP-based language encoder to let you describe objects in natural language (e.g., `"truck"`, `"person with hat"`) and segment them without task-specific training.
These models are used by **[AnyLabeling](https://github.com/vietanhdev/anylabeling)** for AI-assisted image annotation, and exported by **[samexporter](https://github.com/vietanhdev/samexporter)**.
## Available Models
| File | Contents | Description |
|------|----------|-------------|
| `sam3_vit_h.zip` | 3 ONNX files | SAM 3 ViT-H (all components) |
The zip contains three ONNX components that work together:
| ONNX File | Role | Runs |
|-----------|------|------|
| `sam3_image_encoder.onnx` | Extracts visual features from the input image | Once per image |
| `sam3_language_encoder.onnx` | Encodes text prompt tokens into feature vectors | Once per text query |
| `sam3_decoder.onnx` | Produces segmentation masks given image + language features | Per prompt |
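A minimal loading sketch with ONNX Runtime is shown below. The wrapper class, lazy session creation, and provider choice are illustrative assumptions; the file names match the zip contents, but input/output tensor names depend on the export and are not shown here:

```python
import os


class Sam3Sessions:
    """Lazily creates one ONNX Runtime session per SAM 3 component (a sketch)."""

    COMPONENTS = ("sam3_image_encoder", "sam3_language_encoder", "sam3_decoder")

    def __init__(self, model_dir):
        # Map each component name to its .onnx file inside the extracted zip.
        self.paths = {c: os.path.join(model_dir, c + ".onnx") for c in self.COMPONENTS}
        self._sessions = {}

    def get(self, component):
        # Create each InferenceSession on first use and cache it for reuse.
        # Import here so paths can be inspected without onnxruntime installed.
        if component not in self._sessions:
            import onnxruntime as ort
            self._sessions[component] = ort.InferenceSession(
                self.paths[component], providers=["CPUExecutionProvider"]
            )
        return self._sessions[component]
```

Caching the sessions matters because the image encoder is heavy and should be created (and run) once, while the decoder session stays loaded for interactive prompting.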
## Prompt Types
SAM 3 supports **three prompt modalities**:
| Prompt | Description |
|--------|-------------|
| **Text** | Natural-language description, e.g. `"truck"` (unique to SAM 3) |
| **Point** | Click `+point` / `-point` to include/exclude regions |
| **Rectangle** | Draw a bounding box around the target object |
Text prompts are the recommended workflow: they enable open-vocabulary detection, so you can label **any object class** without retraining.
## Use with AnyLabeling (Recommended)
[AnyLabeling](https://github.com/vietanhdev/anylabeling) is a desktop annotation tool with a built-in model manager that downloads, caches, and runs these models automatically, with no coding required.
1. Install: `pip install anylabeling`
2. Launch: `anylabeling`
3. Click the **Brain** button → select **Segment Anything 3 (ViT-H)** from the dropdown
4. Type a text description (e.g., `truck`) in the text prompt field
5. Optionally refine with point/rectangle prompts
## Use Programmatically with ONNX Runtime
```python
import urllib.request, zipfile
url = "https://huggingface.co/vietanhdev/segment-anything-3-onnx-models/resolve/main/sam3_vit_h.zip"
urllib.request.urlretrieve(url, "sam3_vit_h.zip")
with zipfile.ZipFile("sam3_vit_h.zip") as z:
    z.extractall("sam3")
```
Then use [samexporter](https://github.com/vietanhdev/samexporter)'s inference module:
```bash
pip install samexporter
# Text prompt
python -m samexporter.inference \
--sam_variant sam3 \
--encoder_model sam3/sam3_image_encoder.onnx \
--decoder_model sam3/sam3_decoder.onnx \
--language_encoder_model sam3/sam3_language_encoder.onnx \
--image photo.jpg \
--prompt prompt.json \
--text_prompt "truck" \
--output result.png
```
Example `prompt.json` for a text-only query:
```json
[{"type": "text", "data": "truck"}]
```
## Model Architecture
SAM 3 follows the same encoder/decoder pattern as SAM and SAM 2, with an added CLIP-based language branch:
```
Input image ──► Image Encoder ──────────┐
                                        ▼
Text prompt ──► Language Encoder ──► Decoder ──► Masks + Scores + Boxes
                                        ▲
Optional: point / box prompts ──────────┘
```
The **image encoder** runs once per image and caches features. The **language encoder** runs once per text query. The **decoder** is lightweight and runs interactively for each prompt combination.
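This run pattern can be sketched with stub functions standing in for the three ONNX sessions. The stubs are placeholders for illustration, not the real model interfaces; only the call-count structure mirrors the actual pipeline:

```python
# Stub pipeline: heavy encoders run once, the light decoder runs per prompt.
calls = {"image": 0, "text": 0, "decode": 0}

def encode_image(image):
    # Stands in for sam3_image_encoder.onnx (run once per image, then cached).
    calls["image"] += 1
    return ("img_feat", image)

def encode_text(text):
    # Stands in for sam3_language_encoder.onnx (run once per text query).
    calls["text"] += 1
    return ("txt_emb", text)

def decode(img_feat, txt_emb, points=None):
    # Stands in for sam3_decoder.onnx (cheap, run per prompt combination).
    calls["decode"] += 1
    return ("mask", img_feat, txt_emb, points)

img_feat = encode_image("photo.jpg")   # cache per image
txt_emb = encode_text("truck")         # cache per text query
for points in (None, [(120, 80)], [(120, 80), (300, 200)]):
    decode(img_feat, txt_emb, points)  # interactive refinement loop

print(calls)  # {'image': 1, 'text': 1, 'decode': 3}
```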
## Re-export from Source
To re-export or customize the models using [samexporter](https://github.com/vietanhdev/samexporter):
```bash
pip install samexporter
# Export all three SAM 3 ONNX components
python -m samexporter.export_sam3 --output_dir output_models/sam3
# Or use the convenience script:
bash convert_sam3.sh
```
## Custom Model Config for AnyLabeling
To use a locally re-exported SAM 3 as a custom model in AnyLabeling, create a `config.yaml`:
```yaml
type: segment_anything
name: sam3_vit_h_custom
display_name: Segment Anything 3 (ViT-H)
encoder_model_path: sam3_image_encoder.onnx
decoder_model_path: sam3_decoder.onnx
language_encoder_path: sam3_language_encoder.onnx
input_size: 1008
max_height: 1008
max_width: 1008
```
Then load it via **Brain button → Load Custom Model** in AnyLabeling.
## Related Repositories
| Repo | Description |
|------|-------------|
| [vietanhdev/samexporter](https://github.com/vietanhdev/samexporter) | Export scripts, inference code, conversion tools |
| [vietanhdev/anylabeling](https://github.com/vietanhdev/anylabeling) | Desktop annotation app powered by these models |
| [facebook/sam3](https://huggingface.co/facebook/sam3) | Original SAM 3 PyTorch checkpoint by Meta |
## License
The ONNX models are derived from Meta's SAM 3, released under the **[SAM License](https://github.com/facebookresearch/sam3/blob/main/LICENSE)**.
The export code is part of [samexporter](https://github.com/vietanhdev/samexporter), released under the **MIT** license. |