SAM 3: Text-Promptable Image Segmentation (ONNX)
ONNX export of Meta's SAM 3 image model, the text-promptable concept segmentation variant. Run open-vocabulary image segmentation from a text prompt ("seed", "cat", "yellow school bus") in any environment that has an ONNX runtime: Python, C++, Rust, browsers via WebAssembly/WebGPU.
To my knowledge this is the first public ONNX export of SAM 3 image with text prompts. Other community exports (e.g. onnx-community/sam3-tracker-ONNX) cover the tracker variant which only accepts point/box prompts.
Why this exists
facebook/sam3 is published in PyTorch only. Running it in browsers, mobile, or any non-Python environment requires ONNX. As of this export's publication:
- optimum-onnx does not have native SAM 3 support (the CLI fails with "Trying to export a sam3_video model, that is a custom or unsupported architecture").
- Meta's facebookresearch/sam3 repository ships no ONNX export tooling.
- The existing community ONNX export (vietanhdev/segment-anything-3-onnx-models) targets Python onnxruntime only and isn't compatible with transformers.js / onnxruntime-web.
- SegmentLens (sam3.ai) paywalls text-prompt SAM 3 behind server-side cloud inference rather than shipping it in-browser.
This export was produced by hand-wrapping the three sub-modules of Sam3Model and calling torch.onnx.export directly on each. It was validated end-to-end against the original PyTorch model: detection count and box locations match on a held-out test image.
What's in this repo
sam3-text-onnx/
│
├─ fp32 reference (3.3 GB total): bit-equivalent to PyTorch
│   ├── vision_encoder.onnx            (6.2 MB graph)
│   ├── vision_encoder.onnx.data       (1.84 GB weights)
│   ├── text_encoder.onnx              (3.0 MB graph)
│   ├── text_encoder.onnx.data         (1.35 GB weights)
│   └── decoder.onnx                   (96 MB, weights inline)
│
├─ int8 dynamic quantization (839 MB total): production default
│   ├── vision_encoder_int8.onnx       (473 MB)
│   ├── text_encoder_int8.onnx         (340 MB)
│   └── decoder_int8.onnx              (26 MB)
│
├─ int4 MatMul quantization (654 MB total): browser/mobile
│   ├── vision_encoder_int4.onnx       (5.6 MB graph)
│   ├── vision_encoder_int4.onnx.data  (279 MB weights)
│   ├── text_encoder_int4.onnx         (2.7 MB graph)
│   ├── text_encoder_int4.onnx.data    (348 MB weights)
│   └── decoder_int4.onnx              (19 MB)
│
├─ Export scripts (PyTorch → ONNX)
│   ├── export_sam3_vision.py          # produces vision_encoder.onnx
│   ├── export_sam3_text.py            # produces text_encoder.onnx
│   └── export_sam3_decoder.py         # produces decoder.onnx
│
├─ Quantization scripts (ONNX fp32 → smaller)
│   ├── quantize_sam3_fp16.py          # ⚠️ currently broken, see Known Caveats
│   ├── quantize_sam3_int8.py
│   └── quantize_sam3_int4.py
│
└─ Validation scripts
    ├── validate_sam3_e2e.py           # end-to-end pipeline test
    └── validate_all_variants.py       # compares fp32/int8/int4 side-by-side
Three precision variants, each independently usable. See "Which precision should I use?" for guidance.
Architecture
SAM 3 is structured so the vision encoder runs once per image and produces multi-scale FPN features. The text encoder runs once per prompt. The decoder consumes both and produces masks/boxes/scores. This means changing the prompt while keeping the same image only re-runs the cheap text + decoder path:
image       ──► vision_encoder.onnx ──────┐
                                          ▼
text prompt ──► text_encoder.onnx  ──► decoder.onnx ──► pred_masks
                                                        pred_boxes  (xyxy normalized)
                                                        pred_logits (sigmoid → scores)
Usage: Python with onnxruntime
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoTokenizer
from transformers.models.sam3.image_processing_sam3 import Sam3ImageProcessor
MODEL_ID = "facebook/sam3" # for the preprocessors only; weights come from ONNX
# Preprocessors (still come from HF; they're tiny and have no ONNX equivalent)
image_processor = Sam3ImageProcessor.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Load the three ONNX components
vision_sess = ort.InferenceSession("vision_encoder.onnx", providers=["CPUExecutionProvider"])
text_sess = ort.InferenceSession("text_encoder.onnx", providers=["CPUExecutionProvider"])
decoder_sess = ort.InferenceSession("decoder.onnx", providers=["CPUExecutionProvider"])
# Preprocess
image = Image.open("your-image.png").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="np")["pixel_values"]
encoded = tokenizer("seed", return_tensors="np", padding="max_length", max_length=32, truncation=True)
input_ids = encoded["input_ids"].astype(np.int64)
attention_mask = encoded["attention_mask"].astype(np.int64)
# 1. Vision encoder: produces 4 FPN feature maps + position encodings
v_out = vision_sess.run(None, {"pixel_values": pixel_values})
fpn_hidden_states = v_out[0:4] # spatial scales 288, 144, 72, 36 at 1008x1008 input
fpn_position_encoding = v_out[4:8]
# 2. Text encoder: projects "seed" to a 256-dim feature
text_features = text_sess.run(None, {
"input_ids": input_ids,
"attention_mask": attention_mask,
})[0]
# 3. Decoder: uses the first 3 FPN levels and only the level-2 position encoding
pred_masks, pred_boxes, pred_logits = decoder_sess.run(None, {
"fpn_hidden_state_0": fpn_hidden_states[0],
"fpn_hidden_state_1": fpn_hidden_states[1],
"fpn_hidden_state_2": fpn_hidden_states[2],
"fpn_position_encoding_2": fpn_position_encoding[2],
"text_features": text_features,
"attention_mask": attention_mask,
})
# Convert logits to scores in [0, 1]
scores = 1.0 / (1.0 + np.exp(-pred_logits))
# Filter by confidence
keep = scores[0] > 0.5
print(f"Detections: {keep.sum()}")
print(f"Boxes (xyxy normalized): {pred_boxes[0, keep]}")
print(f"Scores: {scores[0, keep]}")
A complete runnable version is in validate_sam3_e2e.py.
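The printed boxes above are still normalized. As a minimal sketch of the remaining post-processing, the helper below sorts the kept detections by score and scales boxes to pixel coordinates. The function name `postprocess` is hypothetical, and it assumes the image was resized to the model input without padding, so normalized coordinates map linearly onto the original width and height; check this against the image processor settings you use.

```python
import numpy as np


def postprocess(pred_boxes, pred_logits, image_width, image_height, threshold=0.5):
    """Turn raw decoder outputs for ONE image into pixel-space detections.

    pred_boxes:  [num_queries, 4] xyxy boxes normalized to [0, 1]
    pred_logits: [num_queries] raw logits
    Assumption: no padding was applied during preprocessing.
    """
    scores = 1.0 / (1.0 + np.exp(-pred_logits.astype(np.float64)))
    keep = np.flatnonzero(scores > threshold)
    # Highest-confidence detections first.
    order = keep[np.argsort(-scores[keep])]
    scale = np.array([image_width, image_height, image_width, image_height], dtype=np.float64)
    boxes_px = pred_boxes[order] * scale
    return order, boxes_px, scores[order]


# Synthetic example: three query slots, two of them above the threshold.
demo_boxes = np.array([[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0], [0.25, 0.25, 0.75, 0.75]])
demo_logits = np.array([2.0, -3.0, 4.0])
indices, boxes_px, scores = postprocess(demo_boxes, demo_logits, image_width=640, image_height=480)
print(indices)   # query indices, best first
print(boxes_px)  # xyxy in pixels
```

With the real outputs this would be called as postprocess(pred_boxes[0], pred_logits[0], image.width, image.height), and the returned indices can be used to pick the matching masks via pred_masks[0, indices].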
Usage: browser with onnxruntime-web
The same recipe in JavaScript. Note: transformers.js does not currently have a Sam3Model JS class for the image variant, so you call onnxruntime-web directly. You still need transformers.js for the tokenizer.
import { AutoTokenizer } from "@huggingface/transformers";
import * as ort from "onnxruntime-web";
const tokenizer = await AutoTokenizer.from_pretrained("facebook/sam3");
const visionSess = await ort.InferenceSession.create(
"https://huggingface.co/danilobukvic/sam3-text-onnx/resolve/main/vision_encoder.onnx",
{ executionProviders: ["webgpu"] }
);
const textSess = await ort.InferenceSession.create(
"https://huggingface.co/danilobukvic/sam3-text-onnx/resolve/main/text_encoder.onnx",
{ executionProviders: ["webgpu"] }
);
const decoderSess = await ort.InferenceSession.create(
"https://huggingface.co/danilobukvic/sam3-text-onnx/resolve/main/decoder.onnx",
{ executionProviders: ["webgpu"] }
);
// Preprocess image to [1, 3, 1008, 1008] Float32Array with ImageNet normalization
// (see Sam3ImageProcessor for exact mean/std values; you'll need to port this)
const pixelValues = preprocessImage(imageBitmap);
// Tokenize prompt to length 32
const { input_ids, attention_mask } = await tokenizer("seed", {
padding: "max_length",
max_length: 32,
truncation: true,
});
// Run the pipeline (same as Python above) ...
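The preprocessing that has to be ported can be prototyped in NumPy first and then translated. The sketch below is a reference for the layout and normalization steps only. It assumes the image has already been resized to 1008×1008, that pixels are rescaled by 1/255 before normalization, and it uses the standard ImageNet mean/std as placeholders; `to_pixel_values`, `ASSUMED_MEAN` and `ASSUMED_STD` are hypothetical names, and the real values should be read from the Sam3ImageProcessor config.

```python
import numpy as np

INPUT_SIZE = 1008  # fixed by the vision encoder contract


def to_pixel_values(rgb_uint8, mean, std):
    """Convert an already-resized HxWx3 uint8 RGB image to a [1, 3, H, W] float32 tensor."""
    if rgb_uint8.shape != (INPUT_SIZE, INPUT_SIZE, 3):
        raise ValueError(f"expected {(INPUT_SIZE, INPUT_SIZE, 3)}, got {rgb_uint8.shape}")
    x = rgb_uint8.astype(np.float32) / 255.0  # assumed rescale to [0, 1]
    x = (x - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)
    x = np.transpose(x, (2, 0, 1))            # HWC -> CHW
    return np.ascontiguousarray(x[np.newaxis, ...])  # add the batch dimension


# Placeholder statistics (standard ImageNet values); verify before relying on them.
ASSUMED_MEAN = (0.485, 0.456, 0.406)
ASSUMED_STD = (0.229, 0.224, 0.225)

gray = np.full((INPUT_SIZE, INPUT_SIZE, 3), 128, dtype=np.uint8)
pixel_values = to_pixel_values(gray, ASSUMED_MEAN, ASSUMED_STD)
print(pixel_values.shape, pixel_values.dtype)
```

Comparing this function's output against Sam3ImageProcessor on the same image is a quick way to confirm the constants before writing the JavaScript version.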
Input/output contracts
vision_encoder.onnx
| I/O | Name | Shape | Type | Notes |
|---|---|---|---|---|
| In | pixel_values | [batch, 3, 1008, 1008] | float32 | ImageNet-normalized. Size is fixed: SAM 3 precomputes positional embeddings for 1008×1008. |
| Out | fpn_hidden_state_0..3 | [batch, 256, H, W] | float32 | Spatial scales: 288, 144, 72, 36 (for 1008 input) |
| Out | fpn_position_encoding_0..3 | [batch, 256, H, W] | float32 | Matching position encodings |
text_encoder.onnx
| I/O | Name | Shape | Type | Notes |
|---|---|---|---|---|
| In | input_ids | [batch, 32] | int64 | CLIP-style tokens; SAM 3 uses max_position_embeddings=32 (shorter than standard CLIP's 77) |
| In | attention_mask | [batch, 32] | int64 | 1 for real tokens, 0 for padding |
| Out | text_features | [batch, 32, 256] | float32 | Projected from CLIP's 1024-dim to SAM 3's 256-dim DETR space |
decoder.onnx
| I/O | Name | Shape | Type | Notes |
|---|---|---|---|---|
| In | fpn_hidden_state_0 | [batch, 256, 288, 288] | float32 | From vision encoder (largest scale) |
| In | fpn_hidden_state_1 | [batch, 256, 144, 144] | float32 | From vision encoder |
| In | fpn_hidden_state_2 | [batch, 256, 72, 72] | float32 | From vision encoder (smallest scale the decoder consumes; the 36×36 level is not an input) |
| In | fpn_position_encoding_2 | [batch, 256, 72, 72] | float32 | Only the 72×72 level's PE is used (others were optimized out by the tracer) |
| In | text_features | [batch, 32, 256] | float32 | From text encoder |
| In | attention_mask | [batch, 32] | int64 | Same mask passed to text encoder |
| Out | pred_masks | [batch, 200, 288, 288] | float32 | 200 query slots, each a low-resolution mask |
| Out | pred_boxes | [batch, 200, 4] | float32 | Boxes in xyxy format, normalized to [0, 1] |
| Out | pred_logits | [batch, 200] | float32 | Apply sigmoid for [0, 1] confidence scores |
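Shape and dtype mistakes in the decoder feed tend to surface as opaque runtime errors, so it can help to check a feed against the contract before calling the session. The sketch below hard-codes the decoder input table from this section; `DECODER_INPUTS` and `check_decoder_feed` are hypothetical names, not part of this repo.

```python
import numpy as np

# Expected per-sample shapes and dtypes, copied from the decoder.onnx table.
DECODER_INPUTS = {
    "fpn_hidden_state_0": ((256, 288, 288), np.float32),
    "fpn_hidden_state_1": ((256, 144, 144), np.float32),
    "fpn_hidden_state_2": ((256, 72, 72), np.float32),
    "fpn_position_encoding_2": ((256, 72, 72), np.float32),
    "text_features": ((32, 256), np.float32),
    "attention_mask": ((32,), np.int64),
}


def check_decoder_feed(feed):
    """Return a list of problems; an empty list means the feed matches the contract."""
    problems = []
    batch_sizes = set()
    for name, (tail, dtype) in DECODER_INPUTS.items():
        if name not in feed:
            problems.append(f"missing input: {name}")
            continue
        arr = feed[name]
        if arr.dtype != dtype:
            problems.append(f"{name}: dtype {arr.dtype}, expected {np.dtype(dtype)}")
        if arr.ndim != len(tail) + 1 or arr.shape[1:] != tail:
            expected = ", ".join(str(d) for d in tail)
            problems.append(f"{name}: shape {arr.shape}, expected [batch, {expected}]")
        else:
            batch_sizes.add(arr.shape[0])
    for name in sorted(feed.keys() - DECODER_INPUTS.keys()):
        problems.append(f"unexpected input: {name}")
    if len(batch_sizes) > 1:
        problems.append(f"inconsistent batch sizes: {sorted(batch_sizes)}")
    return problems
```

A typical use would be to call check_decoder_feed(feed) right before decoder_sess.run(None, feed) and raise if the returned list is non-empty.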
Which precision should I use?
Three precision variants are published. They share the same input/output contracts above; pick whichever matches your constraints.
| Need… | Use | Why |
|---|---|---|
| Reference accuracy / research baseline | fp32 | Matches the PyTorch output to within tracing noise. Reproducible numbers. |
| Production server inference (CPU or GPU > 6 GB VRAM) | int8 | 4× smaller than fp32, same detection count, max score within 0.001 of baseline (other top-5 scores within about 0.025). AVX-512 VNNI and CUDA both accelerate int8 ops natively. |
| Browser / mobile / GPU with ≤ 4 GB VRAM | int4 | 5× smaller than fp32, keeps the right top-N detections, modest quality loss in low-confidence range. |
| GPU inference on a 4 GB consumer card | int8 or int4 (not fp32) | fp32 exhausts VRAM via intermediate Softmax buffers (~1.7 GB). int8 and int4 both fit. |
| Fastest CPU inference | int8 | Modern CPUs accelerate int8 GEMMs via AVX-VNNI; ~3× faster than fp32 even on CPU. |
Default recommendation: int8. It's the same quality as fp32 in practice but a quarter of the size and several times faster. Only use fp32 if you need bit-exact reproducibility for academic work. Only use int4 if you're shipping to a context where every megabyte matters (browsers, mobile, embedded).
Validation evidence
On a microscope-style seed image with prompt "seed" (the test image used during development):
| Variant | Detections > 0.5 | Max score | Top 5 scores |
|---|---|---|---|
| fp32 (baseline) | 12 | 0.926 | 0.926, 0.889, 0.878, 0.874, 0.868 |
| int8 | 12 | 0.925 | 0.925, 0.905, 0.902, 0.898, 0.872 |
| int4 | 12 | 0.898 | 0.898, 0.863, 0.844, 0.797, 0.784 |
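int8 matches fp32 on detection count and max score (0.925 vs. 0.926); its other top-5 scores sit within about 0.025 of the baseline. int4 keeps the right top detections, but its max score is ~3% lower and the gap grows to nearly 10% by the fifth detection.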
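The size of the drift can be read straight off the table. The short sketch below compares the top-5 scores rank by rank; note that this pairs detections by rank, not by identity, so it is only a rough indicator (`max_rank_gap` is a hypothetical helper).

```python
# Top-5 scores copied from the validation table.
TOP5 = {
    "fp32": [0.926, 0.889, 0.878, 0.874, 0.868],
    "int8": [0.925, 0.905, 0.902, 0.898, 0.872],
    "int4": [0.898, 0.863, 0.844, 0.797, 0.784],
}


def max_rank_gap(variant, baseline="fp32"):
    """Largest absolute score difference between same-rank detections."""
    return max(abs(a - b) for a, b in zip(TOP5[variant], TOP5[baseline]))


for variant in ("int8", "int4"):
    print(f"{variant}: max rank-wise gap vs fp32 = {max_rank_gap(variant):.3f}")
# int8: max rank-wise gap vs fp32 = 0.024
# int4: max rank-wise gap vs fp32 = 0.084
```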
Performance
Per-component timings for a single 1008×1008 image. Measured on a laptop with a quad-core Intel CPU and an RTX 3050 Laptop (4 GB VRAM):
| Variant | vision (CPU) | vision (GPU) | text | decoder (CPU) | Total (best path) |
|---|---|---|---|---|---|
| fp32 | 181 s | OOM on 4 GB | 7 s | 13 s | 201 s (CPU only) |
| int8 | 53 s | 27 s | 1 s | 14 s | 42 s (hybrid) |
| int4 | 166 s | 12 s | 0.1 s | 16 s | 28 s (hybrid) |
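The totals in the last column are sums of the per-stage timings, which makes it easy to compare the CPU-only and hybrid paths for each variant. The sketch below redoes that arithmetic from the table; it assumes the text-encoder timing is the same in both paths, since the table reports a single text column.

```python
# Per-component timings in seconds, copied from the table above.
# None marks the fp32 vision encoder, which ran out of memory on the 4 GB GPU.
TIMINGS = {
    "fp32": {"vision_cpu": 181.0, "vision_gpu": None, "text": 7.0, "decoder_cpu": 13.0},
    "int8": {"vision_cpu": 53.0, "vision_gpu": 27.0, "text": 1.0, "decoder_cpu": 14.0},
    "int4": {"vision_cpu": 166.0, "vision_gpu": 12.0, "text": 0.1, "decoder_cpu": 16.0},
}


def total_seconds(variant, vision_on_gpu):
    """Sum the three stages for one image, with the vision encoder on GPU or CPU."""
    t = TIMINGS[variant]
    vision = t["vision_gpu"] if vision_on_gpu else t["vision_cpu"]
    if vision is None:
        raise ValueError(f"{variant} vision encoder has no GPU timing")
    return vision + t["text"] + t["decoder_cpu"]


for variant in TIMINGS:
    cpu_only = total_seconds(variant, vision_on_gpu=False)
    line = f"{variant}: CPU-only {cpu_only:.1f} s"
    if TIMINGS[variant]["vision_gpu"] is not None:
        hybrid = total_seconds(variant, vision_on_gpu=True)
        line += f", hybrid {hybrid:.1f} s ({cpu_only / hybrid:.1f}x faster)"
    print(line)
```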
The GPU+CPU hybrid pattern
On consumer GPUs with limited VRAM, the vision encoder benefits massively from GPU (it's a heavy ViT) but the decoder runs better on CPU because its attention layers need ~860 MB intermediate buffers that don't fit alongside the encoder in VRAM. The hybrid setup:
vsess = ort.InferenceSession("vision_encoder_int4.onnx",
                             providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
tsess = ort.InferenceSession("text_encoder_int4.onnx",
                             providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
dsess = ort.InferenceSession("decoder_int4.onnx",
                             providers=["CPUExecutionProvider"])  # ← decoder on CPU
With this split, the int4 vision encoder runs in ~12 seconds on an RTX 3050 (vs. 166 seconds CPU-only), a 14× speedup on the bottleneck. Total inference drops from ~3 minutes to ~30 seconds.
On a server-class GPU (16+ GB VRAM) all three can run on GPU and inference is sub-second.
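To keep the same script usable on machines with and without CUDA, the provider list can be chosen per component at startup. The helper below is a small sketch of that idea; `providers_for` and its `gpu_components` default are hypothetical, and the `available` argument is meant to be the list returned by onnxruntime.get_available_providers().

```python
def providers_for(component, available, gpu_components=("vision_encoder", "text_encoder")):
    """Pick an onnxruntime provider list for one pipeline component.

    component:      "vision_encoder", "text_encoder" or "decoder"
    available:      provider names available in this onnxruntime build
    gpu_components: components allowed on the GPU (decoder stays on CPU by default)
    """
    providers = ["CPUExecutionProvider"]  # always keep the CPU fallback
    if component in gpu_components and "CUDAExecutionProvider" in available:
        providers.insert(0, "CUDAExecutionProvider")
    return providers


cuda_box = ["CUDAExecutionProvider", "CPUExecutionProvider"]
print(providers_for("vision_encoder", cuda_box))
print(providers_for("decoder", cuda_box))
```

On a large-VRAM GPU, passing gpu_components=("vision_encoder", "text_encoder", "decoder") would place all three sessions on CUDA.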
Prompt-iteration architecture
The decoder uses pre-computed vision features as inputs, so swapping the prompt while keeping the image only re-runs the cheap text + decoder path (~15 s end-to-end, vs ~30+ seconds for a fresh image). This makes interactive use viable.
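One way to exploit this in application code is to cache the vision-encoder output per image and re-run only the text and decoder stages for each new prompt. The class below is a sketch of that pattern, not code from this repo: `PromptIterator` is a hypothetical name, and the three callables stand in for thin wrappers around the ONNX sessions.

```python
class PromptIterator:
    """Caches vision features so a new prompt only re-runs text + decoder."""

    def __init__(self, run_vision, run_text, run_decoder):
        self._run_vision = run_vision
        self._run_text = run_text
        self._run_decoder = run_decoder
        self._features = None

    def set_image(self, pixel_values):
        # Expensive step: runs once per image.
        self._features = self._run_vision(pixel_values)

    def segment(self, prompt):
        if self._features is None:
            raise RuntimeError("call set_image() before segment()")
        text = self._run_text(prompt)  # cheap step: runs once per prompt
        return self._run_decoder(self._features, text)


# Stand-in stages that just count calls, to show the caching behaviour.
calls = {"vision": 0, "text": 0, "decoder": 0}

def fake_vision(x):
    calls["vision"] += 1
    return ("features", x)

def fake_text(prompt):
    calls["text"] += 1
    return ("text", prompt)

def fake_decoder(features, text):
    calls["decoder"] += 1
    return (features, text)

session = PromptIterator(fake_vision, fake_text, fake_decoder)
session.set_image("image-0")
for prompt in ("seed", "leaf", "stem"):
    session.segment(prompt)
print(calls)  # {'vision': 1, 'text': 3, 'decoder': 3}
```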
Validation
The fp32 ONNX export was compared end-to-end against the original PyTorch Sam3Model.forward() on a microscope-style seed image with prompt "seed". Detection count, top scores, and box locations matched to within tracing noise.
The quantized variants (int8, int4) were then compared against the fp32 ONNX baseline; see the table under "Which precision should I use?" for results.
To reproduce on your own image:
python validate_sam3_e2e.py path/to/image.png "seed" # single variant
python validate_all_variants.py path/to/image.png "seed" # all 3 side-by-side
Quantization results
Three precision variants are published in this repo. Actual measured sizes:
| Component | fp32 | int8 (dynamic) | int4 (MatMul block) |
|---|---|---|---|
| vision_encoder | 1.84 GB | 473 MB (3.7×) | 285 MB (6.5×) |
| text_encoder | 1.35 GB | 340 MB (4.0×) | 350 MB (3.9×) |
| decoder | 96 MB | 26 MB (3.6×) | 19 MB (4.9×) |
| Total | 3.3 GB | 839 MB (4.0×) | 654 MB (5.0×) |
Notes on each variant
int8: quantize_sam3_int8.py uses onnxruntime.quantization.quantize_dynamic with QuantType.QInt8. Per-tensor symmetric quantization of MatMul/Gemm weights. No calibration data needed. On the test image, detection count matches fp32 and the top-5 scores stay within about 0.025 of the fp32 values.
int4: quantize_sam3_int4.py uses MatMulNBitsQuantizer with block_size=128, bits=4, is_symmetric=True. Only quantizes MatMul ops; other ops stay fp32. This is the same scheme HuggingFace uses with dtype: "q4" in transformers.js, and what onnx-community/sam3-tracker-ONNX uses for its vision encoder.
Note on text encoder size: for the text encoder, int4 gives no size advantage over int8 (350 MB vs. 340 MB). The block-wise int4 scheme targets MatMul weights only, so the CLIP token embedding table (a lookup matrix, not a MatMul) stays fp32 in the int4 file.
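To make the difference between the two schemes concrete, the NumPy sketch below implements generic symmetric quantization with one scale per block. Setting the block to the whole tensor with 8 bits mimics the per-tensor int8 case, and 4 bits with block_size=128 mimics the int4 case. This illustrates the idea only: onnxruntime's actual storage layout, rounding and packing may differ, and `quantize_blockwise` / `dequantize_blockwise` are hypothetical names.

```python
import numpy as np


def quantize_blockwise(weights, bits=4, block_size=128):
    """Symmetric block-wise quantization of a 1-D fp32 weight vector.

    Each block of `block_size` values shares one fp32 scale. Values are rounded
    to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1].
    The vector length must be a multiple of block_size.
    """
    qmax = 2 ** (bits - 1) - 1
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # all-zero blocks: any scale works
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales.astype(np.float32)


def dequantize_blockwise(q, scales):
    """Reconstruct approximate fp32 weights from integers and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(-1)


rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)

for bits, block in ((8, 1024), (4, 128)):  # per-tensor int8 vs. block-wise int4
    q, s = quantize_blockwise(w, bits=bits, block_size=block)
    err = np.abs(dequantize_blockwise(q, s) - w).max()
    print(f"bits={bits} block_size={block}: max abs error {err:.4f}")
```

The worst-case error per value is half a quantization step (scale / 2), which is why 4-bit weights need small blocks to stay usable while 8-bit weights tolerate a single scale per tensor.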
fp16 status
A fully-working fp16 variant is not currently published. onnxconverter_common.float16.convert_float_to_float16 produces type-mismatch errors when applied to the decoder: PyTorch's traced Cast ops create incompatible type boundaries that the converter doesn't propagate cleanly through. Blocking Cast/CastLike from conversion fixes the vision encoder but the decoder has additional Mul ops with the same problem.
For practical purposes, int8 dominates fp16 anyway: int8 is half the size with equivalent quality. fp16 has no real use case here. See quantize_sam3_fp16.py for the current (partially working) attempt if you want to try fixing it.
Known caveats and TODOs
- Fixed input size: The vision encoder's positional embeddings are precomputed for 1008×1008 input. Other sizes will produce shape mismatches. Use the SAM 3 image processor to handle resizing/padding.
- Geometry prompts not supported: This export covers the text-only path. SAM 3's optional box/point prompts are not wired in; they would need a separate decoder variant.
- Batch dimension is symbolic but untested: The export uses a symbolic batch dim, but I've only validated batch=1. Higher batch sizes should work, but no guarantees.
- fp16 export is broken: See fp16 status. int8 is the recommended reduced-precision alternative.
- fp32 on 4 GB VRAM: The fp32 vision encoder's Softmax allocates a ~1.7 GB intermediate buffer; on consumer GPUs this exhausts VRAM. Use int8/int4 for GPU, or fp32 on CPU only.
- Tracer warnings during export: A few "TracerWarning: Converting a tensor to a Python boolean..." messages were emitted. These bake config flags (e.g. is_causal, attention backend selection) into the graph at export time. Fine for inference with the same model config, but means the ONNX isn't reusable across configs.
- No transformers.js integration: At time of writing, transformers.js only has Sam3TrackerModel (point/box) but not Sam3Model (text-promptable image). Until that lands, use onnxruntime-web directly.
- Tracer optimized out unused inputs: The decoder ONNX accepts only fpn_position_encoding_2 (not _0 or _1) because those weren't actually used in the traced forward path. Inspect decoder_sess.get_inputs() for the canonical input names.
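Because the tracer can drop inputs, it is safer to build the decoder feed from the names the session actually declares than to hard-code them. The helper below is a sketch of that approach; `build_feed` is a hypothetical name, and `declared_names` is expected to come from something like [i.name for i in decoder_sess.get_inputs()].

```python
def build_feed(declared_names, tensors):
    """Select exactly the tensors a session declares as inputs.

    declared_names: input names reported by the session
    tensors:        dict of every tensor you have on hand, keyed by name
    Extra tensors are ignored; a declared input with no tensor raises KeyError.
    """
    missing = [name for name in declared_names if name not in tensors]
    if missing:
        raise KeyError(f"session expects inputs that were not provided: {missing}")
    return {name: tensors[name] for name in declared_names}


# Stand-in values; in real use these would be NumPy arrays.
on_hand = {
    "fpn_position_encoding_0": "pe0",
    "fpn_position_encoding_1": "pe1",
    "fpn_position_encoding_2": "pe2",
    "text_features": "text",
}
feed = build_feed(["fpn_position_encoding_2", "text_features"], on_hand)
print(sorted(feed))  # ['fpn_position_encoding_2', 'text_features']
```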
How this was built
Three Python scripts, one per sub-module, each calling torch.onnx.export on a thin wrapper around the SAM 3 component:
- export_sam3_vision.py: wraps Sam3VisionModel, flattens Sam3VisionEncoderOutput into a tuple of tensors. Uses the default (dynamo) exporter.
- export_sam3_text.py: wraps CLIPTextModelWithProjection + the text_projection Linear layer. Key gotcha: SAM 3's text config has max_position_embeddings=32 (shorter than standard CLIP's 77).
- export_sam3_decoder.py: wraps detr_encoder + detr_decoder + mask_decoder + dot_product_scoring + box head as a single pipeline. Uses dynamo=False (legacy tracer) because the dynamo exporter crashes on the attention reshape-after-transpose pattern (Cannot view a tensor with shape (1, 201, 8, 32) ... as a tensor with shape (s2, 201, 256)).
Quantization scripts (added after the initial export):
- quantize_sam3_int8.py: onnxruntime.quantization.quantize_dynamic with QInt8 weight type. Works on all three components out of the box.
- quantize_sam3_int4.py: MatMulNBitsQuantizer from onnxruntime.quantization.matmul_nbits_quantizer. block_size=128, bits=4, symmetric. Only quantizes MatMul ops; other ops stay fp32.
- quantize_sam3_fp16.py: uses onnxconverter_common.float16.convert_float_to_float16 with op_block_list=["Cast", "CastLike"]. Fixes the vision encoder but the decoder still has type mismatches at Mul ops; partial fix only.
See validate_sam3_e2e.py and validate_all_variants.py for the end-to-end pipeline tests.
Reproducing this export
If you want to redo the export from scratch (e.g. to target a different opset, change the input size, or add geometry prompts), here's the full sequence.
Prerequisites
You need transformers v5 (which is in pre-release as of writing) and a build of optimum-onnx from source. The stable optimum releases require transformers v4 and don't know about SAM 3. Install from source with strict dependency control:
py -3.11 -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip
# Clone transformers from source (v5 is required for the Sam3Model class)
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install .[torch]
cd ..
# Upgrade huggingface_hub to v1+ (transformers v5 requires it; --no-deps avoids
# pulling old optimum back down)
pip install --upgrade --no-deps "huggingface_hub>=1.0"
# Install optimum-onnx from source with --no-deps to preserve transformers v5
pip install git+https://github.com/huggingface/optimum-onnx.git --no-deps
# Remaining deps
pip install onnx onnxruntime onnxscript onnx_ir pillow torchvision opencv-python --no-cache-dir
Verify the install:
python -c "from transformers import Sam3Model, Sam3Processor; print('SAM3 OK')"
python -c "from optimum.exporters.onnx import main_export; print('optimum OK')"
You'll also need a HuggingFace token to download facebook/sam3:
$env:HF_TOKEN = "hf_xxx"
Run the export
# fp32 export (~15 min total on CPU)
python export_sam3_vision.py # ~5-10 min, produces vision_encoder.onnx + .data
python export_sam3_text.py # ~2 min, produces text_encoder.onnx + .data
python export_sam3_decoder.py # ~3 min, produces decoder.onnx
# Validate against PyTorch (~3 min, only do once)
python validate_sam3_e2e.py path/to/test-image.png "your prompt"
Quantize (optional but recommended)
python quantize_sam3_int8.py # ~2 min, produces *_int8.onnx
python quantize_sam3_int4.py # ~3 min, produces *_int4.onnx
python validate_all_variants.py path/to/test-image.png "your prompt"
Hardware notes
- CPU is sufficient for both export and inference. GPU helps for fast iteration during inference but isn't needed for the export itself.
- At least 8 GB system RAM is needed during the vision encoder export: the model is ~3.4 GB and the tracer holds intermediate state.
- onnxruntime-gpu requires CUDA 12.x runtime libraries. If you have an NVIDIA driver supporting CUDA 12.x but no CUDA toolkit installed, pip install nvidia-cudnn-cu12 nvidia-cublas-cu12 nvidia-cuda-runtime-cu12 nvidia-cufft-cu12 brings the needed DLLs.
License
This export inherits the license of the original facebook/sam3 weights: the SAM 3 License Agreement. Read it before using these weights; it includes restrictions on commercial use.
Citation
The underlying model is Meta's SAM 3:
@article{ravi2025sam3,
title = {SAM 3: Segment Anything with Concepts},
author = {Ravi, Nikhila and others},
journal = {arXiv preprint arXiv:2511.16719},
year = {2025}
}
If this ONNX export is useful to you, a star on the repo or a mention is appreciated.
Acknowledgments
- Meta AI for releasing SAM 3 and its weights
- HuggingFace for the transformers integration that made the architecture introspectable
- PyTorch team for torch.onnx.export (especially the legacy dynamo=False path, which is what got the decoder across the line)