SAM 3: Text-Promptable Image Segmentation (ONNX)

ONNX export of Meta's SAM 3 image model, the text-promptable concept segmentation variant. Run open-vocabulary image segmentation from a text prompt ("seed", "cat", "yellow school bus") in any environment that has an ONNX runtime: Python, C++, Rust, or browsers via WebAssembly/WebGPU.

To my knowledge, this is the first public ONNX export of the SAM 3 image model with text prompts. Other community exports (e.g. onnx-community/sam3-tracker-ONNX) cover the tracker variant, which only accepts point/box prompts.

Why this exists

facebook/sam3 is published in PyTorch only. Running it in browsers, mobile, or any non-Python environment requires ONNX. As of this export's publication:

  • optimum-onnx does not have native SAM 3 support (the CLI fails with "Trying to export a sam3_video model, that is a custom or unsupported architecture").
  • Meta's facebookresearch/sam3 repository ships no ONNX export tooling.
  • The existing community ONNX export (vietanhdev/segment-anything-3-onnx-models) targets Python onnxruntime only and isn't compatible with transformers.js / onnxruntime-web.
  • SegmentLens (sam3.ai) paywalls text-prompt SAM 3 behind server-side cloud inference rather than shipping it in-browser.

This export was produced by hand-wrapping the three sub-modules of Sam3Model and calling torch.onnx.export directly on each. It was validated end-to-end against the original PyTorch model, with the same detection count and box locations on a held-out test image.

What's in this repo

sam3-text-onnx/
│
├─ fp32 reference (3.3 GB total): bit-equivalent to PyTorch
│  ├── vision_encoder.onnx               (6.2 MB graph)
│  ├── vision_encoder.onnx.data          (1.84 GB weights)
│  ├── text_encoder.onnx                 (3.0 MB graph)
│  ├── text_encoder.onnx.data            (1.35 GB weights)
│  └── decoder.onnx                      (96 MB, weights inline)
│
├─ int8 dynamic quantization (839 MB total): production default
│  ├── vision_encoder_int8.onnx          (473 MB)
│  ├── text_encoder_int8.onnx            (340 MB)
│  └── decoder_int8.onnx                 (26 MB)
│
├─ int4 MatMul quantization (654 MB total): browser/mobile
│  ├── vision_encoder_int4.onnx          (5.6 MB graph)
│  ├── vision_encoder_int4.onnx.data     (279 MB weights)
│  ├── text_encoder_int4.onnx            (2.7 MB graph)
│  ├── text_encoder_int4.onnx.data       (348 MB weights)
│  └── decoder_int4.onnx                 (19 MB)
│
├─ Export scripts (PyTorch → ONNX)
│  ├── export_sam3_vision.py             # produces vision_encoder.onnx
│  ├── export_sam3_text.py               # produces text_encoder.onnx
│  └── export_sam3_decoder.py            # produces decoder.onnx
│
├─ Quantization scripts (ONNX fp32 → smaller)
│  ├── quantize_sam3_fp16.py             # ⚠️ currently broken, see Known Caveats
│  ├── quantize_sam3_int8.py
│  └── quantize_sam3_int4.py
│
└─ Validation scripts
   ├── validate_sam3_e2e.py              # end-to-end pipeline test
   └── validate_all_variants.py          # compares fp32/int8/int4 side-by-side

Three precision variants, each independently usable. See "Which precision should I use?" for guidance.

Architecture

SAM 3 is structured so the vision encoder runs once per image and produces multi-scale FPN features. The text encoder runs once per prompt. The decoder consumes both and produces masks/boxes/scores. This means changing the prompt while keeping the same image only re-runs the cheap text + decoder path:

image ──► vision_encoder.onnx ────────────────┐
                                              ▼
text prompt ──► text_encoder.onnx ──► decoder.onnx ──► pred_masks
                                                       pred_boxes (xyxy normalized)
                                                       pred_logits (sigmoid → scores)

Usage: Python with onnxruntime

import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoTokenizer
from transformers.models.sam3.image_processing_sam3 import Sam3ImageProcessor

MODEL_ID = "facebook/sam3"  # for the preprocessors only; weights come from ONNX

# Preprocessors (still come from HF; they're tiny and have no ONNX equivalent)
image_processor = Sam3ImageProcessor.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load the three ONNX components
vision_sess = ort.InferenceSession("vision_encoder.onnx", providers=["CPUExecutionProvider"])
text_sess = ort.InferenceSession("text_encoder.onnx", providers=["CPUExecutionProvider"])
decoder_sess = ort.InferenceSession("decoder.onnx", providers=["CPUExecutionProvider"])

# Preprocess
image = Image.open("your-image.png").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="np")["pixel_values"]

encoded = tokenizer("seed", return_tensors="np", padding="max_length", max_length=32, truncation=True)
input_ids = encoded["input_ids"].astype(np.int64)
attention_mask = encoded["attention_mask"].astype(np.int64)

# 1. Vision encoder: produces 4 FPN feature maps + position encodings
v_out = vision_sess.run(None, {"pixel_values": pixel_values})
fpn_hidden_states = v_out[0:4]       # spatial scales 288, 144, 72, 36 at 1008x1008 input
fpn_position_encoding = v_out[4:8]

# 2. Text encoder: projects each of the 32 token positions for "seed" to a 256-dim feature
text_features = text_sess.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
})[0]

# 3. Decoder: uses the first 3 FPN levels, but only the level-2 position encoding
pred_masks, pred_boxes, pred_logits = decoder_sess.run(None, {
    "fpn_hidden_state_0": fpn_hidden_states[0],
    "fpn_hidden_state_1": fpn_hidden_states[1],
    "fpn_hidden_state_2": fpn_hidden_states[2],
    "fpn_position_encoding_2": fpn_position_encoding[2],
    "text_features": text_features,
    "attention_mask": attention_mask,
})

# Convert logits to scores in [0, 1]
scores = 1.0 / (1.0 + np.exp(-pred_logits))

# Filter by confidence
keep = scores[0] > 0.5
print(f"Detections: {keep.sum()}")
print(f"Boxes (xyxy normalized): {pred_boxes[0, keep]}")
print(f"Scores: {scores[0, keep]}")

A complete runnable version is in validate_sam3_e2e.py.
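
The raw decoder outputs still need a little post-processing before they can be drawn on the original image. The following is a minimal sketch, not code from this repo. It assumes the normalized boxes are relative to the full resized frame (so they scale directly by the original width and height) and that a mask is binarized by thresholding its value at 0. Both are assumptions to verify against validate_sam3_e2e.py, and `postprocess` is a hypothetical helper name.

```python
import numpy as np


def postprocess(pred_boxes, pred_logits, pred_masks, orig_w, orig_h, threshold=0.5):
    """Filter detections and map normalized xyxy boxes to pixel coordinates.

    Shapes follow the decoder contract: boxes [1, 200, 4], logits [1, 200],
    masks [1, 200, H, W]. Only batch element 0 is handled here.
    """
    logits = pred_logits[0].astype(np.float64)
    # Numerically stable sigmoid: only exponentiates non-positive numbers.
    e = np.exp(-np.abs(logits))
    scores = np.where(logits >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

    keep = scores > threshold
    order = np.argsort(-scores[keep])  # highest score first

    # Assumption: boxes are normalized to the whole image, x by width, y by height.
    scale = np.array([orig_w, orig_h, orig_w, orig_h], dtype=np.float64)
    boxes_px = (pred_boxes[0][keep] * scale)[order]

    # Assumption: mask values are logits, so 0 is the decision boundary.
    masks = (pred_masks[0][keep] > 0.0)[order]
    return boxes_px, scores[keep][order], masks
```

With the variables from the script above, this would be called as `postprocess(pred_boxes, pred_logits, pred_masks, image.width, image.height)`. The returned masks stay at the decoder's low resolution; resizing them to the original image is left out of the sketch.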

Usage: browser with onnxruntime-web

The same recipe in JavaScript. Note: transformers.js does not currently have a Sam3Model JS class for the image variant, so you call onnxruntime-web directly. You still need transformers.js for the tokenizer.

import { AutoTokenizer } from "@huggingface/transformers";
import * as ort from "onnxruntime-web";

const tokenizer = await AutoTokenizer.from_pretrained("facebook/sam3");

const visionSess = await ort.InferenceSession.create(
  "https://huggingface.co/danilobukvic/sam3-text-onnx/resolve/main/vision_encoder.onnx",
  { executionProviders: ["webgpu"] }
);
const textSess = await ort.InferenceSession.create(
  "https://huggingface.co/danilobukvic/sam3-text-onnx/resolve/main/text_encoder.onnx",
  { executionProviders: ["webgpu"] }
);
const decoderSess = await ort.InferenceSession.create(
  "https://huggingface.co/danilobukvic/sam3-text-onnx/resolve/main/decoder.onnx",
  { executionProviders: ["webgpu"] }
);

// Preprocess image to [1, 3, 1008, 1008] Float32Array with ImageNet normalization.
// preprocessImage is a function you have to write yourself: see Sam3ImageProcessor
// for the exact mean/std values and port them.
const pixelValues = preprocessImage(imageBitmap);

// Tokenize prompt to length 32
const { input_ids, attention_mask } = await tokenizer("seed", {
  padding: "max_length",
  max_length: 32,
  truncation: true,
});

// Run the pipeline (same as Python above) ...

Input/output contracts

vision_encoder.onnx

| I/O | Name | Shape | Type | Notes |
| --- | --- | --- | --- | --- |
| In | pixel_values | [batch, 3, 1008, 1008] | float32 | ImageNet-normalized. Size is fixed: SAM 3 precomputes positional embeddings for 1008×1008. |
| Out | fpn_hidden_state_0..3 | [batch, 256, H, W] | float32 | Spatial scales: 288, 144, 72, 36 (for 1008 input) |
| Out | fpn_position_encoding_0..3 | [batch, 256, H, W] | float32 | Matching position encodings |

text_encoder.onnx

| I/O | Name | Shape | Type | Notes |
| --- | --- | --- | --- | --- |
| In | input_ids | [batch, 32] | int64 | CLIP-style tokens; SAM 3 uses max_position_embeddings=32 (shorter than standard CLIP's 77) |
| In | attention_mask | [batch, 32] | int64 | 1 for real tokens, 0 for padding |
| Out | text_features | [batch, 32, 256] | float32 | Projected from CLIP's 1024-dim to SAM 3's 256-dim DETR space |

decoder.onnx

| I/O | Name | Shape | Type | Notes |
| --- | --- | --- | --- | --- |
| In | fpn_hidden_state_0 | [batch, 256, 288, 288] | float32 | From vision encoder (largest scale) |
| In | fpn_hidden_state_1 | [batch, 256, 144, 144] | float32 | From vision encoder |
| In | fpn_hidden_state_2 | [batch, 256, 72, 72] | float32 | From vision encoder (smallest of the three scales the decoder takes; the 36×36 level is not an input) |
| In | fpn_position_encoding_2 | [batch, 256, 72, 72] | float32 | Only the 72×72 level's PE is used (others were optimized out by the tracer) |
| In | text_features | [batch, 32, 256] | float32 | From text encoder |
| In | attention_mask | [batch, 32] | int64 | Same mask passed to text encoder |
| Out | pred_masks | [batch, 200, 288, 288] | float32 | 200 query slots, each a low-resolution mask |
| Out | pred_boxes | [batch, 200, 4] | float32 | Boxes in xyxy format, normalized to [0, 1] |
| Out | pred_logits | [batch, 200] | float32 | Apply sigmoid for [0, 1] confidence scores |
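
The decoder contract can be turned into a cheap pre-flight check that catches a wrong feed before onnxruntime reports a less readable shape error. This is a sketch only: `DECODER_INPUT_SHAPES` and `check_feed` are hypothetical names, the shapes are copied from the decoder table above, and `None` stands for the dynamic batch dimension.

```python
DECODER_INPUT_SHAPES = {
    "fpn_hidden_state_0": (None, 256, 288, 288),
    "fpn_hidden_state_1": (None, 256, 144, 144),
    "fpn_hidden_state_2": (None, 256, 72, 72),
    "fpn_position_encoding_2": (None, 256, 72, 72),
    "text_features": (None, 32, 256),
    "attention_mask": (None, 32),
}


def check_feed(feed_shapes, expected=DECODER_INPUT_SHAPES):
    """Return a sorted list of problems; an empty list means the feed fits.

    feed_shapes maps input name -> shape tuple,
    e.g. {name: array.shape for name, array in feed.items()}.
    """
    problems = []
    for name in expected.keys() - feed_shapes.keys():
        problems.append(f"missing input: {name}")
    for name in feed_shapes.keys() - expected.keys():
        problems.append(f"unexpected input: {name}")
    for name in expected.keys() & feed_shapes.keys():
        want, got = expected[name], tuple(feed_shapes[name])
        # None in the expected shape matches any size (dynamic dimension).
        fits = len(want) == len(got) and all(
            w is None or w == g for w, g in zip(want, got)
        )
        if not fits:
            problems.append(f"{name}: expected {want}, got {got}")
    return sorted(problems)
```

Passing the fourth FPN level (36×36) or a position encoding other than level 2 shows up here as an "unexpected input" or a shape mismatch.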

Which precision should I use?

Three precision variants are published. They share the same input/output contracts above; pick whichever matches your constraints.

| Need | Use | Why |
| --- | --- | --- |
| Reference accuracy / research baseline | fp32 | Bit-exact match against PyTorch. Reproducible numbers. |
| Production server inference (CPU or GPU > 6 GB VRAM) | int8 | 4× smaller than fp32, same top detections, top score within 0.001 of baseline. AVX-512 VNNI and CUDA both accelerate int8 ops natively. |
| Browser / mobile / GPU with ≤ 4 GB VRAM | int4 | 5× smaller than fp32, keeps the right top-N detections, modest quality loss in low-confidence range. |
| GPU inference on a 4 GB consumer card | int8 or int4 (not fp32) | fp32 exhausts VRAM via intermediate Softmax buffers (~1.7 GB). int8 and int4 both fit. |
| Fastest CPU inference | int8 | Modern CPUs accelerate int8 GEMMs via AVX-VNNI; ~3× faster than fp32 on CPU. |

Default recommendation: int8. It's the same quality as fp32 in practice but a quarter of the size and several times faster. Only use fp32 if you need bit-exact reproducibility for academic work. Only use int4 if you're shipping to a context where every megabyte matters (browsers, mobile, embedded).

Validation evidence

On a microscope-style seed image with prompt "seed" (the test image used during development):

| Variant | Detections > 0.5 | Max score | Top 5 scores |
| --- | --- | --- | --- |
| fp32 (baseline) | 12 | 0.926 | 0.926, 0.889, 0.878, 0.874, 0.868 |
| int8 | 12 | 0.925 | 0.925, 0.905, 0.902, 0.898, 0.872 |
| int4 | 12 | 0.898 | 0.898, 0.863, 0.844, 0.797, 0.784 |

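int8 is essentially indistinguishable from fp32 in what it detects: same detection count, top score within 0.001, and the other top-5 scores within about 0.025. int4 keeps the right top detections, but its top score is ~3% lower and the gap widens to ~10% by the fifth-ranked score, with more noise in the low-confidence range.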
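
The per-rank gaps can be read straight off the table. The sketch below just redoes that arithmetic on the published top-5 scores. Note the assumption that rank i in one variant corresponds to rank i in another, which the table alone does not guarantee; `rank_deltas` is a hypothetical helper name.

```python
# Top-5 scores copied from the validation table above.
fp32 = [0.926, 0.889, 0.878, 0.874, 0.868]
int8 = [0.925, 0.905, 0.902, 0.898, 0.872]
int4 = [0.898, 0.863, 0.844, 0.797, 0.784]


def rank_deltas(baseline, variant):
    """Per-rank (absolute, relative) score difference, variant minus baseline."""
    return [
        (round(v - b, 3), round((v - b) / b, 3))
        for b, v in zip(baseline, variant)
    ]


for name, scores in (("int8", int8), ("int4", int4)):
    print(name, rank_deltas(fp32, scores))
```

To compare matched detections rather than ranks, pair boxes across variants by overlap first; validate_all_variants.py is the place to look for how the repo does its own comparison.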

Performance

Per-component timings for a single 1008×1008 image. Measured on a laptop with a quad-core Intel CPU and an RTX 3050 Laptop (4 GB VRAM):

| Variant | Vision (CPU) | Vision (GPU) | Text | Decoder (CPU) | Total (best path) |
| --- | --- | --- | --- | --- | --- |
| fp32 | 181 s | OOM on 4 GB | 7 s | 13 s | 201 s (CPU only) |
| int8 | 53 s | 27 s | 1 s | 14 s | 42 s (hybrid) |
| int4 | 166 s | 12 s | 0.1 s | 16 s | 28 s (hybrid) |

The GPU+CPU hybrid pattern

On consumer GPUs with limited VRAM, the vision encoder benefits massively from GPU (it's a heavy ViT) but the decoder runs better on CPU because its attention layers need ~860 MB intermediate buffers that don't fit alongside the encoder in VRAM. The hybrid setup:

vsess = ort.InferenceSession("vision_encoder_int4.onnx",
                              providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
tsess = ort.InferenceSession("text_encoder_int4.onnx",
                              providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
dsess = ort.InferenceSession("decoder_int4.onnx",
                              providers=["CPUExecutionProvider"])   # ← decoder on CPU

With this split, the int4 vision encoder runs in ~12 seconds on an RTX 3050 (vs. 166 seconds CPU-only), a 14× speedup on the bottleneck. Total inference drops from ~3 minutes to ~30 seconds.

On a server-class GPU (16+ GB VRAM) all three can run on GPU and inference is sub-second.
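
The placement rule described in this section can be captured in one small function so the same script runs on a CPU-only machine, a 4 GB laptop GPU, or a large server GPU. This is a sketch: `providers_for` and its `small_vram` flag are hypothetical, and the decision of what counts as "small" is left to the caller.

```python
def providers_for(component, available, small_vram=True):
    """Pick an execution-provider list for 'vision', 'text' or 'decoder'.

    available is the list returned by onnxruntime.get_available_providers().
    """
    cpu = ["CPUExecutionProvider"]
    if "CUDAExecutionProvider" not in available:
        return cpu  # no GPU build or no GPU: everything on CPU
    if component == "decoder" and small_vram:
        return cpu  # keep the decoder's attention buffers out of VRAM
    return ["CUDAExecutionProvider"] + cpu  # GPU first, CPU as fallback


# Example wiring (requires onnxruntime and the model files):
# import onnxruntime as ort
# avail = ort.get_available_providers()
# dsess = ort.InferenceSession("decoder_int4.onnx",
#                              providers=providers_for("decoder", avail))
```

Setting `small_vram=False` corresponds to the server-class case, where all three components go to the GPU.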

Prompt-iteration architecture

The decoder uses pre-computed vision features as inputs, so swapping the prompt while keeping the image only re-runs the cheap text + decoder path (~15 s end-to-end, vs ~30+ seconds for a fresh image). This makes interactive use viable.
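
One way to exploit this is to cache the vision encoder output per image and only re-run the text encoder and decoder for each new prompt. The class below is a sketch of that pattern, not code from this repo. The three callables are hypothetical thin wrappers you would write around `vision_sess.run`, `text_sess.run` and `decoder_sess.run`.

```python
class PromptSession:
    """Caches vision features per image so new prompts skip the vision encoder."""

    def __init__(self, run_vision, run_text, run_decoder):
        self._run_vision = run_vision
        self._run_text = run_text
        self._run_decoder = run_decoder
        self._image_key = None
        self._vision_out = None

    def set_image(self, key, pixel_values):
        """Encode the image unless it is the one already cached under `key`."""
        if key != self._image_key:
            self._vision_out = self._run_vision(pixel_values)
            self._image_key = key

    def segment(self, prompt):
        """Run only the text encoder and decoder against the cached features."""
        if self._vision_out is None:
            raise RuntimeError("call set_image() before segment()")
        text_out = self._run_text(prompt)
        return self._run_decoder(self._vision_out, text_out)
```

After one `set_image(...)`, calling `segment("seed")` and then `segment("cat")` pays the vision cost once and the text + decoder cost twice.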

Validation

The fp32 ONNX export was compared end-to-end against the original PyTorch Sam3Model.forward() on a microscope-style seed image with prompt "seed". Detection count, top scores, and box locations matched to within tracing noise.

The quantized variants (int8, int4) were then compared against the fp32 ONNX baseline; see the table under Validation evidence for results.

To reproduce on your own image:

python validate_sam3_e2e.py path/to/image.png "seed"             # single variant
python validate_all_variants.py path/to/image.png "seed"         # all 3 side-by-side

Quantization results

Three precision variants are published in this repo. Actual measured sizes:

| Component | fp32 | int8 (dynamic) | int4 (MatMul block) |
| --- | --- | --- | --- |
| vision_encoder | 1.84 GB | 473 MB (3.7×) | 285 MB (6.5×) |
| text_encoder | 1.35 GB | 340 MB (4.0×) | 350 MB (3.9×) |
| decoder | 96 MB | 26 MB (3.6×) | 19 MB (4.9×) |
| Total | 3.3 GB | 839 MB (4.0×) | 654 MB (5.0×) |

Notes on each variant

int8: quantize_sam3_int8.py uses onnxruntime.quantization.quantize_dynamic with QuantType.QInt8. Per-tensor symmetric quantization of MatMul/Gemm weights. No calibration data needed. Outputs are not bit-identical to fp32, but on the test image they are equivalent in practice: same detection count and a top score within 0.001.

int4: quantize_sam3_int4.py uses MatMulNBitsQuantizer with block_size=128, bits=4, is_symmetric=True. Only quantizes MatMul ops; other ops stay fp32. This is the same scheme HuggingFace uses with dtype: "q4" in transformers.js, and what onnx-community/sam3-tracker-ONNX uses for its vision encoder.

Note on text encoder size: int4 only saves a small amount over int8 for the text encoder because most of its size is in the CLIP token embedding table (a lookup matrix, not a MatMul). The block-wise int4 scheme targets MatMul weights specifically, so embeddings stay fp32 in both.
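
For orientation, the int8 step described above can be sketched in a few lines. This is not the contents of quantize_sam3_int8.py. `int8_name` and `quantize_all` are hypothetical helpers, the output naming is inferred from the file listing, and only the `quantize_dynamic` call with `QuantType.QInt8` comes from the description above.

```python
from pathlib import Path

COMPONENTS = ("vision_encoder", "text_encoder", "decoder")


def int8_name(fp32_path):
    """Map vision_encoder.onnx to vision_encoder_int8.onnx in the same directory."""
    p = Path(fp32_path)
    return str(p.with_name(f"{p.stem}_int8{p.suffix}"))


def quantize_all(model_dir="."):
    """Dynamically quantize the three fp32 graphs found in model_dir."""
    # Imported lazily so int8_name() works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    for name in COMPONENTS:
        src = str(Path(model_dir) / f"{name}.onnx")
        # Weights become int8; activations are quantized at run time,
        # which is why no calibration data is required.
        quantize_dynamic(src, int8_name(src), weight_type=QuantType.QInt8)
```

Calling `quantize_all()` in the directory holding the fp32 files (including their .onnx.data companions) would write the three *_int8.onnx files next to them. The int4 path uses a different quantizer and is not shown here.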

fp16 status

A fully-working fp16 variant is not currently published. onnxconverter_common.float16.convert_float_to_float16 produces type-mismatch errors when applied to the decoder: PyTorch's traced Cast ops create incompatible type boundaries that the converter doesn't propagate cleanly. Blocking Cast/CastLike from conversion fixes the vision encoder, but the decoder has additional Mul ops with the same problem.

For practical purposes, int8 dominates fp16 anyway: int8 is half the size with equivalent quality. fp16 has no real use case here. See quantize_sam3_fp16.py for the current (partially working) attempt if you want to try fixing it.

Known caveats and TODOs

  • Fixed input size: The vision encoder's positional embeddings are precomputed for 1008×1008 input. Other sizes will produce shape mismatches. Use the SAM 3 image processor to handle resizing/padding.
  • Geometry prompts not supported: This export covers the text-only path. SAM 3's optional box/point prompts are not wired in; supporting them would need a separate decoder variant.
  • Batch dimension is dynamic but untested: The export uses a symbolic batch dim, but I've only validated batch=1. Higher batch sizes should work, but no guarantees.
  • fp16 export is broken: See fp16 status. int8 is the recommended reduced-precision alternative.
  • fp32 on 4 GB VRAM: The fp32 vision encoder's Softmax allocates a ~1.7 GB intermediate buffer; on consumer GPUs this exhausts VRAM. Use int8/int4 for GPU, or fp32 on CPU only.
  • Tracer warnings during export: A few TracerWarning: Converting a tensor to a Python boolean... were emitted. These bake config flags (e.g. is_causal, attention backend selection) into the graph at export time. Fine for inference with the same model config, but means the ONNX isn't reusable across configs.
  • No transformers.js integration: At time of writing, transformers.js only has Sam3TrackerModel (point/box) but not Sam3Model (text-promptable image). Until that lands, use onnxruntime-web directly.
  • Tracer optimized out unused inputs: The decoder ONNX accepts only fpn_position_encoding_2 (not _0 or _1) because those weren't actually used in the traced forward path. Inspect decoder_sess.get_inputs() for the canonical input names.
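
Because the tracer decides which inputs survive, it is safer to build the decoder feed from the names the graph actually declares than to hard-code them. A minimal sketch follows. `build_decoder_feed` is a hypothetical helper, and it assumes the vision encoder's output names match the decoder's input names, as the contract tables suggest.

```python
def build_decoder_feed(decoder_input_names, vision_outputs, text_features, attention_mask):
    """Assemble the decoder feed from whatever inputs the graph declares.

    decoder_input_names: e.g. [i.name for i in decoder_sess.get_inputs()]
    vision_outputs: dict of vision output name -> array, e.g.
        dict(zip((o.name for o in vision_sess.get_outputs()), v_out))
    """
    extras = {"text_features": text_features, "attention_mask": attention_mask}
    feed = {}
    for name in decoder_input_names:
        if name in extras:
            feed[name] = extras[name]
        elif name in vision_outputs:
            feed[name] = vision_outputs[name]
        else:
            raise KeyError(f"decoder expects '{name}', which no upstream stage produces")
    return feed
```

Vision outputs the decoder does not declare (the fourth FPN level, the level 0 and 1 position encodings) are simply left out, so a re-export that prunes or keeps different inputs needs no change on the calling side.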

How this was built

Three Python scripts, one per sub-module, each calling torch.onnx.export on a thin wrapper around the SAM 3 component:

  1. export_sam3_vision.py: wraps Sam3VisionModel, flattens Sam3VisionEncoderOutput into a tuple of tensors. Uses the default (dynamo) exporter.
  2. export_sam3_text.py: wraps CLIPTextModelWithProjection + the text_projection Linear layer. Key gotcha: SAM 3's text config has max_position_embeddings=32 (shorter than standard CLIP's 77).
  3. export_sam3_decoder.py: wraps detr_encoder + detr_decoder + mask_decoder + dot_product_scoring + box head as a single pipeline. Uses dynamo=False (legacy tracer) because the dynamo exporter crashes on the attention reshape-after-transpose pattern (Cannot view a tensor with shape (1, 201, 8, 32) ... as a tensor with shape (s2, 201, 256)).

Quantization scripts (added after the initial export):

  • quantize_sam3_int8.py: onnxruntime.quantization.quantize_dynamic with QInt8 weight type. Works on all three components out of the box.
  • quantize_sam3_int4.py: MatMulNBitsQuantizer from onnxruntime.quantization.matmul_nbits_quantizer. block_size=128, bits=4, symmetric. Only quantizes MatMul ops; other ops stay fp32.
  • quantize_sam3_fp16.py: uses onnxconverter_common.float16.convert_float_to_float16 with op_block_list=["Cast", "CastLike"]. Fixes the vision encoder but the decoder still has type mismatches at Mul ops; partial fix only.

See validate_sam3_e2e.py and validate_all_variants.py for the end-to-end pipeline tests.

Reproducing this export

If you want to redo the export from scratch (e.g. to target a different opset, change the input size, or add geometry prompts), here's the full sequence.

Prerequisites

You need transformers v5 (which is in pre-release as of writing) and a build of optimum-onnx from source. The stable optimum releases require transformers v4 and don't know about SAM 3. Install from source with strict dependency control:

py -3.11 -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip

# Clone transformers from source (v5 is required for the Sam3Model class)
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install .[torch]
cd ..

# Upgrade huggingface_hub to v1+ (transformers v5 requires it; --no-deps avoids
# pulling old optimum back down)
pip install --upgrade --no-deps "huggingface_hub>=1.0"

# Install optimum-onnx from source with --no-deps to preserve transformers v5
pip install git+https://github.com/huggingface/optimum-onnx.git --no-deps

# Remaining deps
pip install onnx onnxruntime onnxscript onnx_ir pillow torchvision opencv-python --no-cache-dir

Verify the install:

python -c "from transformers import Sam3Model, Sam3Processor; print('SAM3 OK')"
python -c "from optimum.exporters.onnx import main_export; print('optimum OK')"

You'll also need a HuggingFace token to download facebook/sam3:

$env:HF_TOKEN = "hf_xxx"

Run the export

# fp32 export (~15 min total on CPU)
python export_sam3_vision.py          # ~5-10 min, produces vision_encoder.onnx + .data
python export_sam3_text.py            # ~2 min, produces text_encoder.onnx + .data
python export_sam3_decoder.py         # ~3 min, produces decoder.onnx

# Validate against PyTorch (~3 min, only do once)
python validate_sam3_e2e.py path/to/test-image.png "your prompt"

Quantize (optional but recommended)

python quantize_sam3_int8.py          # ~2 min, produces *_int8.onnx
python quantize_sam3_int4.py          # ~3 min, produces *_int4.onnx
python validate_all_variants.py path/to/test-image.png "your prompt"

Hardware notes

  • CPU is sufficient for both export and inference. GPU helps for fast iteration during inference but isn't needed for the export itself.
  • At least 8 GB system RAM is needed during the vision encoder export: the model is ~3.4 GB and the tracer holds intermediate state.
  • onnxruntime-gpu requires CUDA 12.x runtime libraries. If you have an NVIDIA driver supporting CUDA 12.x but no CUDA toolkit installed, pip install nvidia-cudnn-cu12 nvidia-cublas-cu12 nvidia-cuda-runtime-cu12 nvidia-cufft-cu12 brings the needed DLLs.

License

This export inherits the license of the original facebook/sam3 weights: the SAM 3 License Agreement. Read it before using these weights; it includes restrictions on commercial use.

Citation

The underlying model is Meta's SAM 3:

@article{ravi2025sam3,
  title   = {SAM 3: Segment Anything with Concepts},
  author  = {Ravi, Nikhila and others},
  journal = {arXiv preprint arXiv:2511.16719},
  year    = {2025}
}

If this ONNX export is useful to you, a star on the repo or a mention is appreciated.

Acknowledgments

  • Meta AI for releasing SAM 3 and its weights
  • HuggingFace for the transformers integration that made the architecture introspectable
  • PyTorch team for torch.onnx.export (especially the legacy dynamo=False path, which is what got the decoder across the line)