This is a copy of sam3_v2.zip from rectlabel/segment-anything-onnx-models, made for GAME-Map2Mask-exploration.html and unzipped so that each file can be downloaded directly. Original export by RectLabel.
SAM3 v2 ONNX (quantized uint8)
Text-prompted image segmentation. Give it an image and a word such as "road", and it returns a binary mask. No box prompts or point clicks are needed; the decoder's box input is simply set to the full image.
Quantized with onnxruntime.quantization (QUInt8, per-tensor, dynamic). Exported using sam3-cpp-macos/export_v2.py at 1008x1008 resolution.
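A quantization step with those settings could look roughly like the sketch below. It is not the script used for this export: the directory layout and the helper names `plan_jobs` and `quantize_all` are hypothetical, and only the `quantize_dynamic` call with `QuantType.QUInt8` and `per_channel=False` reflects the settings named above.

```python
from pathlib import Path

# File names come from the Files table; the directories are hypothetical.
MODEL_FILES = ["vision-encoder.onnx", "text-encoder.onnx", "decoder.onnx"]


def plan_jobs(src_dir, dst_dir):
    """Pair each float model path with the path of its quantized copy."""
    return [(Path(src_dir) / name, Path(dst_dir) / name) for name in MODEL_FILES]


def quantize_all(src_dir, dst_dir):
    """Dynamically quantize weights to uint8, one scale per tensor."""
    # Imported here so plan_jobs stays usable without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_jobs(src_dir, dst_dir):
        quantize_dynamic(
            model_input=str(src),
            model_output=str(dst),
            weight_type=QuantType.QUInt8,  # "QUInt8"
            per_channel=False,             # "per-tensor"
        )


# Example call (assumes float exports already exist in ./fp32):
# quantize_all("fp32", "uint8")
```

Dynamic quantization stores weights as uint8 and computes activation scales at run time, so no calibration dataset is required.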
Files
| File | Size | What it does |
|---|---|---|
| vision-encoder.onnx | 510 MB | Image in, FPN features out |
| text-encoder.onnx | 339 MB | Token IDs in, text features out |
| decoder.onnx | 35 MB | Features + text + box in, masks out |
| tokenizer.json | 3.5 MB | CLIP tokenizer vocab |
Total: ~887 MB
How to use (Python, onnxruntime)
import numpy as np
import onnxruntime as ort
from PIL import Image
# Load sessions
vis = ort.InferenceSession("vision-encoder.onnx")
txt = ort.InferenceSession("text-encoder.onnx")
dec = ort.InferenceSession("decoder.onnx")
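# Preprocess image: stretch to 1008x1008, ImageNet mean/std on the 0-255 scale
img = np.array(Image.open("photo.jpg").convert("RGB").resize((1008, 1008)), dtype=np.float32)
mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
img = ((img - mean) / std).transpose(2, 0, 1)[None]  # HWC -> [1, 3, 1008, 1008]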
# Tokenize (CLIP-style, "road" = [49406, 1759, 49407])
ids = np.zeros((1, 32), dtype=np.int64)
mask = np.zeros((1, 32), dtype=np.int64)
ids[0, :3] = [49406, 1759, 49407] # BOS, road, EOS
mask[0, :3] = 1
# Run pipeline
vis_out = vis.run(None, {"images": img.astype(np.float32)})
vis_map = dict(zip([o.name for o in vis.get_outputs()], vis_out))
txt_out = txt.run(None, {"input_ids": ids, "attention_mask": mask})
txt_map = dict(zip([o.name for o in txt.get_outputs()], txt_out))
dec_out = dec.run(None, {
    **{k: vis_map[k] for k in ["fpn_feat_0", "fpn_feat_1", "fpn_feat_2", "fpn_pos_2"]},
    "text_features": txt_map["text_features"],
    "text_mask": txt_map["text_mask"],
    "input_boxes": np.array([[[0, 0, 1, 1]]], dtype=np.float32),  # full-image box
    "input_boxes_labels": np.array([[1]], dtype=np.int64),
})
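# Best mask: take the candidate with the highest pred_logits score, threshold at 0
logits = dec_out[2][0]  # pred_logits for the first image
binary_mask = dec_out[0][0, logits.argmax()] > 0  # pred_masks[0, best] as a boolean mask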
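The token IDs above are hard-coded for "road". For other prompts, one option is to load tokenizer.json with the Hugging Face `tokenizers` package and pad the result to the 32-token context. The sketch below is an assumption about how that could be done: `pad_tokens` and `encode_prompt` are hypothetical helper names, and it assumes the tokenizer's post-processor adds the BOS (49406) and EOS (49407) tokens itself, which is worth checking against the "road" example.

```python
import numpy as np

CONTEXT_LEN = 32  # matches input_ids[batch, 32]


def pad_tokens(token_ids, context_len=CONTEXT_LEN):
    """Right-pad token IDs with zeros; return (input_ids, attention_mask)."""
    if len(token_ids) > context_len:
        raise ValueError(f"prompt has {len(token_ids)} tokens, limit is {context_len}")
    ids = np.zeros((1, context_len), dtype=np.int64)
    mask = np.zeros((1, context_len), dtype=np.int64)
    ids[0, :len(token_ids)] = token_ids
    mask[0, :len(token_ids)] = 1
    return ids, mask


def encode_prompt(text, tokenizer_path="tokenizer.json"):
    """Tokenize a text prompt and pad it for text-encoder.onnx."""
    # Imported here so pad_tokens works without the tokenizers package.
    from tokenizers import Tokenizer

    tok = Tokenizer.from_file(tokenizer_path)
    # Assumption: encode() already wraps the prompt in BOS/EOS.
    return pad_tokens(tok.encode(text).ids)


# Same arrays as the hard-coded example above:
ids, mask = pad_tokens([49406, 1759, 49407])
```

If `encode_prompt("road")` does not return `[49406, 1759, 49407]` in the first three positions, the BOS/EOS assumption does not hold for this tokenizer.json and the special tokens need to be added by hand.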
How to use (browser, onnxruntime-web)
Works with the onnxruntime-web WASM and WebGPU backends. Load each .onnx file via ort.InferenceSession.create(url). The pipeline is the same as above, with ort.Tensor in place of numpy arrays. Token IDs and the attention mask are int64, so they must be passed as BigInt64Array.
Model I/O shapes
vision-encoder.onnx
- Input: images [batch, 3, 1008, 1008] float32
- Output: fpn_feat_0 [batch, 256, 288, 288], fpn_feat_1 [batch, 256, 144, 144], fpn_feat_2 [batch, 256, 72, 72], fpn_pos_2 [batch, 256, 72, 72]
text-encoder.onnx
- Input: input_ids [batch, 32] int64, attention_mask [batch, 32] int64
- Output: text_features [batch, 32, 256] float32, text_mask [batch, 32] bool
decoder.onnx
- Input: all vision outputs + text outputs + input_boxes [batch, N, 4] float32 + input_boxes_labels [batch, N] int64
- Output: pred_masks [batch, M, H, W] float32, pred_boxes [batch, 200, 4], pred_logits [batch, M], presence_logits [batch, M]
Use input_boxes = [[[0,0,1,1]]] with input_boxes_labels = [[1]] for full-image text-prompted segmentation.
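Turning the decoder outputs listed above into a mask at the source image's resolution can be sketched as follows. This is an illustration, not part of the export: `best_mask` is a hypothetical helper, it uses a plain NumPy nearest-neighbour lookup in place of a proper image resize, and it assumes the image was stretched to 1008x1008 without padding, so the mask can be stretched back the same way.

```python
import numpy as np


def best_mask(pred_masks, pred_logits, height, width, threshold=0.0):
    """Pick the highest-scoring mask and resize it to (height, width).

    pred_masks:  float array [batch, M, H, W]
    pred_logits: float array [batch, M]
    Returns a boolean array of shape [height, width] for the first image.
    """
    best = int(np.argmax(pred_logits[0]))
    binary = pred_masks[0, best] > threshold  # [H, W] boolean

    # Nearest-neighbour resize: map each output pixel to a source pixel.
    src_h, src_w = binary.shape
    rows = np.arange(height) * src_h // height
    cols = np.arange(width) * src_w // width
    return binary[rows][:, cols]


# Tiny synthetic example: two 2x2 candidates, the second one scores higher.
masks = np.array([[[[-1.0, -1.0], [-1.0, -1.0]],
                   [[1.0, -1.0], [-1.0, 1.0]]]], dtype=np.float32)
logits = np.array([[0.1, 0.9]], dtype=np.float32)
full = best_mask(masks, logits, height=4, width=4)
```

With real outputs, `height` and `width` would be the original image's dimensions, taken from the PIL image before it is resized to 1008x1008.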
Source
- Base model: facebook/sam3
- ONNX export script: ryouchinsa/sam3-cpp-macos
- Original ZIP: rectlabel/segment-anything-onnx-models (sam3_v2.zip)