This repository mirrors sam3_v2.zip from rectlabel/segment-anything-onnx-models for use by GAME-Map2Mask-exploration.html. The archive is unzipped so that each file can be downloaded directly. The original export is by RectLabel.

SAM3 v2 ONNX (quantized uint8)

Text-prompted image segmentation. Give it an image and a word like "road" and it returns a binary mask. No box prompts or point clicks are needed; the decoder is simply fed a single full-image box (see Model I/O shapes).

Quantized with onnxruntime.quantization (QUInt8, per-tensor, dynamic). Exported using sam3-cpp-macos/export_v2.py at 1008x1008 resolution.
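
As a rough illustration of what "QUInt8, per-tensor" means, the sketch below quantizes a small list of floats to uint8 using a single scale and zero point for the whole tensor, then maps the codes back. It is plain Python arithmetic, not the onnxruntime implementation; the function names are hypothetical, and ONNX Runtime's own range selection and rounding may differ in detail. "Dynamic" is not shown here: it refers to activation ranges being computed at inference time rather than calibrated ahead of time.

```python
def quantize_per_tensor_uint8(values):
    """Affine-quantize a flat list of floats to uint8 with one scale and zero point."""
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(-lo / scale))
    quantized = [min(255, max(0, int(round(v / scale)) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map uint8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # illustrative values, not real model weights
q, scale, zp = quantize_per_tensor_uint8(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a reconstruction error on the order of one quantization step (`scale`).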

Files

| File | Size | What it does |
| --- | --- | --- |
| vision-encoder.onnx | 510 MB | Image in, FPN features out |
| text-encoder.onnx | 339 MB | Token IDs in, text features out |
| decoder.onnx | 35 MB | Features + text + box in, masks out |
| tokenizer.json | 3.5 MB | CLIP tokenizer vocab |

Total: ~887 MB

How to use (Python, onnxruntime)

import numpy as np
import onnxruntime as ort
from PIL import Image

# Load sessions
vis = ort.InferenceSession("vision-encoder.onnx")
txt = ort.InferenceSession("text-encoder.onnx")
dec = ort.InferenceSession("decoder.onnx")

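# Preprocess image (1008x1008, ImageNet mean/std on the 0-255 scale, NCHW layout)
mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
img = np.asarray(Image.open("photo.jpg").convert("RGB").resize((1008, 1008)), dtype=np.float32)
img = np.ascontiguousarray(((img - mean) / std).transpose(2, 0, 1)[None])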

# Tokenize (CLIP-style, "road" = [49406, 1759, 49407])
ids = np.zeros((1, 32), dtype=np.int64)
mask = np.zeros((1, 32), dtype=np.int64)
ids[0, :3] = [49406, 1759, 49407]  # BOS, road, EOS
mask[0, :3] = 1

# Run pipeline
vis_out = vis.run(None, {"images": img.astype(np.float32)})
vis_map = dict(zip([o.name for o in vis.get_outputs()], vis_out))

txt_out = txt.run(None, {"input_ids": ids, "attention_mask": mask})
txt_map = dict(zip([o.name for o in txt.get_outputs()], txt_out))

dec_out = dec.run(None, {
    **{k: vis_map[k] for k in ["fpn_feat_0", "fpn_feat_1", "fpn_feat_2", "fpn_pos_2"]},
    "text_features": txt_map["text_features"],
    "text_mask": txt_map["text_mask"],
    "input_boxes": np.array([[[0, 0, 1, 1]]], dtype=np.float32),
    "input_boxes_labels": np.array([[1]], dtype=np.int64),
})

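# Best mask: take the query with the highest pred_logits score and threshold its mask logits at 0
pred_masks, pred_logits = dec_out[0], dec_out[2]
best = pred_logits[0].argmax()
binary_mask = pred_masks[0, best] > 0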
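
The example above hard-codes the token IDs for "road". A minimal sketch of the padding step for an arbitrary list of token IDs follows. `build_text_inputs` is a hypothetical helper, not part of this repository; the BOS/EOS values and the length of 32 are taken from the example and the I/O shapes, and the truncation rule (keep EOS as the last token) is an assumption.

```python
CONTEXT_LEN = 32          # sequence length of input_ids / attention_mask
BOS, EOS = 49406, 49407   # CLIP start/end token IDs used in the example above


def build_text_inputs(word_ids, context_len=CONTEXT_LEN):
    """Wrap token IDs in BOS/EOS, then zero-pad IDs and attention mask to context_len."""
    ids = [BOS, *word_ids, EOS]
    if len(ids) > context_len:
        # Assumption: when truncating, keep EOS as the final token.
        ids = ids[: context_len - 1] + [EOS]
    pad = context_len - len(ids)
    attention = [1] * len(ids) + [0] * pad
    return ids + [0] * pad, attention


ids_row, mask_row = build_text_inputs([1759])  # 1759 is the "road" ID from the example
```

The rows can then be turned into model inputs with `np.array([ids_row], dtype=np.int64)` and `np.array([mask_row], dtype=np.int64)`. IDs for other words would have to come from tokenizer.json, for instance via `Tokenizer.from_file` in the Hugging Face `tokenizers` package; whether that file already adds BOS/EOS itself depends on its post-processor, so check the encoded IDs before wrapping them again.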

How to use (browser, onnxruntime-web)

Works with onnxruntime-web on either the WASM or the WebGPU backend. Load each .onnx file with ort.InferenceSession.create(url). The pipeline is the same as above, with ort.Tensor in place of numpy arrays. int64 inputs (input_ids, attention_mask, input_boxes_labels) must be passed as BigInt64Array.

Model I/O shapes

vision-encoder.onnx

  • Input: images [batch, 3, 1008, 1008] float32
  • Output: fpn_feat_0 [batch, 256, 288, 288], fpn_feat_1 [batch, 256, 144, 144], fpn_feat_2 [batch, 256, 72, 72], fpn_pos_2 [batch, 256, 72, 72]

text-encoder.onnx

  • Input: input_ids [batch, 32] int64, attention_mask [batch, 32] int64
  • Output: text_features [batch, 32, 256] float32, text_mask [batch, 32] bool

decoder.onnx

  • Input: all vision outputs + text outputs + input_boxes [batch, N, 4] float32 + input_boxes_labels [batch, N] int64
  • Output: pred_masks [batch, M, H, W] float32, pred_boxes [batch, 200, 4], pred_logits [batch, M], presence_logits [batch, M]
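
A text prompt can match more than one region, so it may be useful to keep every query above a score threshold rather than only the top one. The sketch below does that on synthetic arrays shaped like the decoder outputs for a single image. It assumes that pred_logits are raw scores whose sigmoid can be read as a confidence; `select_masks` and the 0.5 threshold are illustrative choices, and presence_logits is not used here because its intended role is not described in this card.

```python
import numpy as np


def select_masks(pred_masks, pred_logits, score_threshold=0.5):
    """Return boolean masks and scores for queries whose sigmoid score passes the threshold.

    pred_masks: [M, H, W] mask logits for one image.
    pred_logits: [M] per-query score logits for the same image.
    """
    scores = 1.0 / (1.0 + np.exp(-pred_logits.astype(np.float64)))
    keep = np.flatnonzero(scores > score_threshold)
    # Order the kept queries from most to least confident.
    keep = keep[np.argsort(-scores[keep])]
    return pred_masks[keep] > 0, scores[keep]


# Synthetic stand-in for decoder outputs: 3 queries on a 4x4 grid.
demo_masks = np.stack([np.full((4, 4), v, dtype=np.float32) for v in (2.0, -1.0, 0.5)])
demo_logits = np.array([1.0, -3.0, 4.0], dtype=np.float32)
masks, scores = select_masks(demo_masks, demo_logits)
```

With real outputs from the Python example, the call would be `select_masks(dec_out[0][0], dec_out[2][0])`.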

Use input_boxes = [[[0,0,1,1]]] with input_boxes_labels = [[1]] for full-image text-prompted segmentation.
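
To confirm that a downloaded file matches the shapes listed in this section, the reported input shapes can be compared with the documented ones. This is a small sketch with hypothetical helper names; it treats the batch dimension as a wildcard, on the assumption that the export declares it as dynamic.

```python
def shape_matches(actual, expected):
    """Compare a reported shape with an expected one; None in `expected` matches any dim."""
    if len(actual) != len(expected):
        return False
    return all(e is None or a == e for a, e in zip(actual, expected))


# Expected input shapes copied from the lists above; None marks the batch dimension.
EXPECTED_INPUTS = {
    "vision-encoder.onnx": {"images": [None, 3, 1008, 1008]},
    "text-encoder.onnx": {
        "input_ids": [None, 32],
        "attention_mask": [None, 32],
    },
}


def check_inputs(reported, expected):
    """Return the names whose reported shape is missing or differs from the expected one."""
    return [
        name
        for name, shape in expected.items()
        if name not in reported or not shape_matches(reported[name], shape)
    ]
```

With onnxruntime, the `reported` mapping can be built as `{i.name: i.shape for i in ort.InferenceSession(path).get_inputs()}`; an empty list from `check_inputs` means every documented input was found with the expected shape.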

Source

Base model: facebook/sam3. This repository (Luminia/sam3-v2-onnx) is a quantized version of it.