Segment Anything — Mobile Encoder + SAM Mask Decoder Bundle (ONNX)

ONNX bundle pairing the MobileSAM ViT-T image encoder with Meta's original SAM mask decoder. Together they form a complete Segment Anything pipeline at ~60 MB total — orders of magnitude smaller than the original SAM ViT-H setup (~2.4 GB) and largely matched in accuracy on common segmentation tasks.

Not converted locally — these are the upstream-published ONNX checkpoints, bundled here for distribution stability.

Credit: Meta (SAM) and Kyungpook National University / Chaoning Zhang et al. (MobileSAM).

What this repo contains

File	Size	Role
`mobile_sam_image_encoder.onnx`	~27 MB	ViT-T (Tiny) image encoder; replaces SAM's ViT-H
`sam_mask_decoder_single.onnx`	~16 MB	Mask decoder, single-mask output mode
`sam_mask_decoder_multi.onnx`	~16 MB	Mask decoder, multi-mask output mode (returns top-3 masks per prompt)

The encoder runs once per image to produce a per-pixel embedding (1024-dim feature map). The mask decoder then runs cheaply per prompt (a click, box, or mask hint) against that embedding to produce segmentation masks. This split is what makes interactive segmentation tractable — only the decoder runs in the per-prompt loop.

How to use

import onnxruntime as ort
import numpy as np

# 1. Encode the image once
encoder = ort.InferenceSession("mobile_sam_image_encoder.onnx")
img_embedding = encoder.run(None, {"input_image": preprocessed_image})[0]
# img_embedding shape: [1, 256, 64, 64]

# 2. Decode masks per prompt (point click here)
decoder = ort.InferenceSession("sam_mask_decoder_single.onnx")
point_coords = np.array([[[500, 375]]], dtype=np.float32)  # one click
point_labels = np.array([[1]], dtype=np.float32)           # 1 = positive

masks, iou_predictions, low_res_masks = decoder.run(None, {
    "image_embeddings":  img_embedding,
    "point_coords":      point_coords,
    "point_labels":      point_labels,
    "mask_input":        np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input":    np.array([0], dtype=np.float32),
    "orig_im_size":      np.array([img_h, img_w], dtype=np.float32),
})

Use sam_mask_decoder_single.onnx for "give me the single best mask for this prompt" (typical interactive UX). Use sam_mask_decoder_multi.onnx for "give me top-3 candidate masks" — useful when the prompt is ambiguous (e.g., clicking on a person's shirt could yield mask of the shirt, the torso, or the whole person).

Why MobileSAM over original SAM

38× smaller encoder: ViT-T (~~10M params) vs SAM ViT-H (~~636M params).
~5× faster inference on CPU; near-real-time even without GPU.
Quality is comparable on common segmentation benchmarks — MobileSAM was distilled from SAM ViT-H specifically to preserve quality.
Choose original SAM ViT-H/L only when maximum accuracy on edge cases matters more than speed/size.

License

Apache-2.0 — same as both upstream projects. LICENSE file included.

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for Heliosoph/sam-onnx

Base model

facebook/sam-vit-base

Quantized

(2)

this model