Segment Anything β€” Mobile Encoder + SAM Mask Decoder Bundle (ONNX)

ONNX bundle pairing the MobileSAM ViT-T image encoder with Meta's original SAM mask decoder. Together they form a complete Segment Anything pipeline at ~60 MB total β€” orders of magnitude smaller than the original SAM ViT-H setup (~2.4 GB) and largely matched in accuracy on common segmentation tasks.

Not converted locally β€” these are the upstream-published ONNX checkpoints, bundled here for distribution stability.

Credit: Meta (SAM) and Kyungpook National University / Chaoning Zhang et al. (MobileSAM).

What this repo contains

File Size Role
mobile_sam_image_encoder.onnx ~27 MB ViT-T (Tiny) image encoder; replaces SAM's ViT-H
sam_mask_decoder_single.onnx ~16 MB Mask decoder, single-mask output mode
sam_mask_decoder_multi.onnx ~16 MB Mask decoder, multi-mask output mode (returns top-3 masks per prompt)

The encoder runs once per image to produce a per-pixel embedding (1024-dim feature map). The mask decoder then runs cheaply per prompt (a click, box, or mask hint) against that embedding to produce segmentation masks. This split is what makes interactive segmentation tractable β€” only the decoder runs in the per-prompt loop.

How to use

import onnxruntime as ort
import numpy as np

# 1. Encode the image once
encoder = ort.InferenceSession("mobile_sam_image_encoder.onnx")
img_embedding = encoder.run(None, {"input_image": preprocessed_image})[0]
# img_embedding shape: [1, 256, 64, 64]

# 2. Decode masks per prompt (point click here)
decoder = ort.InferenceSession("sam_mask_decoder_single.onnx")
point_coords = np.array([[[500, 375]]], dtype=np.float32)  # one click
point_labels = np.array([[1]], dtype=np.float32)           # 1 = positive

masks, iou_predictions, low_res_masks = decoder.run(None, {
    "image_embeddings":  img_embedding,
    "point_coords":      point_coords,
    "point_labels":      point_labels,
    "mask_input":        np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input":    np.array([0], dtype=np.float32),
    "orig_im_size":      np.array([img_h, img_w], dtype=np.float32),
})

Use sam_mask_decoder_single.onnx for "give me the single best mask for this prompt" (typical interactive UX). Use sam_mask_decoder_multi.onnx for "give me top-3 candidate masks" β€” useful when the prompt is ambiguous (e.g., clicking on a person's shirt could yield mask of the shirt, the torso, or the whole person).

Why MobileSAM over original SAM

  • 38Γ— smaller encoder: ViT-T (10M params) vs SAM ViT-H (636M params).
  • ~5Γ— faster inference on CPU; near-real-time even without GPU.
  • Quality is comparable on common segmentation benchmarks β€” MobileSAM was distilled from SAM ViT-H specifically to preserve quality.
  • Choose original SAM ViT-H/L only when maximum accuracy on edge cases matters more than speed/size.

License

Apache-2.0 β€” same as both upstream projects. LICENSE file included.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Heliosoph/sam-onnx

Quantized
(2)
this model