Evo · SAM 2.1 (Hiera) ONNX — browser-ready, interactive refinement
Self-contained ONNX exports of Meta's SAM 2.1 image encoder + prompt/mask
decoder, packaged for onnxruntime-web (no .onnx_data external-data files,
which the browser runtime cannot mount).
One directory per model size (sam2.1_hiera_{tiny,small,base_plus,large}):
| file | contents |
|---|---|
| vision_encoder.onnx | image encoder, fp32 (merged self-contained from the onnx-community export) |
| vision_encoder_fp16.onnx | same, fp16 weights / fp32 I/O, for WebGPU |
| prompt_encoder_mask_decoder.onnx | prompt encoder + mask decoder, fp32, re-exported with the mask-refinement inputs |
| prompt_encoder_mask_decoder_fp16.onnx | same, fp16 weights / fp32 I/O, for WebGPU |
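The layout above can be turned into a small path helper. The following is a minimal sketch, not part of this repository: `model_files` and the backend strings `"webgpu"` / `"wasm"` are hypothetical names chosen for illustration, and the only facts it relies on are the directory and file names listed in the table.

```python
from pathlib import PurePosixPath

# Model sizes as they appear in the directory names sam2.1_hiera_{...}.
SIZES = ("tiny", "small", "base_plus", "large")


def model_files(size: str, backend: str) -> dict[str, str]:
    """Return repo-relative paths of the encoder and decoder for one size.

    Hypothetical helper: `backend` is a free-form string, where "webgpu"
    selects the fp16-weight files and anything else selects fp32.
    """
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    # The fp16-weight files are the ones the table marks as "for WebGPU".
    suffix = "_fp16" if backend == "webgpu" else ""
    root = PurePosixPath(f"sam2.1_hiera_{size}")
    return {
        "encoder": str(root / f"vision_encoder{suffix}.onnx"),
        "decoder": str(root / f"prompt_encoder_mask_decoder{suffix}.onnx"),
    }
```

For example, `model_files("tiny", "webgpu")["decoder"]` yields `sam2.1_hiera_tiny/prompt_encoder_mask_decoder_fp16.onnx`.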
Why a re-exported decoder?
The stock onnx-community decoder export drops the prompt encoder's mask path:
no input_masks / has_mask_input inputs and no single-mask token. SAM is
trained to be used iteratively — each refinement click feeds the previous
low-res logits back as a prior and reads the dedicated refinement token.
These decoders restore that contract:
- inputs: input_points [1,1,N,2] (1024-px space, float32), input_labels [1,1,N] (int64), image_embeddings.{0,1,2}, input_masks [1,1,256,256], has_mask_input [1]
- outputs: iou_scores [1,4], pred_masks [1,4,256,256] (logits, ±32), object_score_logits [1,1]
- pred_masks slots: 0 = single-mask/refinement output (official multimask_output=False path incl. stability fallback); 1-3 = multimask hypotheses.
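As an illustration of this contract, the sketch below assembles the prompt and mask inputs with NumPy for a cold decode and for a refinement decode, then picks an output slot. It is a sketch under stated assumptions rather than the reference pipeline: the helper names are hypothetical, the float32 dtype of input_masks and has_mask_input is assumed (only their shapes are given above), and click coordinates are assumed to map into the 1024-px space by scaling each axis independently. The image_embeddings.{0,1,2} tensors come from the encoder and are merged into the feed by the caller.

```python
import numpy as np

MODEL_SIDE = 1024  # prompt coordinate space named in the contract
MASK_SIDE = 256    # side of input_masks / pred_masks


def build_decoder_feeds(points_xy, labels, image_hw, prev_logits=None):
    """Build the non-embedding decoder inputs for one object (hypothetical helper)."""
    h, w = image_hw
    pts = np.asarray(points_xy, dtype=np.float64).reshape(-1, 2)
    # Assumption: x and y are scaled independently into the 1024-px square.
    pts = (pts * np.array([MODEL_SIDE / w, MODEL_SIDE / h])).astype(np.float32)
    lbl = np.asarray(labels, dtype=np.int64).reshape(-1)
    if len(lbl) != len(pts):
        raise ValueError("one label per point is required")
    if prev_logits is None:
        # Cold decode: no prior, so send a zero mask and flag it as absent.
        mask = np.zeros((1, 1, MASK_SIDE, MASK_SIDE), dtype=np.float32)
        has_mask = np.zeros((1,), dtype=np.float32)  # dtype is an assumption
    else:
        # Refinement: feed the previous low-res logits back as the prior.
        mask = np.asarray(prev_logits, dtype=np.float32).reshape(
            1, 1, MASK_SIDE, MASK_SIDE
        )
        has_mask = np.ones((1,), dtype=np.float32)
    return {
        "input_points": pts[None, None],  # [1,1,N,2]
        "input_labels": lbl[None, None],  # [1,1,N]
        "input_masks": mask,              # [1,1,256,256]
        "has_mask_input": has_mask,       # [1]
    }


def pick_mask(iou_scores, pred_masks, multimask):
    """Return (slot, logits[256,256]): slot 0, or the best-scoring of slots 1-3."""
    scores = np.asarray(iou_scores).reshape(4)
    masks = np.asarray(pred_masks).reshape(4, MASK_SIDE, MASK_SIDE)
    slot = 1 + int(np.argmax(scores[1:])) if multimask else 0
    return slot, masks[slot]
```

One possible usage pattern, again an assumption rather than something this card prescribes: call `pick_mask(..., multimask=True)` for a first, ambiguous click, then pass the chosen logits as `prev_logits` on later clicks and read slot 0.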
Exported from the official Meta checkpoints (2024-09-24 release) with
scripts/export_sam21_decoder_mask.py.
Logit parity against SAM2ImagePredictor was 0.0000 for cold, single-mask and
refinement decodes, verified for every size at export time. The encoders are
byte-identical migrations of the previously published merged files and take
no new inputs, so the decoder is a drop-in swap.
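A parity figure like this can be reproduced with a simple array comparison. The card does not say which metric the export script reports, so the sketch below assumes the maximum absolute difference between the two logit tensors; `max_abs_logit_diff` is a hypothetical helper, not a function from the export script.

```python
import numpy as np


def max_abs_logit_diff(reference, candidate):
    """Largest absolute element-wise gap between two logit arrays.

    Assumed parity metric: the export script's actual check is not
    documented here and may differ.
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    return float(np.max(np.abs(ref - cand)))
```

Here `reference` would be the low-res logits from the PyTorch predictor and `candidate` the pred_masks returned by the ONNX decoder for the same prompts.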
fp16 variants keep fp32 I/O (weights-only conversion, with Cast fixes for the label-cast nodes); intended for WebGPU. On WASM prefer fp32 — fp16 there is emulated and slower.