# SAM3 Browser INT8: Quantized ONNX Models

INT8-quantized ONNX models for running SAM3 (Segment Anything Model 3) entirely in the browser via ONNX Runtime Web.

## Files

| File | Size | Description |
|------|------|-------------|
| `sam3_image_encoder.onnx` | 466 MB | ViT backbone that encodes the input image into feature maps |
| `sam3_language_encoder.onnx` | 387 MB | CLIP text encoder that converts text prompts into embeddings |
| `sam3_decoder.onnx` | 35 MB | DETR-style decoder that produces boxes, scores, and pixel masks |
| `clip_tokenizer.json` | 1.5 MB | CLIP BPE tokenizer vocabulary (encoder + merge table + byte encoder) |
| **Total** | **~889 MB** | |

## Tokenizer

`clip_tokenizer.json` contains the full CLIP BPE tokenizer data needed to tokenize text prompts for the language encoder. It includes:

- `encoder`: BPE token → integer ID mapping (~49,408 entries)
- `merges`: BPE merge rules (~48,894 pairs)
- `byte_encoder`: byte-to-unicode mapping for UTF-8 handling

This is extracted from OpenAI's CLIP `bpe_simple_vocab_16e6.txt.gz` and packaged as JSON for browser use. Load it at runtime (~500 KB gzipped by CDN) and use it to tokenize prompts into int64 token sequences of length 32, laid out as `[START=49406, ...tokens..., END=49407, 0, 0, ...]` (zero-padded).
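The padding layout above can be sketched in plain JavaScript. `buildClipTokens` is a hypothetical helper name, and `tokenIds` is assumed to already be the BPE-encoded prompt (i.e. the output of applying the `encoder`/`merges` data to the text):

```javascript
// Pack BPE token ids into the fixed-length int64 sequence expected by
// the language encoder: [START, ...tokens, END, 0, 0, ...], length 32.
const START = 49406; // CLIP start-of-text token
const END = 49407;   // CLIP end-of-text token
const CONTEXT = 32;  // sequence length expected by sam3_language_encoder.onnx

function buildClipTokens(tokenIds) {
  // Truncate so START + tokens + END fits in CONTEXT slots.
  const body = Array.from(tokenIds).slice(0, CONTEXT - 2);
  const out = new BigInt64Array(CONTEXT); // zero-filled, so padding is free
  out[0] = BigInt(START);
  body.forEach((id, i) => { out[i + 1] = BigInt(id); });
  out[body.length + 1] = BigInt(END);
  return out; // feed as an int64 tensor of shape [1, 32]
}
```

A `BigInt64Array` is used because ONNX Runtime Web represents int64 tensor data with BigInt values.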

## Quantization

Dynamic INT8 quantization via `onnxruntime.quantization.quantize_dynamic` with QUInt8 weights. The original FP32 models totaled ~3.5 GB.

Quality is preserved: the quantized pipeline scores 0.9495 on a test image (vs 0.9471 for FP32).

## Usage

These models are designed for browser inference. Load them with ONNX Runtime Web.
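A minimal loading sketch with onnxruntime-web. The `ort` module is passed in as a parameter rather than imported, so the sketch is easy to stub; the base URL is wherever you host the files, and the execution-provider list is an assumption (WebGPU preferred, WASM fallback):

```javascript
// File names as listed in the Files table above.
const MODEL_FILES = {
  imageEncoder: 'sam3_image_encoder.onnx',
  languageEncoder: 'sam3_language_encoder.onnx',
  decoder: 'sam3_decoder.onnx',
};

// Resolve each model file against a base URL (trailing slash optional).
function modelUrls(baseUrl) {
  const urls = {};
  for (const [key, file] of Object.entries(MODEL_FILES)) {
    urls[key] = baseUrl.replace(/\/?$/, '/') + file;
  }
  return urls;
}

// Create one InferenceSession per model; `ort` is the onnxruntime-web module.
async function createSessions(baseUrl, ort) {
  const urls = modelUrls(baseUrl);
  const entries = await Promise.all(
    Object.entries(urls).map(async ([key, url]) => [
      key,
      await ort.InferenceSession.create(url, {
        executionProviders: ['webgpu', 'wasm'], // try WebGPU, fall back to WASM
      }),
    ])
  );
  return Object.fromEntries(entries);
}
```

Given the file sizes, you will likely also want to cache the fetched model bytes (e.g. in the Cache API) so the ~889 MB download happens once.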

### Pipeline

1. **Tokenizer**: text prompt + `clip_tokenizer.json` → `int64[1, 32]` (CLIP BPE tokens)
2. **Image encoder**: input image (uint8, shape `[3, 1008, 1008]`) → 6 output tensors (vision positional encodings + backbone FPN features)
3. **Language encoder**: input tokens (int64, shape `[1, 32]`) → `text_attention_mask`, `text_memory`, `text_embeds`
4. **Decoder**: encoder outputs + prompt tensors → boxes, scores, masks
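Step 2 expects a planar uint8 tensor, while browser canvases hand you interleaved RGBA. A sketch of that conversion, assuming the image has already been resized to 1008×1008 on a canvas and read out with `getImageData` (the function is written over arbitrary dimensions so the layout is easy to verify):

```javascript
// Convert interleaved RGBA pixels (as returned by getImageData().data)
// into the planar uint8 [3, H, W] layout the image encoder expects.
// For SAM3, width = height = 1008.
function rgbaToCHW(rgba, width, height) {
  const plane = width * height;
  const out = new Uint8Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i] = rgba[i * 4];                 // R plane
    out[plane + i] = rgba[i * 4 + 1];     // G plane
    out[2 * plane + i] = rgba[i * 4 + 2]; // B plane; alpha is dropped
  }
  return out; // feed as a uint8 tensor of shape [3, height, width]
}
```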

### Decoder inputs

| Input | Type | Shape | Source |
|-------|------|-------|--------|
| `original_height` | int64 | scalar | Original image height |
| `original_width` | int64 | scalar | Original image width |
| `vision_pos_enc_2` | float32 | from encoder | Image encoder output |
| `backbone_fpn_0/1/2` | float32 | from encoder | Image encoder outputs |
| `language_mask` | float32 | from lang encoder | = `text_attention_mask` |
| `language_features` | float32 | from lang encoder | = `text_memory` |
| `box_coords` | float32 | `[1, 1, 4]` | Zeros for text-only prompting |
| `box_labels` | int64 | `[1, 1]` | Ones |
| `box_masks` | bool | `[1, 1]` | Ones |
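The constant prompt inputs for text-only prompting can be assembled as follows. This assumes onnxruntime-web's `new ort.Tensor(type, data, dims)` constructor; `textOnlyPromptFeeds` is a hypothetical helper, a sketch rather than the demo's actual code:

```javascript
// Build the decoder's size and box-prompt inputs for text-only prompting,
// using the constant values from the table above. `ort` is the
// onnxruntime-web module (injected so the function is easy to stub).
function textOnlyPromptFeeds(ort, height, width) {
  return {
    original_height: new ort.Tensor('int64', BigInt64Array.from([BigInt(height)]), []),
    original_width: new ort.Tensor('int64', BigInt64Array.from([BigInt(width)]), []),
    box_coords: new ort.Tensor('float32', new Float32Array(4), [1, 1, 4]), // zeros
    box_labels: new ort.Tensor('int64', BigInt64Array.from([1n]), [1, 1]), // ones
    box_masks: new ort.Tensor('bool', Uint8Array.from([1]), [1, 1]),       // ones (true)
  };
}
```

These feeds are then merged with the image-encoder and language-encoder outputs before calling the decoder session's `run`.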

## Performance

| Environment | Score | Total time |
|-------------|-------|------------|
| Python FP32 (CPU) | 0.9471 | 7.6 s |
| Python INT8 (CPU) | 0.9495 | 4.5 s |
| Browser WASM | 0.9402 | 94.7 s |
| Browser WebGPU | ~0.94 (est.) | ~6-18 s (est.) |

## Source

Original SAM3 ONNX models from `vietanhdev/segment-anything-3-onnx-models`, quantized for browser deployment.

## Live demo

rusen.ai/demos/segment-anything
