# SAM3 Browser INT8: Quantized ONNX Models

INT8-quantized ONNX models for running SAM3 (Segment Anything Model 3) entirely in the browser via ONNX Runtime Web.

## Files

| File | Size | Description |
|------|------|-------------|
| `sam3_image_encoder.onnx` | 466 MB | ViT backbone that encodes the input image into feature maps |
| `sam3_language_encoder.onnx` | 387 MB | CLIP text encoder that converts text prompts into embeddings |
| `sam3_decoder.onnx` | 35 MB | DETR-style decoder that produces boxes, scores, and pixel masks |
| `clip_tokenizer.json` | 1.5 MB | CLIP BPE tokenizer vocabulary (encoder + merge table + byte encoder) |
| **Total** | **~889 MB** | |

## Tokenizer

`clip_tokenizer.json` contains the full CLIP BPE tokenizer data needed to tokenize text prompts for the language encoder. It includes:

- `encoder`: BPE token → integer ID mapping (~49,408 entries)
- `merges`: BPE merge rules (~48,894 pairs)
- `byte_encoder`: byte-to-unicode mapping for UTF-8 handling

This is extracted from OpenAI's CLIP `bpe_simple_vocab_16e6.txt.gz` and packaged as JSON for browser use. Load it at runtime (~500 KB gzipped by CDN) and use it to tokenize prompts into int64 token sequences of length 32, laid out as `[START=49406, ...tokens..., END=49407, 0, 0, ...]` (zero-padded).
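The padding layout above can be sketched in plain JavaScript. `buildClipTokens` is a hypothetical helper name, and `tokenIds` is assumed to already be the BPE-encoded prompt (i.e. the output of applying the `encoder`/`merges` data to the text):

```javascript
// Pack BPE token ids into the fixed-length int64 sequence expected by
// the language encoder: [START, ...tokens, END, 0, 0, ...], length 32.
const START = 49406; // CLIP start-of-text token
const END = 49407;   // CLIP end-of-text token
const CONTEXT = 32;  // sequence length expected by sam3_language_encoder.onnx

function buildClipTokens(tokenIds) {
  // Truncate so START + tokens + END fits in CONTEXT slots.
  const body = Array.from(tokenIds).slice(0, CONTEXT - 2);
  const out = new BigInt64Array(CONTEXT); // zero-filled, so padding is free
  out[0] = BigInt(START);
  body.forEach((id, i) => { out[i + 1] = BigInt(id); });
  out[body.length + 1] = BigInt(END);
  return out; // feed as an int64 tensor of shape [1, 32]
}
```

A `BigInt64Array` is used because ONNX Runtime Web represents int64 tensor data with BigInt values.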

## Quantization

Dynamic INT8 quantization via `onnxruntime.quantization.quantize_dynamic` with QUInt8 weights. The original FP32 models totaled ~3.5 GB.

Quality is preserved: the quantized pipeline scores 0.9495 on a test image (vs 0.9471 for FP32).

## Usage

These models are designed for browser inference. Load them with ONNX Runtime Web.
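A minimal loading sketch with onnxruntime-web. The `ort` module is passed in as a parameter rather than imported, so the sketch is easy to stub; the base URL is wherever you host the files, and the execution-provider list is an assumption (WebGPU preferred, WASM fallback):

```javascript
// File names as listed in the Files table above.
const MODEL_FILES = {
  imageEncoder: 'sam3_image_encoder.onnx',
  languageEncoder: 'sam3_language_encoder.onnx',
  decoder: 'sam3_decoder.onnx',
};

// Resolve each model file against a base URL (trailing slash optional).
function modelUrls(baseUrl) {
  const urls = {};
  for (const [key, file] of Object.entries(MODEL_FILES)) {
    urls[key] = baseUrl.replace(/\/?$/, '/') + file;
  }
  return urls;
}

// Create one InferenceSession per model; `ort` is the onnxruntime-web module.
async function createSessions(baseUrl, ort) {
  const urls = modelUrls(baseUrl);
  const entries = await Promise.all(
    Object.entries(urls).map(async ([key, url]) => [
      key,
      await ort.InferenceSession.create(url, {
        executionProviders: ['webgpu', 'wasm'], // try WebGPU, fall back to WASM
      }),
    ])
  );
  return Object.fromEntries(entries);
}
```

Given the file sizes, you will likely also want to cache the fetched model bytes (e.g. in the Cache API) so the ~889 MB download happens once.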

### Pipeline

1. **Tokenizer**: text prompt + `clip_tokenizer.json` → `int64[1, 32]` (CLIP BPE tokens)
2. **Image encoder**: input image (uint8, shape `[3, 1008, 1008]`) → 6 output tensors (vision positional encodings + backbone FPN features)
3. **Language encoder**: input tokens (int64, shape `[1, 32]`) → `text_attention_mask`, `text_memory`, `text_embeds`
4. **Decoder**: encoder outputs + prompt tensors → boxes, scores, masks
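Step 2 expects a planar uint8 tensor, while browser canvases hand you interleaved RGBA. A sketch of that conversion, assuming the image has already been resized to 1008×1008 on a canvas and read out with `getImageData` (the function is written over arbitrary dimensions so the layout is easy to verify):

```javascript
// Convert interleaved RGBA pixels (as returned by getImageData().data)
// into the planar uint8 [3, H, W] layout the image encoder expects.
// For SAM3, width = height = 1008.
function rgbaToCHW(rgba, width, height) {
  const plane = width * height;
  const out = new Uint8Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i] = rgba[i * 4];                 // R plane
    out[plane + i] = rgba[i * 4 + 1];     // G plane
    out[2 * plane + i] = rgba[i * 4 + 2]; // B plane; alpha is dropped
  }
  return out; // feed as a uint8 tensor of shape [3, height, width]
}
```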

### Decoder inputs

| Input | Type | Shape | Source |
|-------|------|-------|--------|
| `original_height` | int64 | scalar | Original image height |
| `original_width` | int64 | scalar | Original image width |
| `vision_pos_enc_2` | float32 | from encoder | Image encoder output |
| `backbone_fpn_0/1/2` | float32 | from encoder | Image encoder outputs |
| `language_mask` | float32 | from lang encoder | = `text_attention_mask` |
| `language_features` | float32 | from lang encoder | = `text_memory` |
| `box_coords` | float32 | `[1, 1, 4]` | Zeros for text-only prompting |
| `box_labels` | int64 | `[1, 1]` | Ones |
| `box_masks` | bool | `[1, 1]` | Ones |
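The constant prompt inputs for text-only prompting can be assembled as follows. This assumes onnxruntime-web's `new ort.Tensor(type, data, dims)` constructor; `textOnlyPromptFeeds` is a hypothetical helper, a sketch rather than the demo's actual code:

```javascript
// Build the decoder's size and box-prompt inputs for text-only prompting,
// using the constant values from the table above. `ort` is the
// onnxruntime-web module (injected so the function is easy to stub).
function textOnlyPromptFeeds(ort, height, width) {
  return {
    original_height: new ort.Tensor('int64', BigInt64Array.from([BigInt(height)]), []),
    original_width: new ort.Tensor('int64', BigInt64Array.from([BigInt(width)]), []),
    box_coords: new ort.Tensor('float32', new Float32Array(4), [1, 1, 4]), // zeros
    box_labels: new ort.Tensor('int64', BigInt64Array.from([1n]), [1, 1]), // ones
    box_masks: new ort.Tensor('bool', Uint8Array.from([1]), [1, 1]),       // ones (true)
  };
}
```

These feeds are then merged with the image-encoder and language-encoder outputs before calling the decoder session's `run`.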

## Performance

| Environment | Score | Total time |
|-------------|-------|------------|
| Python FP32 (CPU) | 0.9471 | 7.6 s |
| Python INT8 (CPU) | 0.9495 | 4.5 s |
| Browser WASM | 0.9402 | 94.7 s |
| Browser WebGPU | ~0.94 (est.) | ~6-18 s (est.) |

## Source

Original SAM3 ONNX models from `vietanhdev/segment-anything-3-onnx-models`, quantized for browser deployment.

## Live demo

rusen.ai/demos/segment-anything
