EfficientSAM3 OpenVINO

Pre-exported OpenVINO IR models for EfficientSAM3, enabling fast zero-shot and few-shot object detection on Intel hardware. Two backbones are available, each in several precision variants.

Backbones

Backbone	Parameters	Notes
EfficientViT-B1	~50M	Default backbone, best overall accuracy
RepViT-M1.1	~30M	Lighter backbone, GPU-friendly (no NaN in FP16 vision encoder)

Variants

Variant	Description
openvino-fp16	FP16 weights, highest accuracy baseline
openvino-int8_sym	INT8 symmetric weight-only compression (~47% smaller, up to ~10% faster)
openvino-int8_asym	INT8 asymmetric weight-only compression
openvino-int8_ptq_gpu	Full W8A8 PTQ with calibration (1.5-1.7x faster, GPU-targeted)
onnx	Original ONNX exports (converted automatically by the OpenVINO runtime at load time)

Benchmark Results (Classic Mode)

EfficientViT-B1

Intel B60 GPU (BMG-G31) + CPU decoder

Dataset	FP16	INT8_SYM	INT8_PTQ	F1 (PTQ)
Potatoes (10 img)	822 ms	840 ms	536 ms	1.000
Nuts (21 img)	1531 ms	1399 ms	925 ms	0.615
Candies (12 img)	829 ms	765 ms	543 ms	0.994

Intel 12900K CPU

Dataset	FP16	INT8_SYM	INT8_PTQ	F1 (PTQ)
Potatoes	1095 ms	1028 ms	634 ms	1.000
Nuts	1751 ms	1610 ms	1056 ms	0.575
Candies	1083 ms	1044 ms	672 ms	0.927

RepViT-M1.1

Intel B60 GPU (BMG-G31) + CPU decoder

Dataset	FP16	INT8_SYM	INT8_ASYM	INT8_PTQ	F1 (PTQ)
Potatoes (10 img)	852 ms	801 ms	766 ms	533 ms	1.000
Nuts (21 img)	1511 ms	1449 ms	1408 ms	923 ms	0.632
Candies (12 img)	839 ms	770 ms	768 ms	518 ms	0.994

Intel 12900K CPU

Dataset	FP16	INT8_SYM	INT8_ASYM	INT8_PTQ	F1 (PTQ)
Potatoes	1065 ms	1048 ms	1082 ms	642 ms	1.000
Nuts	1770 ms	1671 ms	1686 ms	1056 ms	0.575
Candies	1082 ms	1035 ms	1026 ms	652 ms	0.927

Note: GPU mode uses a hybrid configuration (vision/text/geometry encoders on GPU with f32 precision hint, prompt-decoder on CPU) to work around Intel GPU plugin numerical issues.
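The "1.5-1.7x faster" figure quoted for the INT8_PTQ variant can be checked against the tables above. The sketch below copies the FP16 and INT8_PTQ latencies from those tables and divides one by the other; the dictionary layout and the function name ptq_speedups are illustrative only and are not part of this repository.

```python
# (FP16 ms, INT8_PTQ ms) pairs copied from the benchmark tables above.
LATENCY_MS = {
    ("EfficientViT-B1", "B60 GPU"): {
        "Potatoes": (822, 536), "Nuts": (1531, 925), "Candies": (829, 543),
    },
    ("EfficientViT-B1", "12900K CPU"): {
        "Potatoes": (1095, 634), "Nuts": (1751, 1056), "Candies": (1083, 672),
    },
    ("RepViT-M1.1", "B60 GPU"): {
        "Potatoes": (852, 533), "Nuts": (1511, 923), "Candies": (839, 518),
    },
    ("RepViT-M1.1", "12900K CPU"): {
        "Potatoes": (1065, 642), "Nuts": (1770, 1056), "Candies": (1082, 652),
    },
}


def ptq_speedups(table):
    """Return {(backbone, device, dataset): FP16 latency / INT8_PTQ latency}."""
    return {
        (backbone, device, dataset): fp16 / ptq
        for (backbone, device), rows in table.items()
        for dataset, (fp16, ptq) in rows.items()
    }


ratios = ptq_speedups(LATENCY_MS)
for key, ratio in sorted(ratios.items()):
    print(key, f"{ratio:.2f}x")
print(f"range: {min(ratios.values()):.2f}x to {max(ratios.values()):.2f}x")
```

With these numbers every ratio lands between roughly 1.5x and 1.7x, on both the GPU and the CPU rows.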

Sub-models (5-model split)

Each backbone produces the same 5 sub-models:

Sub-model	Purpose
vision-encoder	Backbone + FPN feature extraction
text-encoder	MobileCLIP-S1 + projection for text prompts
geometry-encoder	Classic box/point prompt encoding
geometry-encoder-exemplar	Visual exemplar prompt encoding
prompt-decoder	DETR encoder/decoder + box head + scoring
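As a rough sketch of how the five sub-models might be read with the OpenVINO Python API: the directory layout (one file per sub-model, named after the sub-model) and the helper names sub_model_paths and read_sub_models are assumptions made for illustration, so check the actual file names in the repository before relying on them. Only ov.Core().read_model is a real OpenVINO call here; it accepts both IR (.xml) and ONNX (.onnx) files.

```python
from pathlib import Path
from typing import Dict

# Sub-model names taken from the table above.
SUB_MODELS = (
    "vision-encoder",
    "text-encoder",
    "geometry-encoder",
    "geometry-encoder-exemplar",
    "prompt-decoder",
)


def sub_model_paths(variant_dir, suffix: str = ".xml") -> Dict[str, Path]:
    """Map each sub-model name to a file path.

    ASSUMPTION: files are named '<sub-model><suffix>' inside variant_dir.
    The real repository layout may differ.
    """
    root = Path(variant_dir)
    return {name: root / f"{name}{suffix}" for name in SUB_MODELS}


def read_sub_models(variant_dir, suffix: str = ".xml"):
    """Read all five sub-models; openvino is imported lazily."""
    import openvino as ov

    core = ov.Core()
    return {
        name: core.read_model(str(path))
        for name, path in sub_model_paths(variant_dir, suffix).items()
    }
```

For the onnx variant, passing suffix=".onnx" would hand the ONNX files to read_model directly, under the same naming assumption.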

Intel GPU Workarounds

When running on Intel GPUs (Arc/Xe/Battlemage), two workarounds are automatically applied:

FP16 overflow: Some sub-models produce NaN or garbage outputs in FP16 on GPU. Fixed by compiling them with INFERENCE_PRECISION_HINT=f32.
Decoder GPU numerical drift: Prompt-decoder logits drift relative to CPU regardless of precision. Fixed by running the decoder on CPU (~60 MB model, minimal latency impact).

Both workarounds are applied by default and need no user configuration.
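For readers who want to reproduce the hybrid placement by hand, the two workarounds could be sketched as a per-sub-model device plan. This is a minimal sketch, not the loader shipped with these models: hybrid_plan and compile_sub_models are hypothetical names, and the sketch assumes that all four encoders (including geometry-encoder-exemplar) go to the GPU with the f32 hint. The OpenVINO calls used (ov.Core, compile_model with a device name and a config dict) are real.

```python
from typing import Dict, Tuple

SUB_MODELS = (
    "vision-encoder",
    "text-encoder",
    "geometry-encoder",
    "geometry-encoder-exemplar",
    "prompt-decoder",
)

Plan = Dict[str, Tuple[str, Dict[str, str]]]


def hybrid_plan(target: str = "GPU") -> Plan:
    """Return {sub-model: (device, compile config)} for the requested target."""
    plan: Plan = {}
    for name in SUB_MODELS:
        if target != "GPU":
            # Non-GPU targets need neither workaround.
            plan[name] = (target, {})
        elif name == "prompt-decoder":
            # Workaround 2: keep the decoder on CPU to avoid logit drift.
            plan[name] = ("CPU", {})
        else:
            # Workaround 1: force f32 execution to avoid FP16 NaN/garbage.
            plan[name] = ("GPU", {"INFERENCE_PRECISION_HINT": "f32"})
    return plan


def compile_sub_models(models, target: str = "GPU"):
    """Compile already-read ov.Model objects according to hybrid_plan."""
    import openvino as ov  # imported lazily so the plan is usable without it

    core = ov.Core()
    plan = hybrid_plan(target)
    return {
        name: core.compile_model(model, plan[name][0], plan[name][1])
        for name, model in models.items()
    }
```

Keeping the plan as plain data makes it easy to inspect or override a single sub-model before anything is compiled.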

Requirements

OpenVINO >= 2025.3.0
NNCF >= 3.1.0 (for PTQ only)
transformers (for CLIP tokenizer)
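A quick way to confirm that an environment meets these minimums is sketched below. It relies only on the standard library; the helper names (parse_version, meets_minimum, check_requirements) are illustrative, and the pip distribution names "openvino", "nncf", and "transformers" are assumed to match how the packages were installed.

```python
import re
from importlib import metadata

# Minimum versions from the list above; transformers has no stated minimum.
MINIMUMS = {"openvino": (2025, 3, 0), "nncf": (3, 1, 0), "transformers": None}


def parse_version(text: str):
    """Extract (major, minor, patch) from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed: str, minimum) -> bool:
    """True if the installed version is at least the minimum tuple."""
    return parse_version(installed) >= minimum


def check_requirements(need_ptq: bool = False):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    for package, minimum in MINIMUMS.items():
        if package == "nncf" and not need_ptq:
            continue  # NNCF is only needed for PTQ
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if minimum is not None and not meets_minimum(installed, minimum):
            wanted = ".".join(map(str, minimum))
            problems.append(f"{package} {installed} is older than {wanted}")
    return problems


print(check_requirements(need_ptq=False) or "all requirements satisfied")
```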

License

Apache-2.0
