EfficientSAM3 OpenVINO
Pre-exported OpenVINO IR models for EfficientSAM3, enabling fast zero-shot and few-shot object detection on Intel hardware. Multiple backbones and precision variants are provided.
Backbones
| Backbone | Parameters | Notes |
|---|---|---|
| EfficientViT-B1 | ~50M | Default backbone, best overall accuracy |
| RepViT-M1.1 | ~30M | Lighter backbone, GPU-friendly (no NaN in FP16 vision encoder) |
Variants
| Variant | Description |
|---|---|
| openvino-fp16 | FP16 weights, highest accuracy baseline |
| openvino-int8_sym | INT8 symmetric weight-only compression (~47% smaller, ~10% faster) |
| openvino-int8_asym | INT8 asymmetric weight-only compression |
| openvino-int8_ptq_gpu | Full W8A8 PTQ with calibration (1.5-1.7x faster, GPU-targeted) |
| onnx | Original ONNX exports (auto-converted by the OpenVINO runtime at load time) |
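As a rough illustration of how weight-only variants like the two INT8 ones above can be produced, the sketch below maps the variant names to NNCF weight-compression modes and applies `nncf.compress_weights` to a single IR file. The file paths and the helper names (`VARIANT_TO_MODE`, `compress_ir`) are assumptions made for illustration; this card does not state the exact export commands that were used, and the PTQ variant additionally needs a calibration dataset, which is not shown here.

```python
from pathlib import Path

# Variant name (from the table above) -> name of the NNCF weight-compression mode.
VARIANT_TO_MODE = {
    "openvino-int8_sym": "INT8_SYM",
    "openvino-int8_asym": "INT8_ASYM",
}


def compress_ir(src_xml: str, dst_xml: str, variant: str) -> Path:
    """Weight-only compress one IR model. Both paths are hypothetical."""
    # Imported lazily so the mapping above is usable without OpenVINO/NNCF installed.
    import nncf
    import openvino as ov

    mode = getattr(nncf.CompressWeightsMode, VARIANT_TO_MODE[variant])
    model = ov.Core().read_model(src_xml)
    compressed = nncf.compress_weights(model, mode=mode)
    ov.save_model(compressed, dst_xml)
    return Path(dst_xml)
```

A call might look like `compress_ir("fp16/vision-encoder.xml", "int8_sym/vision-encoder.xml", "openvino-int8_sym")`, where both paths are placeholders.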
Benchmark Results (Classic Mode)
EfficientViT-B1
Intel B60 GPU (BMG-G31) + CPU decoder
| Dataset | FP16 | INT8_SYM | INT8_PTQ | F1 (PTQ) |
|---|---|---|---|---|
| Potatoes (10 img) | 822 ms | 840 ms | 536 ms | 1.000 |
| Nuts (21 img) | 1531 ms | 1399 ms | 925 ms | 0.615 |
| Candies (12 img) | 829 ms | 765 ms | 543 ms | 0.994 |
Intel 12900K CPU
| Dataset | FP16 | INT8_SYM | INT8_PTQ | F1 (PTQ) |
|---|---|---|---|---|
| Potatoes | 1095 ms | 1028 ms | 634 ms | 1.000 |
| Nuts | 1751 ms | 1610 ms | 1056 ms | 0.575 |
| Candies | 1083 ms | 1044 ms | 672 ms | 0.927 |
RepViT-M1.1
Intel B60 GPU (BMG-G31) + CPU decoder
| Dataset | FP16 | INT8_SYM | INT8_ASYM | INT8_PTQ | F1 (PTQ) |
|---|---|---|---|---|---|
| Potatoes (10 img) | 852 ms | 801 ms | 766 ms | 533 ms | 1.000 |
| Nuts (21 img) | 1511 ms | 1449 ms | 1408 ms | 923 ms | 0.632 |
| Candies (12 img) | 839 ms | 770 ms | 768 ms | 518 ms | 0.994 |
Intel 12900K CPU
| Dataset | FP16 | INT8_SYM | INT8_ASYM | INT8_PTQ | F1 (PTQ) |
|---|---|---|---|---|---|
| Potatoes | 1065 ms | 1048 ms | 1082 ms | 642 ms | 1.000 |
| Nuts | 1770 ms | 1671 ms | 1686 ms | 1056 ms | 0.575 |
| Candies | 1082 ms | 1035 ms | 1026 ms | 652 ms | 0.927 |
Note: GPU mode uses a hybrid configuration (vision/text/geometry encoders on GPU with f32 precision hint, prompt-decoder on CPU) to work around Intel GPU plugin numerical issues.
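To relate the latency columns to the speedup range quoted for the PTQ variant, the short calculation below divides the FP16 latency by the INT8_PTQ latency. It uses only the EfficientViT-B1 GPU rows from the tables above; the variable and function names are illustrative.

```python
# Latencies in ms copied from the EfficientViT-B1 GPU table: (FP16, INT8_PTQ).
LATENCY_MS = {
    "Potatoes": (822, 536),
    "Nuts": (1531, 925),
    "Candies": (829, 543),
}


def speedup(fp16_ms: float, quant_ms: float) -> float:
    """Speedup factor of the quantized model relative to FP16."""
    return fp16_ms / quant_ms


speedups = {name: round(speedup(*pair), 2) for name, pair in LATENCY_MS.items()}
print(speedups)  # roughly 1.53x, 1.66x and 1.53x
```

The same calculation applied to the other tables gives factors within the 1.5-1.7x range as well.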
Sub-models (5-model split)
Each backbone produces the same 5 sub-models:
| Sub-model | Purpose |
|---|---|
| vision-encoder | Backbone + FPN feature extraction |
| text-encoder | MobileCLIP-S1 + projection for text prompts |
| geometry-encoder | Classic box/point prompt encoding |
| geometry-encoder-exemplar | Visual exemplar prompt encoding |
| prompt-decoder | DETR encoder/decoder + box head + scoring |
Intel GPU Workarounds
When running on Intel GPUs (Arc/Xe/Battlemage), two workarounds are automatically applied:
- FP16 overflow: Some sub-models produce NaN/garbage in FP16 on GPU. Fixed by compiling with INFERENCE_PRECISION_HINT=f32.
- Decoder GPU numerical drift: Prompt-decoder logits drift vs CPU regardless of precision. Fixed by running the decoder on CPU (~60 MB, minimal latency impact).
Both workarounds are transparent to users: the default settings apply them automatically.
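For readers who compile the sub-models by hand, the hybrid placement described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual loader: the `<name>.xml` file layout, the helper names `device_plan` and `compile_all`, and the choice to place the exemplar geometry encoder on GPU alongside the other encoders are all assumptions. Only `ov.Core().compile_model` and the `INFERENCE_PRECISION_HINT` property are real OpenVINO API.

```python
SUB_MODELS = (
    "vision-encoder",
    "text-encoder",
    "geometry-encoder",
    "geometry-encoder-exemplar",
    "prompt-decoder",
)


def device_plan(target: str) -> dict:
    """Return {sub_model: (device, config)} for target 'CPU' or 'GPU'."""
    plan = {}
    for name in SUB_MODELS:
        if target == "GPU" and name != "prompt-decoder":
            # Encoders run on GPU, forced to f32 to avoid FP16 overflow.
            plan[name] = ("GPU", {"INFERENCE_PRECISION_HINT": "f32"})
        else:
            # The prompt-decoder always stays on CPU to avoid logit drift.
            plan[name] = ("CPU", {})
    return plan


def compile_all(model_dir: str, target: str) -> dict:
    """Compile every sub-model; assumes hypothetical '<model_dir>/<name>.xml' files."""
    import openvino as ov  # imported lazily so device_plan works without OpenVINO

    core = ov.Core()
    return {
        name: core.compile_model(f"{model_dir}/{name}.xml", device, config)
        for name, (device, config) in device_plan(target).items()
    }
```

Keeping the placement in a plain dictionary makes it easy to inspect or override a single sub-model before anything is compiled.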
Requirements
- OpenVINO >= 2025.3.0
- NNCF >= 3.1.0 (for PTQ only)
- transformers (for CLIP tokenizer)
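One way to check an environment against the minimum versions listed above is sketched below, using only the standard library. The helper names are illustrative, and the version parser is deliberately simple: it keeps the leading numeric components and ignores any build or pre-release suffix.

```python
from importlib import metadata

# Minimum versions from the list above (NNCF is needed for PTQ only).
MINIMUMS = {"openvino": (2025, 3, 0), "nncf": (3, 1, 0)}


def parse_version(text: str) -> tuple:
    """Parse leading numeric components, e.g. '2025.3.0-123' -> (2025, 3, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '3.1' so tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_requirements() -> dict:
    """Return {package: 'ok' | 'too old' | 'missing'} for each listed package."""
    status = {}
    for package, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(package))
        except metadata.PackageNotFoundError:
            status[package] = "missing"
            continue
        status[package] = "ok" if installed >= minimum else "too old"
    return status
```

Calling `check_requirements()` returns one status string per package, so a missing NNCF install can be tolerated when only inference is needed.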
License
Apache-2.0