# Efficient Universal Perception Encoder

Paper: 2603.22387
ONNX exports of Meta AI's Efficient Universal Perception Encoder (EUPE) — a single lightweight vision backbone that matches or exceeds domain-specialist models on diverse tasks, designed for on-device / edge deployment.
| File | Architecture | Params | Size | Type |
|---|---|---|---|---|
| `eupe_vitt16.onnx` | ViT-T/16 | 6M | 22.1 MB | FP32 |
| `eupe_vitt16_int8.onnx` | ViT-T/16 | 6M | 5.9 MB | INT8 — smallest |
| `eupe_vits16.onnx` | ViT-S/16 | 21M | 86.6 MB | FP32 |
| `eupe_vits16_int8.onnx` | ViT-S/16 | 21M | 22.2 MB | INT8 |
| `eupe_vitb16.onnx` | ViT-B/16 | 86M | 342.8 MB | FP32 |
| `eupe_vitb16_int8.onnx` | ViT-B/16 | 86M | 86.4 MB | INT8 |
| `eupe_convnext-tiny.onnx` | ConvNeXt-T | 29M | 111.4 MB | FP32 |
| `eupe_convnext-tiny_int8.onnx` | ConvNeXt-T | 29M | 28.2 MB | INT8 |
| `eupe_convnext-small.onnx` | ConvNeXt-S | 50M | 198.0 MB | FP32 |
| `eupe_convnext-small_int8.onnx` | ConvNeXt-S | 50M | 50.2 MB | INT8 |
| `eupe_convnext-base.onnx` | ConvNeXt-B | 89M | 350.4 MB | FP32 |
| `eupe_convnext-base_int8.onnx` | ConvNeXt-B | 89M | 88.5 MB | INT8 |
INT8 models are ~75% smaller than FP32 with negligible accuracy loss.
### ViT models (`vitt16`, `vits16`, `vitb16`)

| | Name | Shape | dtype |
|---|---|---|---|
| Input | `input` | `[batch, 3, 224, 224]` | float32 |
| Output 0 | `cls_token` | `[batch, D]` | float32 |
| Output 1 | `patch_tokens` | `[batch, 196, D]` | float32 |

Where D = 192 (ViT-T) / 384 (ViT-S) / 768 (ViT-B).
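Since the ViT variants split a 224×224 input into 16×16 patches, the 196 patch tokens form a 14×14 grid; dense tasks typically reshape them back into a spatial feature map. A minimal sketch with a synthetic `patch_tokens` array (D = 192 here, as for ViT-T):

```python
import numpy as np

# Synthetic patch tokens for one image: 196 = (224 / 16) ** 2 patches, D = 192 (ViT-T)
patch_tokens = np.random.rand(1, 196, 192).astype(np.float32)

# Recover the 14x14 spatial grid (channels-last)
grid = patch_tokens.reshape(1, 14, 14, 192)

# Channels-first (NCHW) layout, as expected by most convolutional heads
fmap = grid.transpose(0, 3, 1, 2)
print(fmap.shape)  # (1, 192, 14, 14)
```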
### ConvNeXt models (`convnext-tiny`, `convnext-small`, `convnext-base`)

| | Name | Shape | dtype |
|---|---|---|---|
| Input | `input` | `[batch, 3, 224, 224]` | float32 |
| Output | `features` | `[batch, D]` | float32 |

Where D = 768 (Tiny/Small) / 1024 (Base).
Preprocessing: ImageNet normalisation — mean `[0.485, 0.456, 0.406]`, std `[0.229, 0.224, 0.225]`.
```python
import onnxruntime as ort
import numpy as np
from PIL import Image
from torchvision.transforms import v2
import torch

# Load model
sess = ort.InferenceSession(
    "eupe_vitt16_int8.onnx",
    providers=["CPUExecutionProvider"],
)

# Preprocess image
transform = v2.Compose([
    v2.ToImage(),
    v2.Resize((224, 224), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
img = Image.open("image.jpg").convert("RGB")
inp = transform(img)[None].numpy()  # (1, 3, 224, 224)

# Inference
cls_token, patch_tokens = sess.run(None, {"input": inp})
print(cls_token.shape)     # (1, 192)
print(patch_tokens.shape)  # (1, 196, 192)
```
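If PyTorch/torchvision are unavailable at deployment time, the same preprocessing can be done with PIL and NumPy alone. A minimal equivalent sketch (resize, scale to [0, 1], ImageNet normalisation, NCHW layout); the `preprocess` helper name is illustrative:

```python
import numpy as np
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: Image.Image) -> np.ndarray:
    """Turn an RGB image into a (1, 3, 224, 224) float32 batch."""
    img = img.convert("RGB").resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0  # (224, 224, 3) in [0, 1]
    x = (x - MEAN) / STD                           # ImageNet normalisation
    return x.transpose(2, 0, 1)[None]              # HWC -> NCHW, add batch dim

inp = preprocess(Image.new("RGB", (640, 480)))
print(inp.shape, inp.dtype)  # (1, 3, 224, 224) float32
```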
| Task | How |
|---|---|
| Image classification | k-NN on cls_token |
| Image similarity / retrieval | Cosine similarity of cls_token |
| Depth estimation | Linear layer on patch_tokens |
| Semantic segmentation | Linear layer on patch_tokens |
| Visual QA | Feed patch_tokens into a language model |
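For instance, image retrieval reduces to cosine similarity between `cls_token` embeddings. A minimal sketch using synthetic embeddings standing in for real model outputs (D = 192, as for ViT-T):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from sess.run(...): a gallery of 100 embeddings plus a
# query that is a near-duplicate of gallery image 42
gallery = rng.standard_normal((100, 192)).astype(np.float32)
query = gallery[42] + 0.01 * rng.standard_normal(192).astype(np.float32)

# L2-normalise so the dot product equals cosine similarity
gallery_n = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

# Rank gallery images by similarity to the query
sims = gallery_n @ query_n
top5 = np.argsort(sims)[::-1][:5]
print(top5[0])  # 42 — the near-duplicate ranks first
```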
| Platform | Recommended |
|---|---|
| Android / iOS | INT8 + ONNX Runtime Mobile |
| Raspberry Pi / Jetson Nano | INT8 |
| Browser | FP32 via onnxruntime-web |
| Server / Cloud | FP32 |
Exported with `torch.onnx.export` (legacy TorchScript path, opset 16); quantized with `onnxruntime.quantization.quantize_dynamic` (QInt8 weights).

FAIR Noncommercial Research License. See original repo for details.