EUPE: All 6 Models in ONNX (FP32 + INT8)

ONNX exports of Meta AI's Efficient Universal Perception Encoder (EUPE): a single lightweight vision backbone that matches or exceeds domain-specialist models on diverse tasks, designed for on-device / edge deployment.


Files

| File | Architecture | Params | Size | Type |
|------|--------------|--------|------|------|
| eupe_vitt16.onnx | ViT-T/16 | 6M | 22.1 MB | FP32 |
| eupe_vitt16_int8.onnx | ViT-T/16 | 6M | 5.9 MB | INT8 ✅ smallest |
| eupe_vits16.onnx | ViT-S/16 | 21M | 86.6 MB | FP32 |
| eupe_vits16_int8.onnx | ViT-S/16 | 21M | 22.2 MB | INT8 |
| eupe_vitb16.onnx | ViT-B/16 | 86M | 342.8 MB | FP32 |
| eupe_vitb16_int8.onnx | ViT-B/16 | 86M | 86.4 MB | INT8 |
| eupe_convnext-tiny.onnx | ConvNeXt-T | 29M | 111.4 MB | FP32 |
| eupe_convnext-tiny_int8.onnx | ConvNeXt-T | 29M | 28.2 MB | INT8 |
| eupe_convnext-small.onnx | ConvNeXt-S | 50M | 198.0 MB | FP32 |
| eupe_convnext-small_int8.onnx | ConvNeXt-S | 50M | 50.2 MB | INT8 |
| eupe_convnext-base.onnx | ConvNeXt-B | 89M | 350.4 MB | FP32 |
| eupe_convnext-base_int8.onnx | ConvNeXt-B | 89M | 88.5 MB | INT8 |

INT8 models are ~75% smaller than FP32 with negligible accuracy loss.


Inputs & Outputs

ViT models (vitt16, vits16, vitb16)

| | Name | Shape | dtype |
|---|------|-------|-------|
| Input | input | [batch, 3, 224, 224] | float32 |
| Output 0 | cls_token | [batch, D] | float32 |
| Output 1 | patch_tokens | [batch, 196, D] | float32 |

Where D = 192 (ViT-T) / 384 (ViT-S) / 768 (ViT-B)
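The 196 patch tokens correspond to a 14 × 14 grid of 16-pixel patches, which is what makes dense tasks possible. A minimal NumPy sketch of turning patch tokens into a per-patch prediction map, with a randomly initialised linear head standing in for a trained one (assumes ViT-S, D = 384; num_classes is illustrative):

```python
import numpy as np

D = 384                                  # ViT-S embedding width
patch_tokens = np.random.randn(1, 196, D).astype(np.float32)  # stand-in for model output

# Hypothetical linear head: maps each patch embedding to class logits.
num_classes = 21
W = np.random.randn(D, num_classes).astype(np.float32) * 0.02
logits = patch_tokens @ W                # (1, 196, num_classes)

# Reshape the 196 token positions back into the 14x14 patch grid.
seg_map = logits.reshape(1, 14, 14, num_classes).argmax(-1)
print(seg_map.shape)                     # (1, 14, 14)
```

In practice the head's weights would come from linear probing on a labeled dataset; the reshape step is the same either way.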

ConvNeXt models (convnext-tiny, convnext-small, convnext-base)

| | Name | Shape | dtype |
|---|------|-------|-------|
| Input | input | [batch, 3, 224, 224] | float32 |
| Output | features | [batch, D] | float32 |

Where D = 768 (Tiny/Small) / 1024 (Base)

Preprocessing: ImageNet normalisation with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225]
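On devices where torch/torchvision are unavailable, the same preprocessing can be done with PIL + NumPy alone. A sketch (constants match the normalisation above):

```python
import numpy as np
from PIL import Image

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: Image.Image) -> np.ndarray:
    """Resize to 224x224, scale to [0, 1], normalise, return NCHW float32."""
    img = img.convert("RGB").resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0   # (224, 224, 3) in [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD          # per-channel normalise
    return x.transpose(2, 0, 1)[None]               # (1, 3, 224, 224)

# inp = preprocess(Image.open("image.jpg"))
```

Note this uses bilinear resizing; torchvision's antialiased resize can differ slightly at the pixel level, which rarely matters for embedding quality.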


Quick Start

```python
import onnxruntime as ort
import numpy as np
from PIL import Image
from torchvision.transforms import v2
import torch

# Load model
sess = ort.InferenceSession(
    "eupe_vitt16_int8.onnx",
    providers=["CPUExecutionProvider"]
)

# Preprocess image
transform = v2.Compose([
    v2.ToImage(),
    v2.Resize((224, 224), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
img = Image.open("image.jpg").convert("RGB")
inp = transform(img)[None].numpy()   # (1, 3, 224, 224)

# Inference
cls_token, patch_tokens = sess.run(None, {"input": inp})
print(cls_token.shape)     # (1, 192)
print(patch_tokens.shape)  # (1, 196, 192)
```

What can you do with the outputs?

| Task | How |
|------|-----|
| Image classification | k-NN on cls_token |
| Image similarity / retrieval | Cosine similarity of cls_token |
| Depth estimation | Linear layer on patch_tokens |
| Semantic segmentation | Linear layer on patch_tokens |
| Visual QA | Feed patch_tokens into a language model |
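To illustrate the retrieval row: cosine similarity over cls_token embeddings, with random vectors standing in for real model outputs (ViT-T width, D = 192):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Stand-ins for cls_token outputs; in practice these come from sess.run().
query = np.random.randn(1, 192).astype(np.float32)
gallery = np.random.randn(100, 192).astype(np.float32)

sims = cosine_sim(query, gallery)        # (1, 100)
top5 = np.argsort(-sims[0])[:5]          # indices of the 5 most similar images
print(top5)
```

k-NN classification is the same computation: take the labels of the top-k gallery hits and majority-vote.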

Edge Deployment Targets

| Platform | Recommended |
|----------|-------------|
| Android / iOS | INT8 + ONNX Runtime Mobile |
| Raspberry Pi / Jetson Nano | INT8 |
| Browser | FP32 via onnxruntime-web |
| Server / Cloud | FP32 |

Conversion Details

  • Exported with torch.onnx.export (legacy TorchScript path, opset 16)
  • Quantized with onnxruntime.quantization.quantize_dynamic (QInt8 weights)
  • Validated: max absolute diff FP32 vs ONNX < 0.0001 on all models

License

FAIR Noncommercial Research License. See original repo for details.
