# Efficient Universal Perception Encoder

Paper: 2603.22387
ONNX exports of Meta AI's Efficient Universal Perception Encoder (EUPE) — a single lightweight vision backbone that matches or exceeds domain-specialist models on diverse tasks, designed for on-device / edge deployment.
| File | Architecture | Params | Size | Type |
|---|---|---|---|---|
| `eupe_vitt16.onnx` | ViT-T/16 | 6M | 22.1 MB | FP32 |
| `eupe_vitt16_int8.onnx` | ViT-T/16 | 6M | 5.9 MB | INT8 — smallest |
| `eupe_vits16.onnx` | ViT-S/16 | 21M | 86.6 MB | FP32 |
| `eupe_vits16_int8.onnx` | ViT-S/16 | 21M | 22.2 MB | INT8 |
| `eupe_vitb16.onnx` | ViT-B/16 | 86M | 342.8 MB | FP32 |
| `eupe_vitb16_int8.onnx` | ViT-B/16 | 86M | 86.4 MB | INT8 |
| `eupe_convnext-tiny.onnx` | ConvNeXt-T | 29M | 111.4 MB | FP32 |
| `eupe_convnext-tiny_int8.onnx` | ConvNeXt-T | 29M | 28.2 MB | INT8 |
| `eupe_convnext-small.onnx` | ConvNeXt-S | 50M | 198.0 MB | FP32 |
| `eupe_convnext-small_int8.onnx` | ConvNeXt-S | 50M | 50.2 MB | INT8 |
| `eupe_convnext-base.onnx` | ConvNeXt-B | 89M | 350.4 MB | FP32 |
| `eupe_convnext-base_int8.onnx` | ConvNeXt-B | 89M | 88.5 MB | INT8 |
INT8 models are ~75% smaller than FP32 with negligible accuracy loss.
### ViT models (`vitt16`, `vits16`, `vitb16`)

| | Name | Shape | dtype |
|---|---|---|---|
| Input | `input` | `[batch, 3, 224, 224]` | float32 |
| Output 0 | `cls_token` | `[batch, D]` | float32 |
| Output 1 | `patch_tokens` | `[batch, 196, D]` | float32 |

Where D = 192 (ViT-T) / 384 (ViT-S) / 768 (ViT-B).
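Since the ViT variants split a 224×224 input into 16×16 patches, the 196 patch tokens form a 14×14 grid; dense tasks typically reshape them back into a spatial feature map. A minimal sketch with a synthetic `patch_tokens` array (D = 192 here, as for ViT-T):

```python
import numpy as np

# Synthetic patch tokens for one image: 196 = (224 / 16) ** 2 patches, D = 192 (ViT-T)
patch_tokens = np.random.rand(1, 196, 192).astype(np.float32)

# Recover the 14x14 spatial grid (channels-last)
grid = patch_tokens.reshape(1, 14, 14, 192)

# Channels-first (NCHW) layout, as expected by most convolutional heads
fmap = grid.transpose(0, 3, 1, 2)
print(fmap.shape)  # (1, 192, 14, 14)
```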
### ConvNeXt models (`convnext-tiny`, `convnext-small`, `convnext-base`)

| | Name | Shape | dtype |
|---|---|---|---|
| Input | `input` | `[batch, 3, 224, 224]` | float32 |
| Output | `features` | `[batch, D]` | float32 |

Where D = 768 (Tiny/Small) / 1024 (Base).
Preprocessing: ImageNet normalisation — mean `[0.485, 0.456, 0.406]`, std `[0.229, 0.224, 0.225]`.
```python
import onnxruntime as ort
import numpy as np
from PIL import Image
from torchvision.transforms import v2
import torch

# Load model
sess = ort.InferenceSession(
    "eupe_vitt16_int8.onnx",
    providers=["CPUExecutionProvider"],
)

# Preprocess image
transform = v2.Compose([
    v2.ToImage(),
    v2.Resize((224, 224), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
img = Image.open("image.jpg").convert("RGB")
inp = transform(img)[None].numpy()  # (1, 3, 224, 224)

# Inference
cls_token, patch_tokens = sess.run(None, {"input": inp})
print(cls_token.shape)     # (1, 192)
print(patch_tokens.shape)  # (1, 196, 192)
```
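If PyTorch/torchvision are unavailable at deployment time, the same preprocessing can be done with PIL and NumPy alone. A minimal equivalent sketch (resize, scale to [0, 1], ImageNet normalisation, NCHW layout); the `preprocess` helper name is illustrative:

```python
import numpy as np
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: Image.Image) -> np.ndarray:
    """Turn an RGB image into a (1, 3, 224, 224) float32 batch."""
    img = img.convert("RGB").resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0  # (224, 224, 3) in [0, 1]
    x = (x - MEAN) / STD                           # ImageNet normalisation
    return x.transpose(2, 0, 1)[None]              # HWC -> NCHW, add batch dim

inp = preprocess(Image.new("RGB", (640, 480)))
print(inp.shape, inp.dtype)  # (1, 3, 224, 224) float32
```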
| Task | How |
|---|---|
| Image classification | k-NN on cls_token |
| Image similarity / retrieval | Cosine similarity of cls_token |
| Depth estimation | Linear layer on patch_tokens |
| Semantic segmentation | Linear layer on patch_tokens |
| Visual QA | Feed patch_tokens into a language model |
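For instance, image retrieval reduces to cosine similarity between `cls_token` embeddings. A minimal sketch using synthetic embeddings standing in for real model outputs (D = 192, as for ViT-T):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from sess.run(...): a gallery of 100 embeddings plus a
# query that is a near-duplicate of gallery image 42
gallery = rng.standard_normal((100, 192)).astype(np.float32)
query = gallery[42] + 0.01 * rng.standard_normal(192).astype(np.float32)

# L2-normalise so the dot product equals cosine similarity
gallery_n = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

# Rank gallery images by similarity to the query
sims = gallery_n @ query_n
top5 = np.argsort(sims)[::-1][:5]
print(top5[0])  # 42 — the near-duplicate ranks first
```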
| Platform | Recommended |
|---|---|
| Android / iOS | INT8 + ONNX Runtime Mobile |
| Raspberry Pi / Jetson Nano | INT8 |
| Browser | FP32 via onnxruntime-web |
| Server / Cloud | FP32 |
Exported with `torch.onnx.export` (legacy TorchScript path, opset 16); quantized with `onnxruntime.quantization.quantize_dynamic` (QInt8 weights).

FAIR Noncommercial Research License. See original repo for details.