---
tags:
- eupe
license: fair-noncommercial-research-license
language:
- en
---
# Model Card for EUPE

Running AI models on smart edge devices can unlock various user experiences, but presents challenges
due to limited compute and the need to handle multiple tasks simultaneously. This requires a vision
encoder with small size but powerful and versatile representations. We present our method, Efficient
Universal Perception Encoder (EUPE), which offers both inference efficiency and universally good
representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert
foundation vision encoders. Unlike previous agglomerative methods that directly scale down from
multiple teachers to an efficient encoder, we demonstrate the importance of first scaling up to a large
proxy teacher and then distilling from this single teacher. Experiments show that EUPE achieves
on-par or better performance than individual domain experts of the same size on diverse task domains
and also outperforms previous agglomerative encoders.

## Model Details

These are Vision Transformer and ConvNeXt models trained following the method described in the EUPE paper. Six models are provided:

- 3 ViT models: ViT-B16, ViT-S16, and ViT-T16
- 3 ConvNeXt models: ConvNeXt-{T/S/B}

Each Transformer-based model takes an image as input and returns a class token and patch tokens. These models follow a ViT architecture with a patch size of 16. For a 224x224 image, this results in 1 class token + 196 patch tokens = 197 tokens.

The models can accept larger images provided the image height and width are multiples of the patch size (16). If this condition is not met, the model will crop to the closest smaller multiple of the patch size.
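
The token count and the cropping rule described above come down to simple arithmetic. The sketch below is illustrative only: `PATCH_SIZE`, `cropped_size`, and `num_tokens` are hypothetical helper names, not part of the EUPE code base, and it assumes the crop simply rounds each side down to a multiple of 16.

```python
PATCH_SIZE = 16  # patch size stated in this model card


def cropped_size(height: int, width: int, patch: int = PATCH_SIZE) -> tuple[int, int]:
    """Round each side down to the closest smaller multiple of the patch size."""
    return (height // patch) * patch, (width // patch) * patch


def num_tokens(height: int, width: int, patch: int = PATCH_SIZE) -> int:
    """One class token plus one patch token per patch of the cropped image."""
    h, w = cropped_size(height, width, patch)
    return 1 + (h // patch) * (w // patch)


print(num_tokens(224, 224))    # 197
print(cropped_size(250, 300))  # (240, 288)
print(num_tokens(250, 300))    # 271, i.e. 1 + 15 * 18
```

Under these assumptions, a 250x300 input would be treated as 240x288, giving a 15x18 grid of patch tokens.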

### Model Description

- **Developed by:** Meta AI
- **Model type:** Vision Transformer, ConvNeXt
- **License:** [FAIR Noncommercial Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/)

### Model Sources

- **Repository:** [https://github.com/facebookresearch/eupe](https://github.com/facebookresearch/eupe)
- **Paper:** [https://arxiv.org/abs/2603.22387](https://arxiv.org/abs/2603.22387)

## Uses

The models are vision backbones providing multi-purpose features for downstream tasks, and are especially suited to multi-task settings under a limited compute budget.
The models can be used without fine-tuning, with downstream modules ranging from non-parametric operators and simple linear layers to heavier language decoders, to obtain competitive results:

- on image classification, using k-NN classifiers on the class token
- on semantic 3D keypoint correspondences
- on depth estimation and semantic segmentation, using linear layers
- on visual question answering, by connecting to language models
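
As an illustration of the first item in the list above, here is a minimal sketch of a k-NN classifier on class-token features. It uses plain majority voting over cosine similarity, which is an assumption made for simplicity and not necessarily the protocol used in the paper. `knn_predict` is a hypothetical helper, and the small hand-written tensors stand in for real class tokens.

```python
import torch
import torch.nn.functional as F


def knn_predict(train_feats, train_labels, query_feats, k=5):
    """Majority-vote k-NN over cosine similarity (illustrative, not the paper's protocol)."""
    train = F.normalize(train_feats, dim=1)          # unit-norm rows, shape (N, D)
    query = F.normalize(query_feats, dim=1)          # shape (Q, D)
    sims = query @ train.T                           # cosine similarities, shape (Q, N)
    nn_idx = sims.topk(k, dim=1).indices             # k nearest training samples, shape (Q, k)
    nn_labels = train_labels[nn_idx]                 # their labels, shape (Q, k)
    num_classes = int(train_labels.max()) + 1
    votes = F.one_hot(nn_labels, num_classes).sum(dim=1)  # per-class vote counts, shape (Q, C)
    return votes.argmax(dim=1)


# Toy features standing in for class tokens of labelled images.
train_feats = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = torch.tensor([0, 0, 1, 1])
query_feats = torch.tensor([[1.0, 0.05], [0.05, 1.0]])

preds = knn_predict(train_feats, train_labels, query_feats, k=3)
print(preds.tolist())  # [0, 1]
```

In practice, `train_feats` and `query_feats` would be the class tokens extracted by the backbone for a labelled reference set and for the images to classify.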

## Get Started

Follow the [Installation](https://github.com/facebookresearch/EUPE/tree/main?tab=readme-ov-file#installation) instructions to set up the environment.
Clone the [EUPE repo](https://github.com/facebookresearch/eupe) and download the PyTorch model checkpoints to a local directory.
The example below demonstrates how to obtain the class token and patch tokens for an input image.

```python
import requests
import torch
from PIL import Image
from torchvision.transforms import v2

# Placeholders: point these at your local clone of the EUPE repo and the downloaded checkpoint.
REPO_DIR = "<PATH/TO/A/LOCAL/DIRECTORY/WHERE/THE/EUPE/REPO/WAS/CLONED>"
WEIGHTS = "<PATH/TO/THE/LOCAL/CHECKPOINT>"

def get_img():
    # Download a sample COCO validation image and convert it to RGB.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int = 256):
    # Resize to a square image and normalize with the standard ImageNet mean and std.
    to_tensor = v2.ToImage()
    resize = v2.Resize((resize_size, resize_size), antialias=True)
    to_float = v2.ToDtype(torch.float32, scale=True)
    normalize = v2.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return v2.Compose([to_tensor, resize, to_float, normalize])

# Load the ViT-B/16 backbone from the local clone and move it to the available device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.hub.load(REPO_DIR, 'eupe_vitb16', source='local', weights=WEIGHTS)
model = model.to(device).eval()

img_size = 256
img = get_img()
transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast(device, dtype=torch.bfloat16):
        batch_img = transform(img)[None].to(device)  # add a batch dimension
        outputs = model.forward_features(batch_img)
clstoken, patchtokens = outputs["x_norm_clstoken"], outputs["x_norm_patchtokens"]
```
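
For dense prediction heads, it is often convenient to view the patch tokens as a 2D feature map. The sketch below assumes that `x_norm_patchtokens` has shape `(batch, num_patches, dim)` with patches in row-major order, which is not confirmed by this model card. `tokens_to_feature_map` is a hypothetical helper, and a random tensor with a placeholder embedding width of 768 stands in for real model outputs.

```python
import torch


def tokens_to_feature_map(patchtokens: torch.Tensor, img_size: int, patch_size: int = 16) -> torch.Tensor:
    """Reshape (B, N, D) patch tokens into a (B, D, H, W) feature map (assumes a square, row-major grid)."""
    batch, num_patches, dim = patchtokens.shape
    side = img_size // patch_size
    if side * side != num_patches:
        raise ValueError(f"expected {side * side} patch tokens, got {num_patches}")
    return patchtokens.reshape(batch, side, side, dim).permute(0, 3, 1, 2).contiguous()


# Stand-in for outputs["x_norm_patchtokens"] on a 256x256 input: (256 / 16)^2 = 256 tokens.
dummy_tokens = torch.randn(1, 256, 768)  # 768 is a placeholder embedding width
fmap = tokens_to_feature_map(dummy_tokens, img_size=256)
print(fmap.shape)  # torch.Size([1, 768, 16, 16])
```

A linear layer for segmentation or depth can then be applied per location, for example as a 1x1 convolution over `fmap`.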

## Results

See the associated paper for details on the evaluation protocols.

*Results for ViT backbones*

<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">#Params</th>
      <th colspan="2">Image Understanding</th>
      <th colspan="6">Vision Language Modeling</th>
      <th colspan="3">Dense Prediction</th>
    </tr>
    <tr>
      <th>IN1k-ZS</th>
      <th>IN1k-KNN</th>
      <th>TextVQA</th>
      <th>SQA</th>
      <th>Realworld</th>
      <th>POPE</th>
      <th>GQA</th>
      <th>MMEp</th>
      <th>SPair</th>
      <th>NYUv2↓</th>
      <th>ADE20k</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>EUPE-ViT-T</td>
      <td>6M</td>
      <td>50.5</td>
      <td>66.3</td>
      <td>42.0</td>
      <td>69.5</td>
      <td>50.0</td>
      <td>82.4</td>
      <td>61.4</td>
      <td>1258.0</td>
      <td>37.2</td>
      <td>0.571</td>
      <td>36.7</td>
    </tr>
    <tr>
      <td>EUPE-ViT-S</td>
      <td>20M</td>
      <td>69.8</td>
      <td>78.2</td>
      <td>44.1</td>
      <td>69.3</td>
      <td>51.7</td>
      <td>84.5</td>
      <td>65.0</td>
      <td>1304.9</td>
      <td>46.5</td>
      <td>0.455</td>
      <td>46.6</td>
    </tr>
    <tr>
      <td>EUPE-ViT-B</td>
      <td>86M</td>
      <td>79.7</td>
      <td>84.1</td>
      <td>50.4</td>
      <td>69.7</td>
      <td>55.5</td>
      <td>85.9</td>
      <td>67.3</td>
      <td>1374.5</td>
      <td>51.3</td>
      <td>0.391</td>
      <td>52.4</td>
    </tr>
  </tbody>
</table>

*Results for ConvNeXt backbones*

<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">#Params</th>
      <th colspan="6">Vision Language Modeling</th>
      <th colspan="3">Dense Prediction</th>
    </tr>
    <tr>
      <th>TextVQA</th>
      <th>SQA</th>
      <th>Realworld</th>
      <th>POPE</th>
      <th>GQA</th>
      <th>MMEp</th>
      <th>SPair</th>
      <th>NYUv2↓</th>
      <th>ADE20k</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>EUPE-ConvNeXt-T</td>
      <td>29M</td>
      <td>43.7</td>
      <td>68.8</td>
      <td>47.9</td>
      <td>83.4</td>
      <td>63.0</td>
      <td>1278.1</td>
      <td>41.3</td>
      <td>0.430</td>
      <td>43.5</td>
    </tr>
    <tr>
      <td>EUPE-ConvNeXt-S</td>
      <td>50M</td>
      <td>45.0</td>
      <td>68.9</td>
      <td>50.5</td>
      <td>84.0</td>
      <td>64.7</td>
      <td>1284.2</td>
      <td>40.1</td>
      <td>0.388</td>
      <td>46.8</td>
    </tr>
    <tr>
      <td>EUPE-ConvNeXt-B</td>
      <td>89M</td>
      <td>46.4</td>
      <td>70.1</td>
      <td>53.3</td>
      <td>84.7</td>
      <td>65.8</td>
      <td>1348.9</td>
      <td>37.7</td>
      <td>0.365</td>
      <td>48.9</td>
    </tr>
  </tbody>
</table>

## Citation

**BibTeX**

```bibtex
@misc{zhu2026eupe,
  title={Efficient Universal Perception Encoder},
  author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
  year={2026},
  eprint={2603.22387},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.22387},
}
```