---
tags:
- eupe
license: fair-noncommercial-research-license
language:
- en
---
# Model Card for EUPE

Running AI models on smart edge devices can unlock various user experiences, but it presents challenges due to limited compute and the need to handle multiple tasks simultaneously. This calls for a vision encoder that is small yet provides powerful and versatile representations. We present our method, Efficient Universal Perception Encoder (EUPE), which offers both inference efficiency and universally strong representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert foundation vision encoders. Unlike previous agglomerative methods that distill directly from multiple teachers into an efficient encoder, we demonstrate the importance of first scaling up to a large proxy teacher and then distilling from this single teacher. Experiments show that EUPE achieves on-par or better performance than individual domain experts of the same size on diverse task domains, and it also outperforms previous agglomerative encoders.

## Model Details

These are Vision Transformer and ConvNeXt models trained following the method described in the EUPE paper. Six models are provided:

- 3 ViT models: ViT-B16, ViT-S16, and ViT-T16
- 3 ConvNeXt models: ConvNeXt-{T/S/B}

Each Transformer-based model takes an image as input and returns a class token and patch tokens. These models follow a ViT architecture with a patch size of 16. For a 224x224 image, this results in 1 class token + 196 patch tokens = 197 tokens.

The models can accept larger images provided the image shapes are multiples of the patch size (16). If this condition is not met, the model crops the image to the closest smaller multiple of the patch size.

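The 197-token figure and the cropping rule above are pure patch arithmetic. The sketch below (standalone illustration; `token_count` is a hypothetical helper, not part of the EUPE code) reproduces both:

```python
def token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Number of output tokens: one class token plus one token per patch.

    Sides that are not multiples of the patch size are first cropped down
    to the closest smaller multiple, mirroring the model's behavior.
    """
    h = (height // patch_size) * patch_size
    w = (width // patch_size) * patch_size
    return 1 + (h // patch_size) * (w // patch_size)

print(token_count(224, 224))  # 1 + 14*14 = 197
print(token_count(256, 256))  # 1 + 16*16 = 257
print(token_count(230, 230))  # cropped to 224x224 -> 197
```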
### Model Description

- **Developed by:** Meta AI
- **Model type:** Vision Transformer, ConvNeXt
- **License:** [FAIR Noncommercial Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/)

### Model Sources

- **Repository:** [https://github.com/facebookresearch/eupe](https://github.com/facebookresearch/eupe)
- **Paper:** [https://arxiv.org/abs/2603.22387](https://arxiv.org/abs/2603.22387)

## Uses

The models are vision backbones providing multi-purpose features for downstream tasks, and they are especially suitable for multi-task settings under a limited compute budget.
The models can be used without fine-tuning, with downstream modules ranging from non-parametric operators and simple linear layers to heavier language decoders, to obtain competitive results:

- on image classification, using k-NN classifiers on the class token
- on semantic 3D keypoint correspondences
- on depth estimation and semantic segmentation, using linear layers
- on visual question answering, connecting with language models

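As an example of the non-parametric route, a k-NN classifier over class tokens needs only feature similarity. A minimal sketch in plain PyTorch; the random vectors here are stand-ins for EUPE class tokens, and the shapes and labels are illustrative assumptions:

```python
import torch

def knn_classify(train_feats, train_labels, query_feats, k=5):
    """Cosine-similarity k-NN: each query receives the majority label of
    its k most similar training features (class tokens in this setting)."""
    train = torch.nn.functional.normalize(train_feats, dim=1)
    query = torch.nn.functional.normalize(query_feats, dim=1)
    sims = query @ train.T                 # (Q, N) cosine similarities
    idx = sims.topk(k, dim=1).indices      # indices of the k nearest neighbors
    votes = train_labels[idx]              # (Q, k) neighbor labels
    return votes.mode(dim=1).values        # majority vote per query

# Illustrative stand-ins: 100 "class tokens" of dim 384, 10 classes
torch.manual_seed(0)
train_feats = torch.randn(100, 384)
train_labels = torch.randint(0, 10, (100,))
preds = knn_classify(train_feats, train_labels, train_feats[:5], k=1)
print(preds)  # with k=1, each query is its own nearest neighbor
```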
## Get Started

Follow the [installation instructions](https://github.com/facebookresearch/EUPE/tree/main?tab=readme-ov-file#installation) to set up the environment.
Clone the [EUPE repo](https://github.com/facebookresearch/eupe) and download the PyTorch model checkpoints locally.
The example below demonstrates how to obtain the class token and patch tokens for an input image.

```python
import torch
from PIL import Image
from torchvision.transforms import v2

REPO_DIR = <PATH/TO/A/LOCAL/DIRECTORY/WHERE/THE/EUPE/REPO/WAS/CLONED>

def get_img():
    import requests
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int = 256):
    # Standard ImageNet preprocessing: resize, scale to [0, 1], normalize
    to_tensor = v2.ToImage()
    resize = v2.Resize((resize_size, resize_size), antialias=True)
    to_float = v2.ToDtype(torch.float32, scale=True)
    normalize = v2.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return v2.Compose([to_tensor, resize, to_float, normalize])

# Load the ViT-S16 variant from the local clone of the repo
model = torch.hub.load(REPO_DIR, 'eupe_vits16', source='local', weights=<PATH/TO/THE/LOCAL/CHECKPOINT>)

img_size = 256
img = get_img()
transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast('cuda', dtype=torch.bfloat16):
        batch_img = transform(img)[None]  # add a batch dimension
        outputs = model.forward_features(batch_img)
        clstoken, patchtokens = outputs["x_norm_clstoken"], outputs["x_norm_patchtokens"]
```

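For dense tasks such as segmentation or depth, the flat sequence of patch tokens can be reshaped back into a spatial feature map. A minimal sketch under the shapes produced above (a 256x256 input yields a 16x16 patch grid; the embedding dim 384 assumes a standard ViT-S, and the random tensor stands in for `patchtokens`):

```python
import torch

batch, dim, patch = 1, 384, 16          # 384 = standard ViT-S embedding dim (assumption)
img_size = 256
grid = img_size // patch                # 16 patches per side
patchtokens = torch.randn(batch, grid * grid, dim)  # stand-in for model output

# (B, N, C) -> (B, C, H, W) feature map, ready for a linear/dense head
fmap = patchtokens.transpose(1, 2).reshape(batch, dim, grid, grid)
print(fmap.shape)  # torch.Size([1, 384, 16, 16])
```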
## Results

The reader is referred to the associated paper for details on the evaluation protocols.

*Results for ViT backbones*

<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">#Params</th>
      <th colspan="2">Image Understanding</th>
      <th colspan="6">Vision Language Modeling</th>
      <th colspan="3">Dense Prediction</th>
    </tr>
    <tr>
      <th>IN1k-ZS</th>
      <th>IN1k-KNN</th>
      <th>TextVQA</th>
      <th>SQA</th>
      <th>Realworld</th>
      <th>POPE</th>
      <th>GQA</th>
      <th>MMEp</th>
      <th>SPair</th>
      <th>NYUv2↓</th>
      <th>ADE20k</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>EUPE-ViT-T</td><td>6M</td><td>50.5</td><td>66.3</td><td>42.0</td><td>69.5</td><td>50.0</td><td>82.4</td><td>61.4</td><td>1258.0</td><td>37.2</td><td>0.571</td><td>36.7</td>
    </tr>
    <tr>
      <td>EUPE-ViT-S</td><td>20M</td><td>69.8</td><td>78.2</td><td>44.1</td><td>69.3</td><td>51.7</td><td>84.5</td><td>65.0</td><td>1304.9</td><td>46.5</td><td>0.455</td><td>46.6</td>
    </tr>
    <tr>
      <td>EUPE-ViT-B</td><td>86M</td><td>79.7</td><td>84.1</td><td>50.4</td><td>69.7</td><td>55.5</td><td>85.9</td><td>67.3</td><td>1374.5</td><td>51.3</td><td>0.391</td><td>52.4</td>
    </tr>
  </tbody>
</table>

*Results for ConvNeXt backbones*

<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">#Params</th>
      <th colspan="6">Vision Language Modeling</th>
      <th colspan="3">Dense Prediction</th>
    </tr>
    <tr>
      <th>TextVQA</th>
      <th>SQA</th>
      <th>Realworld</th>
      <th>POPE</th>
      <th>GQA</th>
      <th>MMEp</th>
      <th>SPair</th>
      <th>NYUv2↓</th>
      <th>ADE20k</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>EUPE-ConvNeXt-T</td><td>29M</td><td>43.7</td><td>68.8</td><td>47.9</td><td>83.4</td><td>63.0</td><td>1278.1</td><td>41.3</td><td>0.430</td><td>43.5</td>
    </tr>
    <tr>
      <td>EUPE-ConvNeXt-S</td><td>50M</td><td>45.0</td><td>68.9</td><td>50.5</td><td>84.0</td><td>64.7</td><td>1284.2</td><td>40.1</td><td>0.388</td><td>46.8</td>
    </tr>
    <tr>
      <td>EUPE-ConvNeXt-B</td><td>89M</td><td>46.4</td><td>70.1</td><td>53.3</td><td>84.7</td><td>65.8</td><td>1348.9</td><td>37.7</td><td>0.365</td><td>48.9</td>
    </tr>
  </tbody>
</table>

## Citation

**BibTeX**

```
@misc{zhu2026eupe,
    title={Efficient Universal Perception Encoder},
    author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
    year={2026},
    eprint={2603.22387},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2603.22387},
}
```