---
tags:
- eupe
license: fair-noncommercial-research-license
language:
- en
---
# Model Card for EUPE

Running AI models on smart edge devices can unlock various user experiences, but it presents challenges due to limited compute and the need to handle multiple tasks simultaneously. This calls for a vision encoder that is small yet provides powerful and versatile representations. We present our method, Efficient Universal Perception Encoder (EUPE), which offers both inference efficiency and universally strong representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert foundation vision encoders. Unlike previous agglomerative methods that distill directly from multiple teachers into an efficient encoder, we demonstrate the importance of first scaling up to a large proxy teacher and then distilling from this single teacher. Experiments show that EUPE achieves on-par or better performance than individual domain experts of the same size on diverse task domains, and it also outperforms previous agglomerative encoders.

## Model Details

These are Vision Transformer and ConvNeXt models trained following the method described in the EUPE paper. Six models are provided:

- 3 ViT models: ViT-B16, ViT-S16, and ViT-T16
- 3 ConvNeXt models: ConvNeXt-{T/S/B}

Each Transformer-based model takes an image as input and returns a class token and patch tokens. These models follow a ViT architecture with a patch size of 16. For a 224x224 image, this results in 1 class token + 196 patch tokens = 197 tokens.

The models can accept larger images provided the image shapes are multiples of the patch size (16). If this condition is not met, the model crops the image to the closest smaller multiple of the patch size.

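The 197-token figure and the cropping rule above are pure patch arithmetic. The sketch below (standalone illustration; `token_count` is a hypothetical helper, not part of the EUPE code) reproduces both:

```python
def token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Number of output tokens: one class token plus one token per patch.

    Sides that are not multiples of the patch size are first cropped down
    to the closest smaller multiple, mirroring the model's behavior.
    """
    h = (height // patch_size) * patch_size
    w = (width // patch_size) * patch_size
    return 1 + (h // patch_size) * (w // patch_size)

print(token_count(224, 224))  # 1 + 14*14 = 197
print(token_count(256, 256))  # 1 + 16*16 = 257
print(token_count(230, 230))  # cropped to 224x224 -> 197
```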
### Model Description

- **Developed by:** Meta AI
- **Model type:** Vision Transformer, ConvNeXt
- **License:** [FAIR Noncommercial Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/)

### Model Sources

- **Repository:** [https://github.com/facebookresearch/eupe](https://github.com/facebookresearch/eupe)
- **Paper:** [https://arxiv.org/abs/2603.22387](https://arxiv.org/abs/2603.22387)

## Uses

The models are vision backbones providing multi-purpose features for downstream tasks, and they are especially suitable for multi-task settings under a limited compute budget.
The models can be used without fine-tuning, with downstream modules ranging from non-parametric operators and simple linear layers to heavier language decoders, to obtain competitive results:

- on image classification, using k-NN classifiers on the class token
- on semantic 3D keypoint correspondences
- on depth estimation and semantic segmentation, using linear layers
- on visual question answering, connecting with language models

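As an example of the non-parametric route, a k-NN classifier over class tokens needs only feature similarity. A minimal sketch in plain PyTorch; the random vectors here are stand-ins for EUPE class tokens, and the shapes and labels are illustrative assumptions:

```python
import torch

def knn_classify(train_feats, train_labels, query_feats, k=5):
    """Cosine-similarity k-NN: each query receives the majority label of
    its k most similar training features (class tokens in this setting)."""
    train = torch.nn.functional.normalize(train_feats, dim=1)
    query = torch.nn.functional.normalize(query_feats, dim=1)
    sims = query @ train.T                 # (Q, N) cosine similarities
    idx = sims.topk(k, dim=1).indices      # indices of the k nearest neighbors
    votes = train_labels[idx]              # (Q, k) neighbor labels
    return votes.mode(dim=1).values        # majority vote per query

# Illustrative stand-ins: 100 "class tokens" of dim 384, 10 classes
torch.manual_seed(0)
train_feats = torch.randn(100, 384)
train_labels = torch.randint(0, 10, (100,))
preds = knn_classify(train_feats, train_labels, train_feats[:5], k=1)
print(preds)  # with k=1, each query is its own nearest neighbor
```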
## Get Started

Follow the [installation instructions](https://github.com/facebookresearch/EUPE/tree/main?tab=readme-ov-file#installation) to set up the environment.
Clone the [EUPE repo](https://github.com/facebookresearch/eupe) and download the PyTorch model checkpoints locally.
The example below demonstrates how to obtain the class token and patch tokens for an input image.

```python
import torch
from PIL import Image
from torchvision.transforms import v2

REPO_DIR = <PATH/TO/A/LOCAL/DIRECTORY/WHERE/THE/EUPE/REPO/WAS/CLONED>

def get_img():
    import requests
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int = 256):
    # Standard ImageNet preprocessing: resize, scale to [0, 1], normalize
    to_tensor = v2.ToImage()
    resize = v2.Resize((resize_size, resize_size), antialias=True)
    to_float = v2.ToDtype(torch.float32, scale=True)
    normalize = v2.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return v2.Compose([to_tensor, resize, to_float, normalize])

# Load the ViT-S16 variant from the local clone of the repo
model = torch.hub.load(REPO_DIR, 'eupe_vits16', source='local', weights=<PATH/TO/THE/LOCAL/CHECKPOINT>)

img_size = 256
img = get_img()
transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast('cuda', dtype=torch.bfloat16):
        batch_img = transform(img)[None]  # add a batch dimension
        outputs = model.forward_features(batch_img)
        clstoken, patchtokens = outputs["x_norm_clstoken"], outputs["x_norm_patchtokens"]
```

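For dense tasks such as segmentation or depth, the flat sequence of patch tokens can be reshaped back into a spatial feature map. A minimal sketch under the shapes produced above (a 256x256 input yields a 16x16 patch grid; the embedding dim 384 assumes a standard ViT-S, and the random tensor stands in for `patchtokens`):

```python
import torch

batch, dim, patch = 1, 384, 16          # 384 = standard ViT-S embedding dim (assumption)
img_size = 256
grid = img_size // patch                # 16 patches per side
patchtokens = torch.randn(batch, grid * grid, dim)  # stand-in for model output

# (B, N, C) -> (B, C, H, W) feature map, ready for a linear/dense head
fmap = patchtokens.transpose(1, 2).reshape(batch, dim, grid, grid)
print(fmap.shape)  # torch.Size([1, 384, 16, 16])
```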
## Results

The reader is referred to the associated paper for details on the evaluation protocols.

*Results for ViT backbones*

<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">#Params</th>
      <th colspan="2">Image Understanding</th>
      <th colspan="6">Vision Language Modeling</th>
      <th colspan="3">Dense Prediction</th>
    </tr>
    <tr>
      <th>IN1k-ZS</th>
      <th>IN1k-KNN</th>
      <th>TextVQA</th>
      <th>SQA</th>
      <th>Realworld</th>
      <th>POPE</th>
      <th>GQA</th>
      <th>MMEp</th>
      <th>SPair</th>
      <th>NYUv2↓</th>
      <th>ADE20k</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>EUPE-ViT-T</td><td>6M</td><td>50.5</td><td>66.3</td><td>42.0</td><td>69.5</td><td>50.0</td><td>82.4</td><td>61.4</td><td>1258.0</td><td>37.2</td><td>0.571</td><td>36.7</td>
    </tr>
    <tr>
      <td>EUPE-ViT-S</td><td>20M</td><td>69.8</td><td>78.2</td><td>44.1</td><td>69.3</td><td>51.7</td><td>84.5</td><td>65.0</td><td>1304.9</td><td>46.5</td><td>0.455</td><td>46.6</td>
    </tr>
    <tr>
      <td>EUPE-ViT-B</td><td>86M</td><td>79.7</td><td>84.1</td><td>50.4</td><td>69.7</td><td>55.5</td><td>85.9</td><td>67.3</td><td>1374.5</td><td>51.3</td><td>0.391</td><td>52.4</td>
    </tr>
  </tbody>
</table>

*Results for ConvNeXt backbones*

<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">#Params</th>
      <th colspan="6">Vision Language Modeling</th>
      <th colspan="3">Dense Prediction</th>
    </tr>
    <tr>
      <th>TextVQA</th>
      <th>SQA</th>
      <th>Realworld</th>
      <th>POPE</th>
      <th>GQA</th>
      <th>MMEp</th>
      <th>SPair</th>
      <th>NYUv2↓</th>
      <th>ADE20k</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>EUPE-ConvNeXt-T</td><td>29M</td><td>43.7</td><td>68.8</td><td>47.9</td><td>83.4</td><td>63.0</td><td>1278.1</td><td>41.3</td><td>0.430</td><td>43.5</td>
    </tr>
    <tr>
      <td>EUPE-ConvNeXt-S</td><td>50M</td><td>45.0</td><td>68.9</td><td>50.5</td><td>84.0</td><td>64.7</td><td>1284.2</td><td>40.1</td><td>0.388</td><td>46.8</td>
    </tr>
    <tr>
      <td>EUPE-ConvNeXt-B</td><td>89M</td><td>46.4</td><td>70.1</td><td>53.3</td><td>84.7</td><td>65.8</td><td>1348.9</td><td>37.7</td><td>0.365</td><td>48.9</td>
    </tr>
  </tbody>
</table>

## Citation

**BibTeX**

```
@misc{zhu2026eupe,
    title={Efficient Universal Perception Encoder},
    author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
    year={2026},
    eprint={2603.22387},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2603.22387},
}
```