Abstract
A novel vision encoder framework is presented that learns modality-agnostic feature representations by aligning multi-modal inputs while preserving semantic distinctions from a frozen teacher model.
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes feature similarity between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous," producing a consistent, powerful embedding for a given scene regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
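The dual objective above can be sketched as a simple combined loss. This is a minimal illustration, not the paper's implementation: the exact loss forms (cosine-based alignment and distillation) and the weighting `lam` are assumptions, and `omnivorous_loss` is a hypothetical name.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def omnivorous_loss(student_rgb, student_depth, teacher_rgb, lam=1.0):
    """Sketch of the dual objective (loss forms and weighting are assumptions).

    1) Alignment: pull the student's embeddings of paired modalities together.
    2) Distillation: anchor the student's RGB embedding to the frozen teacher.
    """
    align = 1.0 - cosine(student_rgb, student_depth)
    distill = 1.0 - cosine(student_rgb, teacher_rgb)
    return align + lam * distill

# When all embeddings coincide, both terms vanish and the loss is zero;
# misaligned modality pairs raise the alignment term.
v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
print(omnivorous_loss(v, v, v))  # 0.0
print(omnivorous_loss(v, w, v))  # 1.0
```

In practice the alignment term would be computed over batches of paired multi-modal inputs, and the distillation term over the teacher's patch or [CLS] tokens, but the structure of the objective is the same.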
We adapt DINOv2 into an "omnivorous" encoder that produces consistent embeddings for different input modalities like RGB, depth, and segmentation maps. By aligning paired modalities while anchoring to a frozen DINOv2 teacher, we unlock better cross-modal retrieval and transfer to novel visual modalities, all while preserving DINOv2's pretrained semantics.