Papers
arxiv:2602.24181

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Published on Feb 27
Submitted by Rishabh Kabra on Mar 13
Abstract

A novel vision encoder framework is presented that learns modality-agnostic feature representations by aligning multi-modal inputs while preserving semantic distinctions from a frozen teacher model.

AI-generated summary

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: an alignment objective that maximizes feature similarity between different modalities of the same scene, and a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous," producing a consistent, powerful embedding for a given scene regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
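The dual objective described above can be sketched as a combined loss. This is a minimal illustration, not the paper's implementation: the function names, the squared-error form of the distillation term, and the weighting parameter `lam` are assumptions; the paper only specifies that paired modalities are aligned while a frozen teacher anchors the features.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two (n, d) feature batches."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def omnivorous_loss(student_rgb, student_depth, teacher_rgb, lam=1.0):
    """Hypothetical combined objective for the student encoder.

    student_rgb, student_depth: student features for paired modalities of
        the same scenes, shape (n, d).
    teacher_rgb: features from the fully frozen teacher (e.g. DINOv2).
    lam: assumed weight balancing the two terms (not from the paper).
    """
    # Alignment term: pull paired modalities together by maximizing
    # their cosine similarity (minimizing 1 - cos).
    align = 1.0 - cosine_sim(student_rgb, student_depth)
    # Distillation term: anchor the student to the frozen teacher so the
    # pretrained semantics are preserved (squared error is an assumption).
    distill = np.mean((student_rgb - teacher_rgb) ** 2, axis=-1)
    return float(np.mean(align + lam * distill))
```

When the student's RGB and depth features for a scene coincide with the teacher's output, both terms vanish and the loss is zero, which is the fixed point the training pushes toward.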

Community

We adapt DINOv2 into an "omnivorous" encoder that produces consistent embeddings for different input modalities like RGB, depth, and segmentation maps. By aligning paired modalities while anchoring to a frozen DINOv2 teacher, we unlock better cross-modal retrieval and transfer to novel visual modalities, all while preserving DINOv2's pretrained semantics.
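Once the embedding space is modality-agnostic, cross-modal retrieval reduces to nearest-neighbour search over embeddings from any modality. A minimal sketch, assuming precomputed embeddings (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def cross_modal_retrieve(query, gallery):
    """Retrieve the gallery item most similar to the query by cosine similarity.

    query: (d,) embedding of, e.g., a depth map.
    gallery: (n, d) embeddings of, e.g., RGB images.
    Returns the index of the best match and the full score vector.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q  # cosine similarities, shape (n,)
    return int(np.argmax(scores)), scores
```

With an aligned encoder, a depth-map query should score highest against the RGB image of the same scene; with the unaligned baseline, the abstract notes these scores are indistinguishable from random pairs.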

