Abstract
A novel vision encoder framework is presented that learns modality-agnostic feature representations by aligning multi-modal inputs while preserving semantic distinctions from a frozen teacher model.
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes feature similarity between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous," producing a consistent, powerful embedding for a given scene regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
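The dual objective above can be sketched as a simple combined loss. This is a minimal illustration, not the paper's implementation: the exact loss forms (cosine-based alignment and distillation) and the weighting `lam` are assumptions, and `omnivorous_loss` is a hypothetical name.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def omnivorous_loss(student_rgb, student_depth, teacher_rgb, lam=1.0):
    """Sketch of the dual objective (loss forms and weighting are assumptions).

    1) Alignment: pull the student's embeddings of paired modalities together.
    2) Distillation: anchor the student's RGB embedding to the frozen teacher.
    """
    align = 1.0 - cosine(student_rgb, student_depth)
    distill = 1.0 - cosine(student_rgb, teacher_rgb)
    return align + lam * distill

# When all embeddings coincide, both terms vanish and the loss is zero;
# misaligned modality pairs raise the alignment term.
v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
print(omnivorous_loss(v, v, v))  # 0.0
print(omnivorous_loss(v, w, v))  # 1.0
```

In practice the alignment term would be computed over batches of paired multi-modal inputs, and the distillation term over the teacher's patch or [CLS] tokens, but the structure of the objective is the same.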
We adapt DINOv2 into an "omnivorous" encoder that produces consistent embeddings for different input modalities like RGB, depth, and segmentation maps. By aligning paired modalities while anchoring to a frozen DINOv2 teacher, we unlock better cross-modal retrieval and transfer to novel visual modalities, all while preserving DINOv2's pretrained semantics.