Implicit Neural Representation Facilitates Unified Universal Vision Encoding
Abstract
A unified model learns image representations that serve both recognition and generation by training a hyper-network for implicit neural representation with knowledge distillation, achieving results competitive with the state of the art while enabling generative capabilities through compressed embeddings.
Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models convert images into embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations that are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation (INR), which learns to map images to the weights of an INR for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an exceptionally compact embedding space that performs strongly across a range of visual tasks. The complete model achieves results competitive with the state of the art in image representation learning, while also enabling generative capabilities from its compact, high-quality embeddings. The code is available at https://github.com/tiktok/huvr.
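The core mechanism the abstract describes can be sketched roughly as follows: an encoder compresses an image into a small embedding, a hyper-head maps that embedding to the weights of a tiny coordinate MLP (the INR), and the INR is evaluated at every pixel coordinate to reconstruct the image, with a distillation term pulling the embedding toward a frozen teacher's output. The PyTorch sketch below is a minimal, illustrative version under these assumptions; the names (`HyperINR`, `coordinate_grid`), the layer sizes, the sinusoidal (SIREN-style) activation, and the toy encoder and teacher are all stand-ins, not the paper's actual architecture.

```python
# Minimal sketch of an INR hyper-network: encode image -> compact embedding,
# predict the weights of a small coordinate MLP, evaluate it per pixel.
# All sizes and module choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def coordinate_grid(h, w, device):
    """Pixel coordinates normalized to [-1, 1], shape (h*w, 2)."""
    ys = torch.linspace(-1.0, 1.0, h, device=device)
    xs = torch.linspace(-1.0, 1.0, w, device=device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1).reshape(-1, 2)


class HyperINR(nn.Module):
    """Encodes an image to a compact embedding, then predicts the weights
    of a 2-layer coordinate MLP (the INR) that reconstructs the image."""

    def __init__(self, embed_dim=128, hidden=64):
        super().__init__()
        self.hidden = hidden
        # Toy encoder; the paper would use a much stronger backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Hyper-head: embedding -> flat weight vector for an INR
        # mapping (x, y) -> (r, g, b), biases included.
        n_params = (2 * hidden + hidden) + (hidden * 3 + 3)
        self.hyper_head = nn.Linear(embed_dim, n_params)

    def forward(self, images):
        b, _, h, w = images.shape
        z = self.encoder(images)            # compact embedding
        theta = self.hyper_head(z)          # per-image INR weights
        coords = coordinate_grid(h, w, images.device)

        # Unpack the flat parameter vector into layer weights and biases.
        hdim = self.hidden
        i = 0
        w1 = theta[:, i:i + 2 * hdim].reshape(b, 2, hdim)
        i += 2 * hdim
        b1 = theta[:, i:i + hdim].reshape(b, 1, hdim)
        i += hdim
        w2 = theta[:, i:i + hdim * 3].reshape(b, hdim, 3)
        i += hdim * 3
        b2 = theta[:, i:].reshape(b, 1, 3)

        # Evaluate each image's INR at every pixel coordinate.
        x = coords.unsqueeze(0).expand(b, -1, -1)   # (b, h*w, 2)
        x = torch.sin(torch.bmm(x, w1) + b1)        # sinusoidal activation
        rgb = torch.bmm(x, w2) + b2                 # (b, h*w, 3)
        return z, rgb.transpose(1, 2).reshape(b, 3, h, w)


# One hypothetical training step: pixel reconstruction loss plus a
# distillation loss pulling the embedding toward a frozen (toy) teacher's.
model = HyperINR()
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
images = torch.rand(4, 3, 32, 32)
z, recon = model(images)
loss = F.mse_loss(recon, images) + F.mse_loss(z, teacher(images).detach())
loss.backward()
```

A design point worth noting: because the hyper-network predicts INR weights in a single forward pass, reconstruction is amortized across images, avoiding the per-image optimization loop that fitting an INR from scratch would require, which is what makes the "fast, accurate reconstruction" claim plausible.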
Community
The following similar papers were recommended by the Semantic Scholar API:
- Next-Embedding Prediction Makes Strong Vision Learners (2025)
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)
- VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction (2025)
- In Pursuit of Pixel Supervision for Visual Pre-training (2025)
- Visual Generation Tuning (2025)
- Revisiting Multi-Task Visual Representation Learning (2026)
- Recurrent Video Masked Autoencoders (2025)