UPLiFT for DINOv3-S+/16

Figure: input image, base DINOv3 features, and UPLiFT upsampled features.

This is the official pretrained UPLiFT (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the DINOv3-S+/16 backbone.

UPLiFT is a lightweight method for upsampling features from pretrained vision backbones into pixel-dense feature maps. It uses Local Attenders to efficiently upsample low-resolution backbone features while preserving semantic information.

Model Details

Property Value
Backbone DINOv3-S+/16 (vit_small_plus_patch16_dinov3.lvd1689m)
Backbone Channels 384
Patch Size 16
Upsampling Factor 2x per iteration
Local Attender Size N=17
Training Dataset ImageNet
Training Image Size 448x448
License MIT

Installation

pip install 'uplift[vit] @ git+https://github.com/mwalmer-umd/UPLiFT.git'

Quick Start

import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16')

# Run inference
image = Image.open('your_image.jpg')
features = model(image)  # Returns pixel-dense features

Usage Options

Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 4):

# Fewer iterations = lower output resolution and lower memory usage
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16', iters=2)
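To see how the iteration count sets the output resolution: the backbone downsamples by the patch size (16), and each UPLiFT iteration doubles the spatial grid, so the output side length is (image_side / 16) * 2^iters. A quick sanity check for the 448x448 training size (plain arithmetic, not the library API):

```python
def output_grid(image_side, iters, patch=16, scale=2):
    """Side length of the upsampled feature grid (illustrative arithmetic)."""
    return (image_side // patch) * scale ** iters

for iters in range(5):
    print(iters, output_grid(448, iters))
# 0 28, 1 56, 2 112, 3 224, 4 448
```

At the default of 4 iterations, the 28x28 base grid reaches full 448x448 pixel density.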

Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module without the DINOv3 backbone:

model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       include_extractor=False)

Return Base Features

Get both upsampled and original backbone features:

model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       return_base_feat=True)
upsampled_features, base_features = model(image)

Architecture

UPLiFT consists of:

  1. Encoder: Processes the input image with a series of convolutional blocks, producing dense representations that guide feature upsampling
  2. Decoder: Upsamples features using transposed convolutions with bilinear residual connections
  3. Local Attender: A local-neighborhood-based attention pooling module that maintains semantic consistency with the original features

The model uses encoder sharing, meaning a single encoder pass is used across all upsampling iterations for efficiency.
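The Local Attender idea, where each output position attention-pools features only from a small spatial neighborhood, can be sketched in plain NumPy. This is an illustrative toy, not the actual module: the single dot-product head, the 3x3 window (the real model uses N=17), and the use of the feature map itself as the query are all assumptions.

```python
import numpy as np

def local_attention_pool(query, feats, window=3):
    """For each position, attention-pool features from a local window
    of the feature map (toy single-head dot-product attention)."""
    H, W, C = feats.shape
    r = window // 2
    padded = np.pad(feats, ((r, r), (r, r), (0, 0)), mode='edge')
    out = np.zeros_like(feats)
    for i in range(H):
        for j in range(W):
            neigh = padded[i:i + window, j:j + window].reshape(-1, C)
            scores = neigh @ query[i, j] / np.sqrt(C)
            w = np.exp(scores - scores.max())   # softmax over the window
            w /= w.sum()
            out[i, j] = w @ neigh               # convex combination of neighbors
    return out

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 16))
pooled = local_attention_pool(feats, feats, window=3)
print(pooled.shape)  # (8, 8, 16)
```

Because each output is a convex combination of nearby original features, this kind of pooling keeps the upsampled map semantically consistent with the base features.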

Intended Use

This model is designed for:

  • Creating pixel-dense feature maps from DINOv3 features
  • Dense prediction tasks (semantic segmentation, depth estimation, etc.)
  • Feature visualization and analysis
  • Research on vision foundation models

Limitations

  • Optimized specifically for DINOv3-S+/16 features; may not generalize to other backbones without retraining
  • Performance depends on the quality of the underlying DINOv3 features
  • Higher iteration counts increase computation time

Citation

If you use UPLiFT in your research, please cite our paper.

@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}

Acknowledgements

This work builds upon:
