# UPLiFT for DINOv3-S+/16
This is the official pretrained UPLiFT (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the DINOv3-S+/16 backbone.
UPLiFT is a lightweight method that upsamples features from pretrained vision backbones into pixel-dense feature maps. It uses Local Attenders to efficiently upsample low-resolution backbone features while preserving semantic information.
## Model Details
| Property | Value |
|---|---|
| Backbone | DINOv3-S+/16 (vit_small_plus_patch16_dinov3.lvd1689m) |
| Backbone Channels | 384 |
| Patch Size | 16 |
| Upsampling Factor | 2x per iteration |
| Local Attender Size | N=17 |
| Training Dataset | ImageNet |
| Training Image Size | 448x448 |
| License | MIT |
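As a quick sanity check on the numbers above: a 448x448 training image and a patch size of 16 give a 28x28 backbone feature grid, and four 2x upsampling iterations return it to full pixel resolution. A minimal sketch in plain Python arithmetic (illustration only, not part of the UPLiFT API):

```python
# Illustrative arithmetic based on the table above: a patch-16 backbone on a
# 448x448 image yields a 28x28 feature grid, and each UPLiFT iteration
# doubles the spatial resolution.
patch_size = 16
image_size = 448
res = image_size // patch_size  # 28x28 backbone feature grid

for iteration in range(1, 5):
    res *= 2
    print(f"after iteration {iteration}: {res}x{res}")
# 28 -> 56 -> 112 -> 224 -> 448, matching the input resolution
```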
## Links
- Paper: https://arxiv.org/abs/2601.17950
- GitHub: https://github.com/mwalmer-umd/UPLiFT
- Project Website: https://www.cs.umd.edu/~mwalmer/uplift/
## Installation

```bash
pip install 'uplift[vit] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```
## Quick Start

```python
import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16')

# Run inference
image = Image.open('your_image.jpg')
features = model(image)  # Returns pixel-dense features
```
## Usage Options
### Adjust Upsampling Iterations

Control the number of iterative 2x upsampling steps (default: 4):

```python
# Fewer iterations = lower memory usage (e.g. 2 instead of the default 4)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16', iters=2)
```
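To make the memory tradeoff concrete, here is a rough back-of-envelope estimate (illustration only, assuming dense float32 feature maps at the backbone's 384 channels and a 448x448 input; actual memory use depends on the implementation):

```python
# Rough, illustrative estimate of feature-map memory vs. iteration count:
# a dense float32 map with 384 channels at R x R resolution costs
# R * R * 384 * 4 bytes.
channels, bytes_per_float = 384, 4
base_res = 28  # 448x448 input / patch size 16

for iters in range(1, 5):
    res = base_res * 2 ** iters
    mib = res * res * channels * bytes_per_float / 2**20
    print(f"iters={iters}: {res}x{res} features, ~{mib:.1f} MiB")
```

Each extra iteration quadruples the size of the output feature map, which is why lowering `iters` is the first knob to turn when memory is tight.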
### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the DINOv3 backbone:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       include_extractor=False)
```
### Return Base Features

Get both the upsampled and the original backbone features:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       return_base_feat=True)
upsampled_features, base_features = model(image)
```
## Architecture
UPLiFT consists of:
- Encoder: Processes the input image with a series of convolutional blocks, producing dense representations that guide feature upsampling
- Decoder: Upsamples features using transposed convolutions with bilinear residual connections
- Local Attender: A local-neighborhood-based attention pooling module that maintains semantic consistency with the original features
The model uses encoder sharing, meaning a single encoder pass is used across all upsampling iterations for efficiency.
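The iteration scheme can be sketched in a few lines. The following is a shape-only NumPy illustration, not the actual implementation: nearest-neighbor upsampling stands in for the transposed-convolution decoder, and the Local Attender step is omitted.

```python
import numpy as np

def upsample_2x(feat):
    # Nearest-neighbor 2x upsampling: a crude stand-in for the decoder's
    # transposed convolution with bilinear residual connection.
    return np.kron(feat, np.ones((2, 2, 1)))

def uplift_sketch(base_feat, num_iters=4):
    # Encoder sharing: in the real model, guidance from a single encoder
    # pass over the input image would be computed here, once, and then
    # reused by the Local Attender inside every iteration (omitted).
    feat = base_feat
    for _ in range(num_iters):
        feat = upsample_2x(feat)
    return feat

base = np.zeros((28, 28, 384))  # DINOv3-S+/16 features for a 448x448 image
dense = uplift_sketch(base)
print(dense.shape)  # (448, 448, 384): pixel-dense resolution
```

The key point the sketch preserves is that the expensive image encoding sits outside the loop, so the per-iteration cost is only the upsampling step itself.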
## Intended Use
This model is designed for:
- Creating pixel-dense feature maps from DINOv3 features
- Dense prediction tasks (semantic segmentation, depth estimation, etc.)
- Feature visualization and analysis
- Research on vision foundation models
## Limitations
- Optimized specifically for DINOv3-S+/16 features; may not generalize to other backbones without retraining
- Performance depends on the quality of the underlying DINOv3 features
- Higher iteration counts increase computation time
## Citation

If you use UPLiFT in your research, please cite our paper:

```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```
## Acknowledgements

This work builds upon: