# UPLiFT for DINOv3-S+/16
This is the official pretrained UPLiFT (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the DINOv3-S+/16 backbone.
UPLiFT is a lightweight method that upsamples features from pretrained vision backbones into pixel-dense feature maps. It uses Local Attenders to efficiently upsample low-resolution backbone features while preserving semantic information.
## Model Details
| Property | Value |
|---|---|
| Backbone | DINOv3-S+/16 (vit_small_plus_patch16_dinov3.lvd1689m) |
| Backbone Channels | 384 |
| Patch Size | 16 |
| Upsampling Factor | 2x per iteration |
| Local Attender Size | N=17 |
| Training Dataset | ImageNet |
| Training Image Size | 448x448 |
| License | MIT |
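As a quick sanity check on the numbers above: a 448x448 training image and a patch size of 16 give a 28x28 backbone feature grid, and four 2x upsampling iterations return it to full pixel resolution. A minimal sketch in plain Python arithmetic (illustration only, not part of the UPLiFT API):

```python
# Illustrative arithmetic based on the table above: a patch-16 backbone on a
# 448x448 image yields a 28x28 feature grid, and each UPLiFT iteration
# doubles the spatial resolution.
patch_size = 16
image_size = 448
res = image_size // patch_size  # 28x28 backbone feature grid

for iteration in range(1, 5):
    res *= 2
    print(f"after iteration {iteration}: {res}x{res}")
# 28 -> 56 -> 112 -> 224 -> 448, matching the input resolution
```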
## Links
- Paper: https://arxiv.org/abs/2601.17950
- GitHub: https://github.com/mwalmer-umd/UPLiFT
- Project Website: https://www.cs.umd.edu/~mwalmer/uplift/
## Installation

```bash
pip install 'uplift[vit] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```
## Quick Start

```python
import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16')

# Run inference
image = Image.open('your_image.jpg')
features = model(image)  # Returns pixel-dense features
```
## Usage Options
### Adjust Upsampling Iterations

Control the number of iterative 2x upsampling steps (default: 4):

```python
# Fewer iterations = lower memory usage (e.g. 2 instead of the default 4)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16', iters=2)
```
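To make the memory tradeoff concrete, here is a rough back-of-envelope estimate (illustration only, assuming dense float32 feature maps at the backbone's 384 channels and a 448x448 input; actual memory use depends on the implementation):

```python
# Rough, illustrative estimate of feature-map memory vs. iteration count:
# a dense float32 map with 384 channels at R x R resolution costs
# R * R * 384 * 4 bytes.
channels, bytes_per_float = 384, 4
base_res = 28  # 448x448 input / patch size 16

for iters in range(1, 5):
    res = base_res * 2 ** iters
    mib = res * res * channels * bytes_per_float / 2**20
    print(f"iters={iters}: {res}x{res} features, ~{mib:.1f} MiB")
```

Each extra iteration quadruples the size of the output feature map, which is why lowering `iters` is the first knob to turn when memory is tight.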
### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the DINOv3 backbone:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       include_extractor=False)
```
### Return Base Features

Get both the upsampled and the original backbone features:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       return_base_feat=True)
upsampled_features, base_features = model(image)
```
## Architecture
UPLiFT consists of:
- Encoder: Processes the input image with a series of convolutional blocks, producing dense representations that guide feature upsampling
- Decoder: Upsamples features using transposed convolutions with bilinear residual connections
- Local Attender: A local-neighborhood-based attention pooling module that maintains semantic consistency with the original features
The model uses encoder sharing, meaning a single encoder pass is used across all upsampling iterations for efficiency.
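The iteration scheme can be sketched in a few lines. The following is a shape-only NumPy illustration, not the actual implementation: nearest-neighbor upsampling stands in for the transposed-convolution decoder, and the Local Attender step is omitted.

```python
import numpy as np

def upsample_2x(feat):
    # Nearest-neighbor 2x upsampling: a crude stand-in for the decoder's
    # transposed convolution with bilinear residual connection.
    return np.kron(feat, np.ones((2, 2, 1)))

def uplift_sketch(base_feat, num_iters=4):
    # Encoder sharing: in the real model, guidance from a single encoder
    # pass over the input image would be computed here, once, and then
    # reused by the Local Attender inside every iteration (omitted).
    feat = base_feat
    for _ in range(num_iters):
        feat = upsample_2x(feat)
    return feat

base = np.zeros((28, 28, 384))  # DINOv3-S+/16 features for a 448x448 image
dense = uplift_sketch(base)
print(dense.shape)  # (448, 448, 384): pixel-dense resolution
```

The key point the sketch preserves is that the expensive image encoding sits outside the loop, so the per-iteration cost is only the upsampling step itself.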
## Intended Use
This model is designed for:
- Creating pixel-dense feature maps from DINOv3 features
- Dense prediction tasks (semantic segmentation, depth estimation, etc.)
- Feature visualization and analysis
- Research on vision foundation models
## Limitations
- Optimized specifically for DINOv3-S+/16 features; may not generalize to other backbones without retraining
- Performance depends on the quality of the underlying DINOv3 features
- Higher iteration counts increase computation time
## Citation

If you use UPLiFT in your research, please cite our paper:

```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```
## Acknowledgements

This work builds upon: