# UPLiFT for Stable Diffusion 1.5 VAE
This is the official pretrained UPLiFT (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the Stable Diffusion 1.5 VAE encoder.
UPLiFT is a lightweight method to upscale features from pretrained vision backbones to create pixel-dense feature maps. When applied to the SD 1.5 VAE, it enables high-quality image upsampling by operating in the VAE's latent space.
## Model Details
| Property | Value |
|---|---|
| Backbone | Stable Diffusion 1.5 VAE (stable-diffusion-v1-5/stable-diffusion-v1-5) |
| Latent Channels | 4 |
| Patch Size | 8 |
| Upsampling Factor | 2x per iteration |
| Local Attender Size | N=17 |
| Training Dataset | Unsplash-Lite |
| Training Image Size | 1024x1024 |
| License | MIT |
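Assuming the patch size of 8 corresponds to the VAE's spatial downsampling factor, the latent grid size and the upsampled feature resolution can be worked out with simple arithmetic. The helper names below are illustrative, not part of the UPLiFT API:

```python
# Back-of-the-envelope sketch of feature-map shapes, assuming the
# patch size (8) equals the VAE's spatial downsampling factor and
# each UPLiFT iteration doubles the spatial resolution.

PATCH_SIZE = 8         # from the model details table
LATENT_CHANNELS = 4    # from the model details table
UPSAMPLE_PER_ITER = 2  # 2x per iteration

def latent_grid(image_size: int) -> int:
    """Spatial side of the VAE latent grid for a square image."""
    return image_size // PATCH_SIZE

def upsampled_grid(image_size: int, iters: int) -> int:
    """Spatial side of the feature map after `iters` UPLiFT iterations."""
    return latent_grid(image_size) * UPSAMPLE_PER_ITER ** iters

# Training image size 1024x1024 -> 128x128 latent grid with 4 channels
print(latent_grid(1024))        # 128
print(upsampled_grid(1024, 2))  # 512
```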
## Links
- Paper: https://arxiv.org/abs/2601.17950
- GitHub: https://github.com/mwalmer-umd/UPLiFT
- Project Website: https://www.cs.umd.edu/~mwalmer/uplift/
## Installation

```shell
pip install 'uplift[sd-vae] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```
## Quick Start

```python
import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae')

# Run inference - upsamples the image
image = Image.open('your_image.jpg')
upsampled_image = model(image)
```
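VAE encoders typically require spatial dimensions divisible by their downsampling factor (8 here, per the patch size in the table above). It is not stated whether the hub wrapper resizes inputs internally, so a defensive helper is sketched below; `snap_to_multiple` is illustrative, not part of the UPLiFT API:

```python
def snap_to_multiple(size, m=8):
    """Round a (width, height) pair down to multiples of m.

    VAE encoders typically require spatial sizes divisible by their
    downsampling factor; m=8 matches the patch size of this model.
    """
    w, h = size
    return (w // m) * m, (h // m) * m

print(snap_to_multiple((1023, 771)))  # (1016, 768)
```

An image can then be cropped before inference, e.g. `image = image.crop((0, 0, *snap_to_multiple(image.size)))`.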
## Usage Options

### Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 2 for VAE):

```python
# Fewer iterations = lower memory usage
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', iters=2)
```
### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the SD VAE:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae',
                       include_extractor=False)
```

Note: We do not recommend running the model this way, as extracting and handling features from a Diffusers pipeline VAE adds complexity and can introduce feature-handling errors. Running with the backbone included handles the features correctly.
## Architecture

This UPLiFT variant is designed specifically for VAE latent upsampling and includes:

- Encoder: Processes the input image through a series of convolutional blocks, producing dense representations that guide feature upsampling
- Decoder: Upsamples latent features, with noise channels concatenated for stochastic refinement
- Local Attender: A local-neighborhood attention pooling module that keeps the upsampled features semantically consistent with the original features
- Refiner: An additional 12-layer refinement block with noise injection that enhances output quality
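The noise channel concatenation used by the decoder and refiner can be illustrated in a few lines. This is a sketch, not the actual UPLiFT code; shapes follow the 4 latent channels from the table above and the 4 noise channels noted below:

```python
import numpy as np

# Sketch of noise channel concatenation for stochastic refinement.
# Not the actual UPLiFT implementation; shapes are illustrative.
B, C, H, W = 1, 4, 128, 128  # batch, latent channels, latent grid
NOISE_CHANNELS = 4           # number of concatenated noise channels

latent = np.random.randn(B, C, H, W).astype(np.float32)
noise = np.random.randn(B, NOISE_CHANNELS, H, W).astype(np.float32)

# The decoder/refiner would consume this 8-channel tensor.
decoder_input = np.concatenate([latent, noise], axis=1)
print(decoder_input.shape)  # (1, 8, 128, 128)
```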
Key differences from ViT-based UPLiFT models:
- Uses layer normalization instead of batch normalization
- Includes noise channel concatenation (4 channels) in decoder and refiner
- Features a dedicated refiner module for enhanced image quality
- Trained with latent-space noise augmentation
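A local attender can be sketched as attention pooling over a fixed local window. The implementation below is a simplified illustration in NumPy, not the released model's code (which uses N=17 and operates on batched tensors); function and variable names are ours:

```python
import numpy as np

def local_attender(queries, features, n=5):
    """Sketch of local-neighborhood attention pooling (not the UPLiFT code).

    queries:  (H, W, C) guidance features at the output resolution
    features: (H, W, C) backbone features resampled to the same grid
    n:        local window side (the released model uses N=17)

    Each output location attends over its n x n neighborhood of
    `features`, weighted by softmax similarity with its query.
    """
    H, W, C = queries.shape
    r = n // 2
    padded = np.pad(features, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros_like(queries)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + n, j:j + n].reshape(-1, C)  # (n*n, C)
            scores = window @ queries[i, j] / np.sqrt(C)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            out[i, j] = weights @ window  # convex combination of neighbors
    return out

feats = np.random.randn(8, 8, 4).astype(np.float32)
print(local_attender(feats, feats, n=5).shape)  # (8, 8, 4)
```

Because the output is a convex combination of local feature vectors, the pooled features stay within the range of the originals, which is one way to read the "semantic consistency" property described above.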
## Intended Use
This model is designed for:
- High-quality image upsampling using Stable Diffusion's VAE
- Super-resolution tasks
- Enhancing image resolution while preserving details
- Research on diffusion model components
## Limitations
- Optimized specifically for Stable Diffusion 1.5 VAE; may not work with other VAE architectures
- Output quality depends on the input image characteristics
- Requires more computation than simpler upsampling methods
- Best results achieved with images that match the training distribution (natural photographs)
## Citation

If you use UPLiFT in your research, please cite our paper:

```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```
## Acknowledgements
This work builds upon:
- Stable Diffusion by Stability AI and CompVis
- Diffusers by Hugging Face
- Unsplash for the training dataset

