# UPLiFT for Stable Diffusion 1.5 VAE
This is the official pretrained UPLiFT (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the Stable Diffusion 1.5 VAE encoder.
UPLiFT is a lightweight method to upscale features from pretrained vision backbones to create pixel-dense feature maps. When applied to the SD 1.5 VAE, it enables high-quality image upsampling by operating in the VAE's latent space.
## Model Details
| Property | Value |
|---|---|
| Backbone | Stable Diffusion 1.5 VAE (stable-diffusion-v1-5/stable-diffusion-v1-5) |
| Latent Channels | 4 |
| Patch Size | 8 |
| Upsampling Factor | 2x per iteration |
| Local Attender Size | N=17 |
| Training Dataset | Unsplash-Lite |
| Training Image Size | 1024x1024 |
| License | MIT |
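Assuming the patch size of 8 corresponds to the VAE's spatial downsampling factor, the latent grid size and the upsampled feature resolution can be worked out with simple arithmetic. The helper names below are illustrative, not part of the UPLiFT API:

```python
# Back-of-the-envelope sketch of feature-map shapes, assuming the
# patch size (8) equals the VAE's spatial downsampling factor and
# each UPLiFT iteration doubles the spatial resolution.

PATCH_SIZE = 8         # from the model details table
LATENT_CHANNELS = 4    # from the model details table
UPSAMPLE_PER_ITER = 2  # 2x per iteration

def latent_grid(image_size: int) -> int:
    """Spatial side of the VAE latent grid for a square image."""
    return image_size // PATCH_SIZE

def upsampled_grid(image_size: int, iters: int) -> int:
    """Spatial side of the feature map after `iters` UPLiFT iterations."""
    return latent_grid(image_size) * UPSAMPLE_PER_ITER ** iters

# Training image size 1024x1024 -> 128x128 latent grid with 4 channels
print(latent_grid(1024))        # 128
print(upsampled_grid(1024, 2))  # 512
```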
## Links
- Paper: https://arxiv.org/abs/2601.17950
- GitHub: https://github.com/mwalmer-umd/UPLiFT
- Project Website: https://www.cs.umd.edu/~mwalmer/uplift/
## Installation

```shell
pip install 'uplift[sd-vae] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```
## Quick Start

```python
import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae')

# Run inference - upsamples the image
image = Image.open('your_image.jpg')
upsampled_image = model(image)
```
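VAE encoders typically require spatial dimensions divisible by their downsampling factor (8 here, per the patch size in the table above). It is not stated whether the hub wrapper resizes inputs internally, so a defensive helper is sketched below; `snap_to_multiple` is illustrative, not part of the UPLiFT API:

```python
def snap_to_multiple(size, m=8):
    """Round a (width, height) pair down to multiples of m.

    VAE encoders typically require spatial sizes divisible by their
    downsampling factor; m=8 matches the patch size of this model.
    """
    w, h = size
    return (w // m) * m, (h // m) * m

print(snap_to_multiple((1023, 771)))  # (1016, 768)
```

An image can then be cropped before inference, e.g. `image = image.crop((0, 0, *snap_to_multiple(image.size)))`.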
## Usage Options

### Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 2 for VAE):

```python
# Fewer iterations = lower memory usage
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', iters=2)
```
### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the SD VAE:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae',
                       include_extractor=False)
```

Note: We do not recommend running the model this way, as extracting and handling features from a Diffusers pipeline VAE adds complexity and can introduce feature-handling errors. Running with the backbone included handles the features correctly.
## Architecture

This UPLiFT variant is designed specifically for VAE latent upsampling and includes:

- Encoder: Processes the input image through a series of convolutional blocks, producing dense representations that guide feature upsampling
- Decoder: Upsamples latent features, with noise channels concatenated for stochastic refinement
- Local Attender: A local-neighborhood attention pooling module that keeps the upsampled features semantically consistent with the original features
- Refiner: An additional 12-layer refinement block with noise injection that enhances output quality
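The noise channel concatenation used by the decoder and refiner can be illustrated in a few lines. This is a sketch, not the actual UPLiFT code; shapes follow the 4 latent channels from the table above and the 4 noise channels noted below:

```python
import numpy as np

# Sketch of noise channel concatenation for stochastic refinement.
# Not the actual UPLiFT implementation; shapes are illustrative.
B, C, H, W = 1, 4, 128, 128  # batch, latent channels, latent grid
NOISE_CHANNELS = 4           # number of concatenated noise channels

latent = np.random.randn(B, C, H, W).astype(np.float32)
noise = np.random.randn(B, NOISE_CHANNELS, H, W).astype(np.float32)

# The decoder/refiner would consume this 8-channel tensor.
decoder_input = np.concatenate([latent, noise], axis=1)
print(decoder_input.shape)  # (1, 8, 128, 128)
```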
Key differences from ViT-based UPLiFT models:
- Uses layer normalization instead of batch normalization
- Includes noise channel concatenation (4 channels) in decoder and refiner
- Features a dedicated refiner module for enhanced image quality
- Trained with latent-space noise augmentation
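A local attender can be sketched as attention pooling over a fixed local window. The implementation below is a simplified illustration in NumPy, not the released model's code (which uses N=17 and operates on batched tensors); function and variable names are ours:

```python
import numpy as np

def local_attender(queries, features, n=5):
    """Sketch of local-neighborhood attention pooling (not the UPLiFT code).

    queries:  (H, W, C) guidance features at the output resolution
    features: (H, W, C) backbone features resampled to the same grid
    n:        local window side (the released model uses N=17)

    Each output location attends over its n x n neighborhood of
    `features`, weighted by softmax similarity with its query.
    """
    H, W, C = queries.shape
    r = n // 2
    padded = np.pad(features, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros_like(queries)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + n, j:j + n].reshape(-1, C)  # (n*n, C)
            scores = window @ queries[i, j] / np.sqrt(C)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            out[i, j] = weights @ window  # convex combination of neighbors
    return out

feats = np.random.randn(8, 8, 4).astype(np.float32)
print(local_attender(feats, feats, n=5).shape)  # (8, 8, 4)
```

Because the output is a convex combination of local feature vectors, the pooled features stay within the range of the originals, which is one way to read the "semantic consistency" property described above.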
## Intended Use
This model is designed for:
- High-quality image upsampling using Stable Diffusion's VAE
- Super-resolution tasks
- Enhancing image resolution while preserving details
- Research on diffusion model components
## Limitations
- Optimized specifically for Stable Diffusion 1.5 VAE; may not work with other VAE architectures
- Output quality depends on the input image characteristics
- Requires more computation than simpler upsampling methods
- Best results achieved with images that match the training distribution (natural photographs)
## Citation

If you use UPLiFT in your research, please cite our paper:

```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```
## Acknowledgements
This work builds upon:
- Stable Diffusion by Stability AI and CompVis
- Diffusers by Hugging Face
- Unsplash for the training dataset

