UPLiFT for Stable Diffusion 1.5 VAE

[Figure: input image alongside its UPLiFT upsampled output]

This is the official pretrained UPLiFT (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the Stable Diffusion 1.5 VAE encoder.

UPLiFT is a lightweight method to upscale features from pretrained vision backbones to create pixel-dense feature maps. When applied to the SD 1.5 VAE, it enables high-quality image upsampling by operating in the VAE's latent space.

Model Details

Property              Value
Backbone              Stable Diffusion 1.5 VAE (stable-diffusion-v1-5/stable-diffusion-v1-5)
Latent Channels       4
Patch Size            8
Upsampling Factor     2x per iteration
Local Attender Size   N=17
Training Dataset      Unsplash-Lite
Training Image Size   1024x1024
License               MIT

Installation

pip install 'uplift[sd-vae] @ git+https://github.com/mwalmer-umd/UPLiFT.git'

Quick Start

import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae')

# Run inference - upsamples the image
image = Image.open('your_image.jpg')
upsampled_image = model(image)

Usage Options

Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 2 for VAE):

# Each iteration upsamples 2x; fewer iterations reduce memory usage
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', iters=1)
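Because each iteration upsamples by 2x, the iterations compound multiplicatively: the default of 2 iterations gives a 4x overall upscale. A small sketch of the arithmetic (the helper name `overall_scale` is illustrative, not part of the UPLiFT API):

```python
def overall_scale(iters: int) -> int:
    # Each iteration upsamples by 2x, so the total factor is 2**iters.
    return 2 ** iters

# The VAE model defaults to iters=2, i.e. a 4x overall upscale.
print(overall_scale(2))  # 4
```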

Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module without the SD VAE:

model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae',
                       include_extractor=False)

Note: We do not recommend running the model this way. Extracting and handling features from a Diffusers pipeline VAE yourself can easily introduce feature-handling errors; loading the model with the backbone included handles the features correctly.

Architecture

This UPLiFT variant is specifically designed for VAE latent upsampling and includes:

  1. Encoder: Processes the input image with a series of convolutional blocks to create dense representations to guide feature upsampling
  2. Decoder: Upsamples latent features with noise channel concatenation for stochastic refinement
  3. Local Attender: A local-neighborhood-based attention pooling module that maintains semantic consistency with the original features
  4. Refiner: An additional 12-layer refinement block with noise injection that enhances output quality
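To illustrate the Local Attender's local-neighborhood attention pooling, here is a minimal, hedged sketch built on `torch.nn.functional.unfold` with an N=17 window. The attention scores are random placeholders and the shapes are assumptions for illustration; this is not the actual UPLiFT implementation, only the pooling pattern it describes.

```python
import torch
import torch.nn.functional as F

# Toy feature map: batch 1, 4 latent channels, 32x32 spatial grid.
feats = torch.randn(1, 4, 32, 32)
N = 17  # local attender window size

# Gather each position's NxN neighborhood: (1, 4*N*N, 32*32).
patches = F.unfold(feats, kernel_size=N, padding=N // 2)
patches = patches.view(1, 4, N * N, 32 * 32)

# Placeholder attention scores over each neighborhood (softmax per position).
attn = torch.softmax(torch.randn(1, 1, N * N, 32 * 32), dim=2)

# Attention-weighted pooling over the neighborhood, back to (1, 4, 32, 32).
pooled = (patches * attn).sum(dim=2).view(1, 4, 32, 32)
print(pooled.shape)
```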

Key differences from ViT-based UPLiFT models:

  • Uses layer normalization instead of batch normalization
  • Includes noise channel concatenation (4 channels) in decoder and refiner
  • Features a dedicated refiner module for enhanced image quality
  • Trained with latent-space noise augmentation
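The 4-channel noise concatenation can be sketched as follows. This is an assumption-laden illustration of the mechanism described above (appending noise channels to the 4-channel VAE latent before decoding), not the actual decoder code:

```python
import torch

# SD 1.5 VAE latent: 4 channels at 1/8 the image resolution.
latent = torch.randn(1, 4, 64, 64)

# 4 noise channels for stochastic refinement, matching the latent's spatial size.
noise = torch.randn(1, 4, 64, 64)

# Channel-wise concatenation yields an 8-channel decoder input.
decoder_input = torch.cat([latent, noise], dim=1)
print(decoder_input.shape)  # torch.Size([1, 8, 64, 64])
```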

Intended Use

This model is designed for:

  • High-quality image upsampling using Stable Diffusion's VAE
  • Super-resolution tasks
  • Enhancing image resolution while preserving details
  • Research on diffusion model components

Limitations

  • Optimized specifically for Stable Diffusion 1.5 VAE; may not work with other VAE architectures
  • Output quality depends on the input image characteristics
  • Requires more computation than simpler upsampling methods
  • Best results achieved with images that match the training distribution (natural photographs)

Citation

If you use UPLiFT in your research, please cite our paper.

@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}

Acknowledgements

This work builds upon:
