---
license: mit
library_name: pytorch
tags:
- image-feature-extraction
- feature-upsampling
- pixel-dense-features
- computer-vision
- stable-diffusion
- vae
- image-upsampling
- uplift
datasets:
- unsplash/lite
---
# UPLiFT for Stable Diffusion 1.5 VAE
| Input Image | UPLiFT Upsampled Output |
|:-----------:|:-----------------------:|
| ![Input](Gigi_3_512.png) | ![UPLiFT Output](Gigi_3_512.png_uplift_sd1.5vae-2.png) |
This is the official pretrained **UPLiFT** (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the **Stable Diffusion 1.5 VAE** encoder.
UPLiFT is a lightweight method for upsampling features from pretrained vision backbones into pixel-dense feature maps. Applied to the SD 1.5 VAE, it enables high-quality image upsampling by operating in the VAE's latent space.
## Model Details
| Property | Value |
|----------|-------|
| **Backbone** | Stable Diffusion 1.5 VAE (`stable-diffusion-v1-5/stable-diffusion-v1-5`) |
| **Latent Channels** | 4 |
| **Patch Size** | 8 |
| **Upsampling Factor** | 2x per iteration |
| **Local Attender Size** | N=17 |
| **Training Dataset** | Unsplash-Lite |
| **Training Image Size** | 1024x1024 |
| **License** | MIT |
## Links
- **Paper**: [https://arxiv.org/abs/2601.17950](https://arxiv.org/abs/2601.17950)
- **GitHub**: [https://github.com/mwalmer-umd/UPLiFT](https://github.com/mwalmer-umd/UPLiFT)
- **Project Website**: [https://www.cs.umd.edu/~mwalmer/uplift/](https://www.cs.umd.edu/~mwalmer/uplift/)
## Installation
```bash
pip install 'uplift[sd-vae] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```
## Quick Start
```python
import torch
from PIL import Image

# Load the model (weights are downloaded automatically from Hugging Face)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae')

# Run inference: returns an upsampled version of the input image
image = Image.open('your_image.jpg')
upsampled_image = model(image)
```
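If you have a GPU, the loaded module can typically be treated like a standard `torch.nn.Module` (a hedged sketch; check the repository if the hub wrapper handles devices differently):
```python
# Assumes the hub wrapper behaves like a standard torch.nn.Module; consult
# the GitHub repo if device handling differs.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()
with torch.no_grad():  # inference only, no gradients needed
    upsampled_image = model(image)
```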
## Usage Options
### Adjust Upsampling Iterations
Control the number of iterative 2x upsampling steps (the VAE default is 2, i.e. 4x total):
```python
# Fewer iterations = lower memory usage
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', iters=2)
```
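Since each iteration doubles the resolution, fewer iterations yield a smaller total factor. A hedged example, assuming `iters=1` is a supported value:
```python
# Hypothetical: one iteration = 2x total upsampling (the default iters=2 gives 4x)
model_2x = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', iters=1)
```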
### Raw UPLiFT Model (Without Backbone)
Load only the UPLiFT upsampling module without the SD VAE:
```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae',
                       include_extractor=False)
```
**Note:** We do not recommend running the model in this way, as the added complexity of extracting and using features from a Diffusers pipeline VAE can introduce errors in feature handling. Running with the backbone included will handle the features correctly.
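For context, the feature handling the note refers to includes the latent geometry and scaling conventions of the Diffusers VAE. A minimal illustration of what a correct extraction pipeline has to get right (the shapes and scaling factor below are standard for SD 1.5, but how UPLiFT expects them is not documented here):
```python
import torch
from diffusers import AutoencoderKL

# SD 1.5 latents are (B, 4, H/8, W/8); a 512x512 image encodes to (1, 4, 64, 64).
vae = AutoencoderKL.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5', subfolder='vae'
)
image = torch.rand(1, 3, 512, 512) * 2 - 1        # VAE expects [-1, 1] inputs
latents = vae.encode(image).latent_dist.sample()  # shape (1, 4, 64, 64)
latents = latents * vae.config.scaling_factor     # 0.18215 for SD 1.5
```
Getting any of these conventions wrong silently corrupts the features, which is why the backbone-included path above is recommended.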
## Architecture
This UPLiFT variant is specifically designed for VAE latent upsampling and includes:
1. **Encoder**: Processes the input image through a series of convolutional blocks, producing dense representations that guide feature upsampling
2. **Decoder**: Upsamples the latent features, with noise channels concatenated for stochastic refinement
3. **Local Attender**: A local-neighborhood attention pooling module that keeps the upsampled features semantically consistent with the original features
4. **Refiner**: An additional 12-layer refinement block with noise injection that further enhances output quality
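As a rough illustration of this data flow, here is a minimal, hypothetical PyTorch sketch of a single 2x iteration. Layer sizes and the fusion scheme are invented for readability, and the local attender and refiner are omitted; see the GitHub repository for the real modules:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpliftIterationSketch(nn.Module):
    """Toy sketch of one 2x UPLiFT iteration; not the actual architecture."""
    def __init__(self, latent_ch=4, guide_ch=32, noise_ch=4):
        super().__init__()
        self.noise_ch = noise_ch
        # Encoder: conv blocks turning the image into dense guidance features
        self.encoder = nn.Sequential(
            nn.Conv2d(3, guide_ch, 3, padding=1), nn.GELU(),
            nn.Conv2d(guide_ch, guide_ch, 3, padding=1),
        )
        # Decoder: predicts the 2x latent from upsampled latent + noise + guidance
        self.decoder = nn.Sequential(
            nn.Conv2d(latent_ch + noise_ch + guide_ch, 64, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(64, latent_ch, 3, padding=1),
        )

    def forward(self, latent, image):
        b, _, h, w = latent.shape
        # Guidance features, resized to this iteration's 2x target resolution
        guide = self.encoder(image)
        guide = F.interpolate(guide, size=(2 * h, 2 * w), mode='bilinear',
                              align_corners=False)
        # Start from a nearest-neighbor 2x latent, then concatenate noise
        # channels for stochastic refinement
        up = F.interpolate(latent, scale_factor=2.0, mode='nearest')
        noise = torch.randn(b, self.noise_ch, 2 * h, 2 * w, device=latent.device)
        return self.decoder(torch.cat([up, noise, guide], dim=1))

sketch = UpliftIterationSketch()
latent = torch.randn(1, 4, 64, 64)   # SD 1.5 latent of a 512x512 image
image = torch.randn(1, 3, 512, 512)
print(sketch(latent, image).shape)   # torch.Size([1, 4, 128, 128])
```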
Key differences from ViT-based UPLiFT models:
- Uses layer normalization instead of batch normalization
- Includes noise channel concatenation (4 channels) in decoder and refiner
- Features a dedicated refiner module for enhanced image quality
- Trained with latent-space noise augmentation
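As a concrete (hypothetical) illustration of the last point, latent-space noise augmentation amounts to perturbing the low-resolution latent during training; the actual noise schedule is defined in the repository:
```python
# Assumed noise scale for illustration; the real schedule lives in the repo
noisy_latent = latent + 0.05 * torch.randn_like(latent)
```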
## Intended Use
This model is designed for:
- High-quality image upsampling using Stable Diffusion's VAE
- Super-resolution tasks
- Enhancing image resolution while preserving details
- Research on diffusion model components
## Limitations
- Optimized specifically for Stable Diffusion 1.5 VAE; may not work with other VAE architectures
- Output quality depends on the input image characteristics
- Requires more computation than simpler upsampling methods
- Best results achieved with images that match the training distribution (natural photographs)
## Citation
If you use UPLiFT in your research, please cite our paper.
```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```
## Acknowledgements
This work builds upon:
- [Stable Diffusion](https://github.com/CompVis/stable-diffusion) by Stability AI and CompVis
- [Diffusers](https://github.com/huggingface/diffusers) by Hugging Face
- [Unsplash](https://unsplash.com/) for the training dataset