---
license: mit
library_name: pytorch
tags:
- image-feature-extraction
- feature-upsampling
- pixel-dense-features
- computer-vision
- stable-diffusion
- vae
- image-upsampling
- uplift
datasets:
- unsplash/lite
---

# UPLiFT for Stable Diffusion 1.5 VAE

| Input Image | UPLiFT Upsampled Output |
|:-----------:|:-----------------------:|
|  |  |

This is the official pretrained **UPLiFT** (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the **Stable Diffusion 1.5 VAE** encoder.

UPLiFT is a lightweight method for upsampling features from pretrained vision backbones into pixel-dense feature maps. Applied to the SD 1.5 VAE, it enables high-quality image upsampling by operating in the VAE's latent space.

## Model Details

| Property | Value |
|----------|-------|
| **Backbone** | Stable Diffusion 1.5 VAE (`stable-diffusion-v1-5/stable-diffusion-v1-5`) |
| **Latent Channels** | 4 |
| **Patch Size** | 8 |
| **Upsampling Factor** | 2x per iteration |
| **Local Attender Size** | N=17 |
| **Training Dataset** | Unsplash-Lite |
| **Training Image Size** | 1024x1024 |
| **License** | MIT |

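The patch size and per-iteration factor in the table combine multiplicatively: the VAE encoder downsamples by the patch size, and each UPLiFT iteration doubles the feature resolution. A quick illustrative sanity check (values taken from the table above; the variable names are ours, not part of the UPLiFT API):

```python
# Illustrative arithmetic only, using the values from the Model Details table.
patch_size = 8     # SD 1.5 VAE downsampling factor
iters = 2          # default number of 2x upsampling iterations for the VAE model
image_size = 1024  # training image size

latent_size = image_size // patch_size     # 1024 -> 128 (latent resolution)
upsampled_size = latent_size * 2 ** iters  # 128 -> 512 (after 2 iterations of 2x)

print(latent_size, upsampled_size)  # 128 512
```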
## Links

- **Paper**: [https://arxiv.org/abs/2601.17950](https://arxiv.org/abs/2601.17950)
- **GitHub**: [https://github.com/mwalmer-umd/UPLiFT](https://github.com/mwalmer-umd/UPLiFT)
- **Project Website**: [https://www.cs.umd.edu/~mwalmer/uplift/](https://www.cs.umd.edu/~mwalmer/uplift/)

## Installation

```bash
pip install 'uplift[sd-vae] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```

## Quick Start

```python
import torch
from PIL import Image

# Load the model (weights auto-download from the Hugging Face Hub)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae')

# Run inference to upsample the image
image = Image.open('your_image.jpg')
upsampled_image = model(image)
```

## Usage Options

### Adjust Upsampling Iterations

Control the number of iterative 2x upsampling steps (default: 2 for the VAE model):

```python
# Fewer iterations = lower memory usage (and a smaller overall upsampling factor)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', iters=1)
```

### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the SD VAE:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae',
                       include_extractor=False)
```

**Note:** We do not recommend running the model this way: extracting and handling features from a Diffusers pipeline VAE yourself adds complexity and can introduce feature-handling errors. Loading the model with the backbone included handles the features correctly.

## Architecture

This UPLiFT variant is designed specifically for VAE latent upsampling and includes:

1. **Encoder**: Processes the input image with a series of convolutional blocks, producing dense representations that guide feature upsampling
2. **Decoder**: Upsamples latent features, with noise channels concatenated for stochastic refinement
3. **Local Attender**: A local-neighborhood attention pooling module that keeps the upsampled features semantically consistent with the original features
4. **Refiner**: An additional 12-layer refinement block with noise injection that enhances output quality

Key differences from the ViT-based UPLiFT models:

- Uses layer normalization instead of batch normalization
- Concatenates noise channels (4 channels) in the decoder and refiner
- Adds a dedicated refiner module for enhanced image quality
- Trained with latent-space noise augmentation

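The noise-channel concatenation described above can be sketched at the shape level. This is an illustrative example under our own assumptions about layer sizes (a single 3x3 conv standing in for a decoder block), not the released UPLiFT code:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: concatenate 4 noise channels onto the
# 4-channel SD 1.5 latent before a conv layer, as described for the
# decoder and refiner above. Layer sizes here are assumptions.
latent = torch.randn(1, 4, 32, 32)     # SD 1.5 VAE latent (4 channels)
noise = torch.randn_like(latent)       # 4 extra noise channels
x = torch.cat([latent, noise], dim=1)  # shape: (1, 8, 32, 32)

# A stand-in for one decoder conv consuming the widened input
conv = nn.Conv2d(in_channels=8, out_channels=4, kernel_size=3, padding=1)
out = conv(x)                          # back to 4 latent channels

print(tuple(out.shape))  # (1, 4, 32, 32)
```

Because the noise is resampled on every forward pass, repeated calls with the same latent yield slightly different refinements, which is the stochastic-refinement behavior the list above refers to.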
## Intended Use

This model is designed for:

- High-quality image upsampling using Stable Diffusion's VAE
- Super-resolution tasks
- Enhancing image resolution while preserving details
- Research on diffusion model components

## Limitations

- Optimized specifically for the Stable Diffusion 1.5 VAE; may not work with other VAE architectures
- Output quality depends on the characteristics of the input image
- Requires more computation than simpler upsampling methods
- Best results are achieved with images that match the training distribution (natural photographs)

## Citation

If you use UPLiFT in your research, please cite our paper:

```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```

## Acknowledgements

This work builds upon:

- [Stable Diffusion](https://github.com/CompVis/stable-diffusion) by Stability AI and CompVis
- [Diffusers](https://github.com/huggingface/diffusers) by Hugging Face
- [Unsplash](https://unsplash.com/) for the training dataset