---
license: mit
library_name: pytorch
tags:
- image-feature-extraction
- feature-upsampling
- pixel-dense-features
- computer-vision
- stable-diffusion
- vae
- image-upsampling
- uplift
datasets:
- unsplash/lite
---

# UPLiFT for Stable Diffusion 1.5 VAE

| Input Image | UPLiFT Upsampled Output |
|:-----------:|:-----------------------:|
|  |  |

This is the official pretrained **UPLiFT** (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the **Stable Diffusion 1.5 VAE** encoder.

UPLiFT is a lightweight method for upsampling features from pretrained vision backbones into pixel-dense feature maps. Applied to the SD 1.5 VAE, it enables high-quality image upsampling by operating in the VAE's latent space.

## Model Details

| Property | Value |
|----------|-------|
| **Backbone** | Stable Diffusion 1.5 VAE (`stable-diffusion-v1-5/stable-diffusion-v1-5`) |
| **Latent Channels** | 4 |
| **Patch Size** | 8 |
| **Upsampling Factor** | 2x per iteration |
| **Local Attender Size** | N=17 |
| **Training Dataset** | Unsplash-Lite |
| **Training Image Size** | 1024x1024 |
| **License** | MIT |

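The patch size and per-iteration factor in the table combine multiplicatively: the VAE encoder downsamples by the patch size, and each UPLiFT iteration doubles the feature resolution. A quick illustrative sanity check (values taken from the table above; the variable names are ours, not part of the UPLiFT API):

```python
# Illustrative arithmetic only, using the values from the Model Details table.
patch_size = 8     # SD 1.5 VAE downsampling factor
iters = 2          # default number of 2x upsampling iterations for the VAE model
image_size = 1024  # training image size

latent_size = image_size // patch_size     # 1024 -> 128 (latent resolution)
upsampled_size = latent_size * 2 ** iters  # 128 -> 512 (after 2 iterations of 2x)

print(latent_size, upsampled_size)  # 128 512
```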
## Links

- **Paper**: [https://arxiv.org/abs/2601.17950](https://arxiv.org/abs/2601.17950)
- **GitHub**: [https://github.com/mwalmer-umd/UPLiFT](https://github.com/mwalmer-umd/UPLiFT)
- **Project Website**: [https://www.cs.umd.edu/~mwalmer/uplift/](https://www.cs.umd.edu/~mwalmer/uplift/)

## Installation

```bash
pip install 'uplift[sd-vae] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```

## Quick Start

```python
import torch
from PIL import Image

# Load the model (weights auto-download from the Hugging Face Hub)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae')

# Run inference to upsample the image
image = Image.open('your_image.jpg')
upsampled_image = model(image)
```

## Usage Options

### Adjust Upsampling Iterations

Control the number of iterative 2x upsampling steps (default: 2 for the VAE model):

```python
# Fewer iterations = lower memory usage (and a smaller overall upsampling factor)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', iters=1)
```

### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the SD VAE:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae',
                       include_extractor=False)
```

**Note:** We do not recommend running the model this way: extracting and handling features from a Diffusers pipeline VAE yourself adds complexity and can introduce feature-handling errors. Loading the model with the backbone included handles the features correctly.

## Architecture

This UPLiFT variant is designed specifically for VAE latent upsampling and includes:

1. **Encoder**: Processes the input image with a series of convolutional blocks, producing dense representations that guide feature upsampling
2. **Decoder**: Upsamples latent features, with noise channels concatenated for stochastic refinement
3. **Local Attender**: A local-neighborhood attention pooling module that keeps the upsampled features semantically consistent with the original features
4. **Refiner**: An additional 12-layer refinement block with noise injection that enhances output quality

Key differences from the ViT-based UPLiFT models:

- Uses layer normalization instead of batch normalization
- Concatenates noise channels (4 channels) in the decoder and refiner
- Adds a dedicated refiner module for enhanced image quality
- Trained with latent-space noise augmentation

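The noise-channel concatenation described above can be sketched at the shape level. This is an illustrative example under our own assumptions about layer sizes (a single 3x3 conv standing in for a decoder block), not the released UPLiFT code:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: concatenate 4 noise channels onto the
# 4-channel SD 1.5 latent before a conv layer, as described for the
# decoder and refiner above. Layer sizes here are assumptions.
latent = torch.randn(1, 4, 32, 32)     # SD 1.5 VAE latent (4 channels)
noise = torch.randn_like(latent)       # 4 extra noise channels
x = torch.cat([latent, noise], dim=1)  # shape: (1, 8, 32, 32)

# A stand-in for one decoder conv consuming the widened input
conv = nn.Conv2d(in_channels=8, out_channels=4, kernel_size=3, padding=1)
out = conv(x)                          # back to 4 latent channels

print(tuple(out.shape))  # (1, 4, 32, 32)
```

Because the noise is resampled on every forward pass, repeated calls with the same latent yield slightly different refinements, which is the stochastic-refinement behavior the list above refers to.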
## Intended Use

This model is designed for:

- High-quality image upsampling using Stable Diffusion's VAE
- Super-resolution tasks
- Enhancing image resolution while preserving details
- Research on diffusion model components

## Limitations

- Optimized specifically for the Stable Diffusion 1.5 VAE; may not work with other VAE architectures
- Output quality depends on the characteristics of the input image
- Requires more computation than simpler upsampling methods
- Best results are achieved with images that match the training distribution (natural photographs)

## Citation

If you use UPLiFT in your research, please cite our paper:

```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```

## Acknowledgements

This work builds upon:

- [Stable Diffusion](https://github.com/CompVis/stable-diffusion) by Stability AI and CompVis
- [Diffusers](https://github.com/huggingface/diffusers) by Hugging Face
- [Unsplash](https://unsplash.com/) for the training dataset