Improve model card: Add pipeline tag, library name, abstract, and usage instructions
This PR significantly improves the model card by:
- Adding `pipeline_tag: image-to-image` and `library_name: diffusers` to the metadata, enhancing discoverability on the Hub and enabling the Hugging Face `use this model` widget.
- Including a link to the paper [Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training](https://huggingface.co/papers/2507.20291) and the official GitHub repository [https://github.com/Joyies/TVT](https://github.com/Joyies/TVT).
- Incorporating the paper's abstract for a quick overview of the model's capabilities.
- Providing detailed 'Quick Inference' instructions, including installation steps and example commands from the project's GitHub README, to make the model easy to try.
README.md
CHANGED
---
license: apache-2.0
pipeline_tag: image-to-image
library_name: diffusers
---

# Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training

This repository contains the official model weights for the paper [Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training](https://huggingface.co/papers/2507.20291), which introduces the **Transfer VAE Training (TVT)** strategy for real-world image super-resolution.

The official code is available at: [https://github.com/Joyies/TVT](https://github.com/Joyies/TVT)

<div align="center">
<img src="https://huggingface.co/Joypop/TVTSR/resolve/main/assets/teaser.png" width="100%"/>
</div>

## Abstract

Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features to the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and a compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models.

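The two-stage transfer described in the abstract can be illustrated with a toy linear autoencoder. This is a hypothetical sketch for intuition only, not the paper's training code (which operates on convolutional VAEs): stage 1 fits a new decoder to the frozen original encoder's latents, and stage 2 freezes that decoder and fits a new encoder. In this linear setting the transferred pair reconstructs the input exactly while the new encoder reproduces the original latents, mirroring the latent-space alignment that TVT aims for.

```python
import numpy as np

rng = np.random.default_rng(0)
d_pix, d_lat, n = 16, 8, 200

# Toy "images" confined to a low-dimensional subspace, so a rank-8
# latent code can represent them losslessly.
X = rng.standard_normal((d_pix, d_lat)) @ rng.standard_normal((d_lat, n))

# Frozen original encoder (stand-in for the pre-trained SD VAE encoder).
E_orig = rng.standard_normal((d_lat, d_pix)) / np.sqrt(d_pix)
Z = E_orig @ X  # latents of the original VAE

# Stage 1: fit a new decoder on the ORIGINAL encoder's outputs:
# minimize ||D @ Z - X||_F by least squares.
D_new = np.linalg.lstsq(Z.T, X.T, rcond=None)[0].T

# Stage 2: freeze D_new and fit a new encoder so that D_new @ E @ X ~ X;
# in this linear toy the minimizer has a closed form via pseudo-inverses.
E_new = np.linalg.pinv(D_new) @ X @ np.linalg.pinv(X)

recon = D_new @ E_new @ X
print(np.abs(recon - X).max())      # ~0: reconstruction is (numerically) exact
print(np.abs(E_new @ X - Z).max())  # ~0: new latents match the original ones
```

The second print is the point of the exercise: because the decoder was trained on the original encoder's outputs first, the new encoder is pulled back into the original latent space rather than drifting to an arbitrary one.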
## Quick Inference

This section describes how to run inference with the pre-trained models.

### Installation

First, clone the repository and set up the environment:

```shell
git clone https://github.com/Joyies/TVT.git
cd TVT

# create an environment
conda create -n TVT python=3.10
conda activate TVT
pip install --upgrade pip
pip install -r requirements.txt
```

### Step 1: Download the pretrained models

- Download the pretrained SD-2.1-base model from [Hugging Face SD 2.1 Base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base).
- Download the model weights ([VAED4](https://huggingface.co/Joypop/TVTSR/tree/main/ckp), [TVT model](https://huggingface.co/Joypop/TVTSR/tree/main/ckp), [TVTUNet](https://huggingface.co/Joypop/TVTSR/tree/main/ckp), [DAPE](https://huggingface.co/Joypop/TVTSR/tree/main/ckp), and [RAM](https://huggingface.co/Joypop/TVTSR/tree/main/ckp)) from [Hugging Face model weights](https://huggingface.co/Joypop/TVTSR/tree/main) and put them into the `ckp/` directory of your local `TVT` repository.

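Instead of fetching the TVT files by hand, the whole `ckp/` folder can be pulled in one call with `huggingface_hub` (installed alongside `diffusers`). A hypothetical convenience helper, assuming the checkpoints live under `ckp/` in the `Joypop/TVTSR` repo as linked above:

```python
def fetch_checkpoints(local_dir="TVT"):
    """Download every file under ckp/ from the Joypop/TVTSR repo."""
    # Imported lazily so the function is importable without network access.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="Joypop/TVTSR",
        allow_patterns=["ckp/*"],  # skip everything outside ckp/
        local_dir=local_dir,
    )
```

Call `fetch_checkpoints()` from the directory containing your `TVT` clone; the SD-2.1-base weights from the first bullet still need to be downloaded separately.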
### Step 2: Prepare testing data and run testing command

Set `input_path` and `output_path` before running the testing command: `input_path` is the path to your test image, and `output_path` is the directory where the output images will be saved.

```bash
python TVT/inferences/inference.py \
    --input_image input_path \
    --output_dir output_path \
    --pretrained_path ckp/model_TVT.pkl \
    --pretrained_model_name_or_path stabilityai/stable-diffusion-2-1-base \
    --pretrained_unet_path ckp/TVTUNet \
    --vae4d_path ckp/vae.ckpt \
    --ram_ft_path ckp/DAPE.pth \
    --negprompt 'dotted, noise, blur, lowres, smooth' \
    --prompt 'clean, high-resolution, 8k' \
    --upscale 4 \
    --time_step 1
```

Alternatively, you can use the provided script:

```bash
bash scripts/test/test_realsr.sh
```

We also provide a tiled inference mode to save GPU memory. You can run the following command and adjust the `--tiled_size` and `--tiled_overlap` parameters based on your device's VRAM.

```bash
python TVT/inferences/inference_tile.py \
    --input_image input_path \
    --output_dir output_path \
    --pretrained_path ckp/model_TVT.pkl \
    --pretrained_model_name_or_path stabilityai/stable-diffusion-2-1-base \
    --pretrained_unet_path ckp/TVTUNet \
    --vae4d_path ckp/vae.ckpt \
    --ram_ft_path ckp/DAPE.pth \
    --negprompt 'dotted, noise, blur, lowres, smooth' \
    --prompt 'clean, high-resolution, 8k' \
    --upscale 4 \
    --time_step 1 \
    --tiled_size 96 \
    --tiled_overlap 32
```

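Tiled inference splits the input into overlapping tiles, runs the model on each tile independently, and blends the results back together. A small illustrative helper (hypothetical, not the repo's implementation) shows how a tile size of 96 and an overlap of 32 partition one image axis:

```python
# Hypothetical helper (not the repo's code): start offsets of overlapping
# tiles of width `tile` covering an axis of length `length`.
def tile_starts(length, tile, overlap):
    stride = tile - overlap
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:  # the last tile must reach the edge
        starts.append(length - tile)
    return starts

# --tiled_size 96 with --tiled_overlap 32 on a 256-pixel axis:
print(tile_starts(256, 96, 32))  # [0, 64, 128, 160]
```

Each pair of neighbouring tiles shares a 32-pixel band, and blending the model outputs over that band hides seams; smaller tiles lower peak VRAM at the cost of more forward passes.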
## Citation

If our code helps your research or work, please consider citing our paper:

```bibtex
@inproceedings{yi2025fine,
  title={Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training},
  author={Yi, Qiaosi and Li, Shuai and Wu, Rongyuan and Sun, Lingchen and Wu, Yuhui and Zhang, Lei},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```

## License

This project is released under the [Apache 2.0 license](LICENSE).

## Acknowledgement

This project is based on [diffusers](https://github.com/huggingface/diffusers), [LDM](https://github.com/CompVis/latent-diffusion), [OSEDiff](https://github.com/cswry/OSEDiff), and [PiSA-SR](https://github.com/csslc/PiSA-SR). Thanks for their awesome work.