---
license: cc-by-nc-sa-4.0
tags:
- virtual-try-on
- diffusion
- image-to-image
- computer-vision
datasets:
- viton-hd
pipeline_tag: image-to-image
arxiv: 2605.01296
---
# SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On
**ICPR 2026**
SIFT-VTON is a diffusion-based virtual try-on model that uses SIFT feature correspondences between a garment image and a person image to supervise cross-attention maps during training, improving geometric alignment in the generated results.
Paper: [arXiv:2605.01296](https://arxiv.org/abs/2605.01296)
This model is derived from [StableVITON](https://github.com/rlawjdghek/stableviton) and built on a Stable Diffusion backbone.
The code repository is available at [takesukeDS/SIFT-VTON](https://github.com/takesukeDS/SIFT-VTON).
## Model Files
| File | Description |
|---|---|
| `model.ckpt` | Model checkpoint |
| `config.yaml` | Model architecture config |
## Requirements
Clone the code repository and set up the environment:
```bash
git clone https://github.com/takesukeDS/SIFT-VTON
cd SIFT-VTON
conda create -n siftvton python==3.12.8 -y
conda activate siftvton
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
pip install matplotlib einops omegaconf yacs
pip install pytorch-lightning==2.5.2
pip install open-clip-torch==3.1.0
pip install diffusers==0.34.0
pip install scipy==1.16.1
pip install transformers==4.55.0
conda install -c anaconda ipython -y
pip install scikit-image clean-fid albumentations==2.0.8
pip install -U xformers==0.0.31.post1
pip install tensorboard
pip install accelerate==1.10.0
pip install numpy==2.2.6
```
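After installation, it can help to confirm that the pinned versions were actually picked up. The following is a minimal sketch using only the standard library; the `EXPECTED` mapping and the `find_mismatches` helper are illustrative names, not part of the SIFT-VTON repository, and the pins are copied from the commands above.

```python
from importlib import metadata

# Pins copied from the install commands above (distribution name -> version).
EXPECTED = {
    "torch": "2.7.1",
    "torchvision": "0.22.1",
    "pytorch-lightning": "2.5.2",
    "open-clip-torch": "3.1.0",
    "diffusers": "0.34.0",
    "scipy": "1.16.1",
    "transformers": "4.55.0",
    "albumentations": "2.0.8",
    "xformers": "0.0.31.post1",
    "accelerate": "1.10.0",
    "numpy": "2.2.6",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # CUDA wheels may report a local suffix such as "2.7.1+cu128"; ignore it.
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches(EXPECTED)` inside the `siftvton` environment should return an empty dict when every pin matches.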
## Data
Download the [VITON-HD dataset](https://github.com/shadow2496/VITON-HD) and prepare the following directory structure:
```
[data_root_dir]
└── test
    |-- image
    |-- image-densepose
    |-- agnostic-v3.2
    |-- agnostic-mask
    |-- cloth
    |-- cloth-mask
```
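To catch a missing folder before starting a long run, the layout above can be checked with a few lines of Python. This is a sketch only; `REQUIRED_SUBDIRS` and `missing_subdirs` are hypothetical names that simply mirror the tree shown above.

```python
from pathlib import Path

# Subdirectories listed in the tree above.
REQUIRED_SUBDIRS = (
    "image",
    "image-densepose",
    "agnostic-v3.2",
    "agnostic-mask",
    "cloth",
    "cloth-mask",
)


def missing_subdirs(data_root_dir, phase="test"):
    """Return the required subdirectories that are absent under <root>/<phase>."""
    phase_dir = Path(data_root_dir) / phase
    return [name for name in REQUIRED_SUBDIRS if not (phase_dir / name).is_dir()]
```

An empty list from `missing_subdirs("[data_root_dir]")` means every expected folder is present.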
A pairs file `yahavton_test_pairs.txt` is also required under `[data_root_dir]`. Each line lists one person image filename followed by one cloth filename, separated by a space:
```
image_00001.jpg cloth_00001.jpg
image_00002.jpg cloth_00002.jpg
...
```
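The pairs format above is simple enough to validate before inference. The sketch below assumes whitespace-separated filenames, exactly two per line, as in the example; `parse_pairs` is an illustrative helper, not a function from the repository.

```python
def parse_pairs(text):
    """Parse pairs-file text into a list of (image_name, cloth_name) tuples."""
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 filenames, got {len(fields)}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs
```

For example, reading the file with `Path(...).read_text()` and passing the result to `parse_pairs` raises a `ValueError` that names the first malformed line.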
## Inference
```bash
python inference_hf.py \
--repo_id takesukeDS/SIFT-VTON \
--data_root_dir [data_root_dir] \
--save_dir [output_dir] \
--phase test \
--batch_size 4 \
--start_from_noised_agn \
--cfg_scale 1.5 \
--repaint
```
The model and config are downloaded automatically from this Hub repository on the first run and cached locally under `~/.cache/huggingface/hub/`.
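To find or clean up the cached files, the following sketch computes where this repository's cache folder is expected to live. It assumes the usual `huggingface_hub` conventions (a `models--<owner>--<name>` folder, with `HF_HUB_CACHE` and `HF_HOME` overriding the default location) and does not handle `XDG_CACHE_HOME`; the helper names are illustrative.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Return the Hub cache root, honoring HF_HUB_CACHE and HF_HOME if set."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # Default location mentioned above (XDG_CACHE_HOME is not considered here).
    return Path.home() / ".cache" / "huggingface" / "hub"


def repo_cache_dir(repo_id, env=None):
    """Return the expected cache folder for a model repository."""
    return hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))
```

With no overrides set, `repo_cache_dir("takesukeDS/SIFT-VTON")` points at `~/.cache/huggingface/hub/models--takesukeDS--SIFT-VTON`.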
### Key inference arguments
| Argument | Default | Description |
|---|---|---|
| `--repo_id` | — | This Hub repo (`takesukeDS/SIFT-VTON`) |
| `--phase` | `test` | `test` for the test split, `train` for the training split |
| `--cfg_scale` | `1.0` | Classifier-free guidance scale |
| `--denoise_steps` | `50` | Number of PLMS denoising steps |
| `--start_from_noised_agn` | off | Start denoising from the noised agnostic image instead of pure noise (recommended) |
| `--repaint` | off | Paste back the unmasked region from the original image after generation (recommended) |
| `--unpair` | off | Run unpaired inference (person and garment from different samples) |
| `--batch_size` | `16` | Batch size |
| `--seed` | `1235` | Random seed |
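When scripting several runs (for example paired and unpaired evaluation), it can be convenient to assemble the command line programmatically. The sketch below is a hypothetical wrapper: `build_command` is not part of the repository, and it only uses the flags and defaults listed in this README.

```python
import shlex


def build_command(
    data_root_dir,
    save_dir,
    *,
    repo_id="takesukeDS/SIFT-VTON",
    phase="test",
    batch_size=16,
    cfg_scale=1.0,
    denoise_steps=50,
    seed=1235,
    start_from_noised_agn=False,
    repaint=False,
    unpair=False,
):
    """Build the argument list for inference_hf.py from the documented options."""
    cmd = [
        "python", "inference_hf.py",
        "--repo_id", repo_id,
        "--data_root_dir", str(data_root_dir),
        "--save_dir", str(save_dir),
        "--phase", phase,
        "--batch_size", str(batch_size),
        "--cfg_scale", str(cfg_scale),
        "--denoise_steps", str(denoise_steps),
        "--seed", str(seed),
    ]
    # Boolean switches are appended only when enabled (they default to off).
    switches = (
        ("--start_from_noised_agn", start_from_noised_agn),
        ("--repaint", repaint),
        ("--unpair", unpair),
    )
    cmd.extend(flag for flag, enabled in switches if enabled)
    return cmd


def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, or printed with `as_shell` for inspection.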
## Citation
```bibtex
@misc{takemoto2026siftvton,
title = {{SIFT-VTON}: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On},
author = {Takemoto, Kosuke and Koshinaka, Takafumi},
year = {2026},
eprint = {2605.01296},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.01296}
}
```
## License
Licensed under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode) license.