---
license: cc-by-nc-sa-4.0
tags:
  - virtual-try-on
  - diffusion
  - image-to-image
  - computer-vision
datasets:
  - viton-hd
pipeline_tag: image-to-image
arxiv: 2605.01296
---

# SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On

**ICPR 2026**

SIFT-VTON is a diffusion-based virtual try-on model that uses SIFT feature correspondences between a garment image and a person image to supervise cross-attention maps during training, improving geometric alignment in the generated results.

Paper: [arXiv:2605.01296](https://arxiv.org/abs/2605.01296)

This model is derived from [StableVITON](https://github.com/rlawjdghek/stableviton) and built on a Stable Diffusion backbone.

The code repository is available at [takesukeDS/SIFT-VTON](https://github.com/takesukeDS/SIFT-VTON).

## Model Files

| File | Description |
|---|---|
| `model.ckpt` | Model checkpoint |
| `config.yaml` | Model architecture config |

## Requirements

Clone the code repository and set up the environment:

```bash
git clone https://github.com/takesukeDS/SIFT-VTON
cd SIFT-VTON

conda create -n siftvton python==3.12.8 -y
conda activate siftvton

pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
pip install matplotlib einops omegaconf yacs
pip install pytorch-lightning==2.5.2
pip install open-clip-torch==3.1.0
pip install diffusers==0.34.0
pip install scipy==1.16.1
pip install transformers==4.55.0
conda install -c anaconda ipython -y
pip install scikit-image clean-fid albumentations==2.0.8
pip install -U xformers==0.0.31.post1
pip install tensorboard
pip install accelerate==1.10.0
pip install numpy==2.2.6
```

## Data

Download the [VITON-HD dataset](https://github.com/shadow2496/VITON-HD) and prepare the following directory structure:

```
[data_root_dir]
└── test
    ├── image
    ├── image-densepose
    ├── agnostic-v3.2
    ├── agnostic-mask
    ├── cloth
    └── cloth-mask
```

A pairs file named `yahavton_test_pairs.txt` is also required directly under `[data_root_dir]`. Each line lists one person image filename and one cloth filename, separated by a space:

```
image_00001.jpg cloth_00001.jpg
image_00002.jpg cloth_00002.jpg
...
```
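
Before running inference, it can help to confirm that the layout and the pairs file line up. The following is a minimal sketch, not part of the repository: `check_data_root` and `read_pairs` are hypothetical helpers, and the sketch assumes that the first name on each line refers to a file in `test/image` and the second to a file in `test/cloth`.

```python
from pathlib import Path

# Sub-directories of the phase folder, as listed in the layout above.
REQUIRED_SUBDIRS = [
    "image", "image-densepose", "agnostic-v3.2",
    "agnostic-mask", "cloth", "cloth-mask",
]


def read_pairs(pairs_path):
    """Parse a pairs file into (image_name, cloth_name) tuples."""
    pairs = []
    lines = Path(pairs_path).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(
                f"{pairs_path}:{line_no}: expected 2 fields, got {len(parts)}"
            )
        pairs.append((parts[0], parts[1]))
    return pairs


def check_data_root(data_root_dir, pairs_name="yahavton_test_pairs.txt", phase="test"):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(data_root_dir)
    problems = []
    for sub in REQUIRED_SUBDIRS:
        if not (root / phase / sub).is_dir():
            problems.append(f"missing directory: {phase}/{sub}")
    pairs_path = root / pairs_name
    if not pairs_path.is_file():
        problems.append(f"missing pairs file: {pairs_name}")
        return problems
    # Assumption: image names live in <phase>/image, cloth names in <phase>/cloth.
    for image_name, cloth_name in read_pairs(pairs_path):
        if not (root / phase / "image" / image_name).is_file():
            problems.append(f"missing image file: {phase}/image/{image_name}")
        if not (root / phase / "cloth" / cloth_name).is_file():
            problems.append(f"missing cloth file: {phase}/cloth/{cloth_name}")
    return problems
```

Calling `check_data_root("[data_root_dir]")` with the real path returns an empty list when every directory and every referenced file is present.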

## Inference

```bash
python inference_hf.py \
    --repo_id takesukeDS/SIFT-VTON \
    --data_root_dir [data_root_dir] \
    --save_dir [output_dir] \
    --phase test \
    --batch_size 4 \
    --start_from_noised_agn \
    --cfg_scale 1.5 \
    --repaint
```

The model and config are downloaded automatically from this Hub repository on the first run and cached locally under `~/.cache/huggingface/hub/`.
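
To fetch the files ahead of time (for example, before moving to a machine without network access), something like the sketch below may work. It uses `hf_hub_download` from the `huggingface_hub` package, which must be installed; `fetch_model_files` is a hypothetical helper, and whether `inference_hf.py` downloads the files the same way internally is an assumption.

```python
# File names taken from the "Model Files" table of this model card.
MODEL_FILES = ("model.ckpt", "config.yaml")


def fetch_model_files(repo_id="takesukeDS/SIFT-VTON", filenames=MODEL_FILES, download=None):
    """Download each file into the local Hub cache and return {filename: local_path}.

    `download` is an injection point for testing; by default the real
    huggingface_hub.hf_hub_download function is used.
    """
    if download is None:
        # Imported lazily so this module can be loaded without huggingface_hub.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return {name: download(repo_id=repo_id, filename=name) for name in filenames}
```

Calling `fetch_model_files()` once with network access populates the cache; the returned paths point at the cached copies.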

### Key inference arguments

| Argument | Default | Description |
|---|---|---|
| `--repo_id` | — | This Hub repo (`takesukeDS/SIFT-VTON`) |
| `--phase` | `test` | `test` for the test split, `train` for the training split |
| `--cfg_scale` | `1.0` | Classifier-free guidance scale |
| `--denoise_steps` | `50` | Number of PLMS denoising steps |
| `--start_from_noised_agn` | off | Start denoising from the noised agnostic image instead of pure noise (recommended) |
| `--repaint` | off | Paste back the unmasked region from the original image after generation (recommended) |
| `--unpair` | off | Run unpaired inference (person and garment from different samples) |
| `--batch_size` | `16` | Batch size |
| `--seed` | `1235` | Random seed |
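
To make `--cfg_scale` and `--repaint` more concrete, the sketch below shows the standard formulations of classifier-free guidance and mask-based paste-back in NumPy. It is an illustration under assumptions, not the repository's code: `cfg_combine` and `repaint` are hypothetical names, and the mask is assumed to be 1 inside the region to be generated and 0 elsewhere.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, cfg_scale):
    """Standard classifier-free guidance: move from the unconditional
    noise prediction toward the conditional one by a factor of cfg_scale."""
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)


def repaint(generated, original, inpaint_mask):
    """Keep generated pixels where inpaint_mask == 1, original pixels elsewhere.

    Assumption: images are float arrays of shape (H, W, C) and the mask has
    shape (H, W) or (H, W, 1).
    """
    mask = inpaint_mask.astype(generated.dtype)
    if mask.ndim == generated.ndim - 1:
        mask = mask[..., None]  # broadcast the mask over the channel axis
    return mask * generated + (1.0 - mask) * original
```

Under this formulation, a scale of `1.0` reduces to the conditional prediction alone, and larger values such as `1.5` weight the conditioning more strongly.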

## Citation

```bibtex
@misc{takemoto2026siftvton,
  title         = {{SIFT-VTON}: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On},
  author        = {Takemoto, Kosuke and Koshinaka, Takafumi},
  year          = {2026},
  eprint        = {2605.01296},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.01296}
}
```

## License

Licensed under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode) license.