| --- |
| license: cc-by-nc-sa-4.0 |
| tags: |
| - virtual-try-on |
| - diffusion |
| - image-to-image |
| - computer-vision |
| datasets: |
| - viton-hd |
| pipeline_tag: image-to-image |
| arxiv: 2605.01296 |
| --- |
| |
| # SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On |
|
|
| **ICPR 2026** |
|
|
| SIFT-VTON is a diffusion-based virtual try-on model that uses SIFT feature correspondences between a garment image and a person image to supervise cross-attention maps during training, improving geometric alignment in the generated results. |
|
|
| Paper: [arXiv:2605.01296](https://arxiv.org/abs/2605.01296) |
|
|
| This model is derived from [StableVITON](https://github.com/rlawjdghek/stableviton) and built on a Stable Diffusion backbone. |
|
|
| The code repository is available at [takesukeDS/SIFT-VTON](https://github.com/takesukeDS/SIFT-VTON). |
|
|
| ## Model Files |
|
|
| | File | Description | |
| |---|---| |
| | `model.ckpt` | Model checkpoint | |
| | `config.yaml` | Model architecture config | |
|
|
| ## Requirements |
|
|
| Clone the code repository and set up the environment: |
|
|
| ```bash |
| git clone https://github.com/takesukeDS/SIFT-VTON |
| cd SIFT-VTON |
| |
| conda create -n siftvton python==3.12.8 -y |
| conda activate siftvton |
| |
| pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128 |
| pip install matplotlib einops omegaconf yacs |
| pip install pytorch-lightning==2.5.2 |
| pip install open-clip-torch==3.1.0 |
| pip install diffusers==0.34.0 |
| pip install scipy==1.16.1 |
| pip install transformers==4.55.0 |
| conda install -c anaconda ipython -y |
| pip install scikit-image clean-fid albumentations==2.0.8 |
| pip install -U xformers==0.0.31.post1 |
| pip install tensorboard |
| pip install accelerate==1.10.0 |
| pip install numpy==2.2.6 |
| ``` |
|
|
| ## Data |
|
|
| Download the [VITON-HD dataset](https://github.com/shadow2496/VITON-HD) and prepare the following directory structure: |
|
|
| ``` |
| [data_root_dir] |
| └── test |
|     ├── image |
|     ├── image-densepose |
|     ├── agnostic-v3.2 |
|     ├── agnostic-mask |
|     ├── cloth |
|     └── cloth-mask |
| ``` |
|
|
| A pairs file, `yahavton_test_pairs.txt`, is also required under `[data_root_dir]`. It lists one pair per line, a person image filename followed by a cloth filename: |
| ``` |
| image_00001.jpg cloth_00001.jpg |
| image_00002.jpg cloth_00002.jpg |
| ... |
| ``` |
|
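Before running inference, it can help to confirm that the layout and pairs file described above are in place. The following is a minimal sketch, not part of the SIFT-VTON repository: `read_pairs` and `check_dataset` are hypothetical helper names, and the check assumes that person images live in `image/` and garments in `cloth/` under the phase directory, as in the usual VITON-HD layout.

```python
from pathlib import Path

# Sub-directories listed in the directory structure above.
REQUIRED_SUBDIRS = [
    "image", "image-densepose", "agnostic-v3.2",
    "agnostic-mask", "cloth", "cloth-mask",
]


def read_pairs(pairs_file):
    """Parse 'image cloth' lines into (image, cloth) tuples; blank lines are skipped."""
    pairs = []
    lines = Path(pairs_file).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 2:
            raise ValueError(
                f"{pairs_file}:{line_no}: expected 2 filenames, got {len(parts)}"
            )
        pairs.append((parts[0], parts[1]))
    return pairs


def check_dataset(data_root_dir, phase="test",
                  pairs_name="yahavton_test_pairs.txt"):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(data_root_dir)
    problems = []
    for sub in REQUIRED_SUBDIRS:
        if not (root / phase / sub).is_dir():
            problems.append(f"missing directory: {phase}/{sub}")
    pairs_path = root / pairs_name
    if not pairs_path.is_file():
        problems.append(f"missing pairs file: {pairs_name}")
        return problems
    # Assumption: image filenames resolve under image/, cloth filenames under cloth/.
    for image_name, cloth_name in read_pairs(pairs_path):
        if not (root / phase / "image" / image_name).is_file():
            problems.append(f"missing image: {phase}/image/{image_name}")
        if not (root / phase / "cloth" / cloth_name).is_file():
            problems.append(f"missing cloth: {phase}/cloth/{cloth_name}")
    return problems
```

Calling `check_dataset("[data_root_dir]")` with the real dataset path returns a list of missing directories or files, so an empty list suggests the data is ready for `inference_hf.py`.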
|
| ## Inference |
|
|
| ```bash |
| python inference_hf.py \ |
| --repo_id takesukeDS/SIFT-VTON \ |
| --data_root_dir [data_root_dir] \ |
| --save_dir [output_dir] \ |
| --phase test \ |
| --batch_size 4 \ |
| --start_from_noised_agn \ |
| --cfg_scale 1.5 \ |
| --repaint |
| ``` |
|
|
| The model and config are downloaded automatically from this Hub repository on the first run and cached locally, by default under `~/.cache/huggingface/hub/`. |
|
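To fetch the files ahead of time (for example on a machine that will later run offline), something like the sketch below may work. It uses `hf_hub_download` from the `huggingface_hub` package; whether `inference_hf.py` downloads the files the same way is an assumption. `fetch_model_files` is a hypothetical helper, and the filenames are taken from the Model Files table.

```python
def fetch_model_files(repo_id="takesukeDS/SIFT-VTON",
                      filenames=("model.ckpt", "config.yaml"),
                      download=None):
    """Download each file into the local Hub cache and return {filename: local_path}.

    `download` is injectable so the function can be exercised without network
    access; by default it is huggingface_hub.hf_hub_download.
    """
    if download is None:
        # Imported lazily so the module loads even if huggingface_hub is absent.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return {name: download(repo_id=repo_id, filename=name) for name in filenames}
```

Calling `fetch_model_files()` once with network access populates the cache; the returned paths point at the cached checkpoint and config.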
|
| ### Key inference arguments |
|
|
| | Argument | Default | Description | |
| |---|---|---| |
| | `--repo_id` | — | This Hub repo (`takesukeDS/SIFT-VTON`) | |
| | `--phase` | `test` | `test` for the test split, `train` for the training split | |
| | `--cfg_scale` | `1.0` | Classifier-free guidance scale | |
| | `--denoise_steps` | `50` | Number of PLMS denoising steps | |
| | `--start_from_noised_agn` | off | Start denoising from the noised agnostic image instead of pure noise (recommended) | |
| | `--repaint` | off | Paste back the unmasked region from the original image after generation (recommended) | |
| | `--unpair` | off | Run unpaired inference (person and garment from different samples) | |
| | `--batch_size` | `16` | Batch size | |
| | `--seed` | `1235` | Random seed | |
|
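As a rough illustration of what two of these arguments control, the sketch below shows the standard classifier-free guidance formula and a mask-based paste-back step. It is not the repository's implementation: `apply_cfg` and `repaint` are hypothetical names, and the mask convention (1 where the try-on region was regenerated, 0 where the original pixels are kept) is an assumption.

```python
import numpy as np


def apply_cfg(eps_uncond, eps_cond, cfg_scale):
    """Standard classifier-free guidance on noise predictions.

    With cfg_scale == 1.0 this reduces to the conditional prediction.
    """
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)


def repaint(generated, original, mask):
    """Keep generated pixels inside the mask and original pixels outside it.

    generated, original: HxWxC arrays; mask: HxW array with values in [0, 1]
    (assumed convention: 1 marks the regenerated region).
    """
    generated = np.asarray(generated, dtype=np.float32)
    original = np.asarray(original, dtype=np.float32)
    mask = np.asarray(mask, dtype=np.float32)[..., None]  # HxW -> HxWx1
    return mask * generated + (1.0 - mask) * original
```

Under this reading, the default `--cfg_scale 1.0` applies no extra guidance, while the `1.5` used in the example command pushes the prediction further toward the conditional branch.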
|
| ## Citation |
|
|
| ```bibtex |
| @misc{takemoto2026siftvton, |
| title = {{SIFT-VTON}: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On}, |
| author = {Takemoto, Kosuke and Koshinaka, Takafumi}, |
| year = {2026}, |
| eprint = {2605.01296}, |
| archivePrefix = {arXiv}, |
| primaryClass = {cs.CV}, |
| url = {https://arxiv.org/abs/2605.01296} |
| } |
| ``` |
|
|
| ## License |
|
|
| Licensed under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode) license. |
|
|