	---
	license: cc-by-nc-sa-4.0
	tags:
	- virtual-try-on
	- diffusion
	- image-to-image
	- computer-vision
	datasets:
	- viton-hd
	pipeline_tag: image-to-image
	arxiv: 2605.01296
	---

	# SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On

	ICPR 2026

	SIFT-VTON is a diffusion-based virtual try-on model that uses SIFT feature correspondences between a garment image and a person image to supervise cross-attention maps during training, improving geometric alignment in the generated results.

	Paper: [arXiv:2605.01296](https://arxiv.org/abs/2605.01296)

	This model is derived from [StableVITON](https://github.com/rlawjdghek/stableviton) and built on a Stable Diffusion backbone.

	The code repository is available at [takesukeDS/SIFT-VTON](https://github.com/takesukeDS/SIFT-VTON).
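
To make the idea of supervising cross-attention with correspondences concrete, here is a minimal sketch in plain Python. It is not the loss defined in the paper or implemented in the repository. It assumes, purely for illustration, that each matched keypoint pair `(q, k)` should push the attention of person position `q` toward garment position `k`, and it scores this with a negative log-likelihood. The function names `softmax` and `correspondence_loss` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def correspondence_loss(scores, matches):
    """Mean negative log attention placed on matched garment positions.

    scores:  scores[q][k] is the raw attention score between person
             query position q and garment key position k.
    matches: list of (q, k) index pairs, e.g. obtained from SIFT
             keypoint matching (the matching itself is not shown here).

    Assumption: this loss form is illustrative only, not the paper's.
    """
    if not matches:
        return 0.0
    total = 0.0
    for q, k in matches:
        attention = softmax(scores[q])
        total += -math.log(attention[k])
    return total / len(matches)


# Two person positions, three garment positions, two matched pairs.
matches = [(0, 0), (1, 1)]
aligned = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
misaligned = [[0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]

# Attention that follows the matches yields the lower loss.
print(correspondence_loss(aligned, matches) < correspondence_loss(misaligned, matches))
```

Under this assumed formulation, attention maps that agree with the matches receive a smaller penalty, which is the intuition behind using correspondences as a training signal.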

	## Model Files

	| File | Description |
	|---|---|
	| `model.ckpt` | Model checkpoint |
	| `config.yaml` | Model architecture config |

	## Requirements

	Clone the code repository and set up the environment:

	```bash
	git clone https://github.com/takesukeDS/SIFT-VTON
	cd SIFT-VTON

	conda create -n siftvton python==3.12.8 -y
	conda activate siftvton

	pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
	pip install matplotlib einops omegaconf yacs
	pip install pytorch-lightning==2.5.2
	pip install open-clip-torch==3.1.0
	pip install diffusers==0.34.0
	pip install scipy==1.16.1
	pip install transformers==4.55.0
	conda install -c anaconda ipython -y
	pip install scikit-image clean-fid albumentations==2.0.8
	pip install -U xformers==0.0.31.post1
	pip install tensorboard
	pip install accelerate==1.10.0
	pip install numpy==2.2.6
	```
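
After installation, it can be useful to confirm that the pinned versions actually landed in the environment. The sketch below is an optional helper, not part of the repository. It uses the standard-library `importlib.metadata` and copies the pins from the commands above. The names `PINNED`, `installed_version`, and `find_mismatches` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pins copied from the installation commands above.
PINNED = {
    "torch": "2.7.1",
    "torchvision": "0.22.1",
    "pytorch-lightning": "2.5.2",
    "open-clip-torch": "3.1.0",
    "diffusers": "0.34.0",
    "scipy": "1.16.1",
    "transformers": "4.55.0",
    "albumentations": "2.0.8",
    "xformers": "0.0.31.post1",
    "accelerate": "1.10.0",
    "numpy": "2.2.6",
}


def installed_version(name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=installed_version):
    """Map package name -> (wanted, found) for every pin that is not met."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # CUDA wheels may report a local suffix such as "2.7.1+cu128";
        # compare only the part before the "+".
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches().items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the script inside the `siftvton` environment prints one line per package whose version differs from the pin, and nothing if everything matches.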

	## Data

	Download the [VITON-HD dataset](https://github.com/shadow2496/VITON-HD) and prepare the following directory structure:

	```
	[data_root_dir]
	└── test
	    |-- image
	    |-- image-densepose
	    |-- agnostic-v3.2
	    |-- agnostic-mask
	    |-- cloth
	    |-- cloth-mask
	```

	A pairs file named `yahavton_test_pairs.txt` is also required directly under `[data_root_dir]`. Each line lists one pair: an image filename followed by a cloth filename, separated by a space:
	```
	image_00001.jpg cloth_00001.jpg
	image_00002.jpg cloth_00002.jpg
	...
	```
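
Before running inference, the layout described above can be checked with a short script. The sketch below is an optional helper, not part of the repository. It assumes that the first column of the pairs file refers to files in `image/` and the second to files in `cloth/`. The names `REQUIRED_DIRS` and `check_dataset` are hypothetical.

```python
from pathlib import Path

# Sub-directories expected under [data_root_dir]/test, as listed above.
REQUIRED_DIRS = (
    "image",
    "image-densepose",
    "agnostic-v3.2",
    "agnostic-mask",
    "cloth",
    "cloth-mask",
)


def check_dataset(data_root_dir, phase="test", pairs_name="yahavton_test_pairs.txt"):
    """Return a list of human-readable problems; an empty list means OK."""
    root = Path(data_root_dir)
    problems = []

    for name in REQUIRED_DIRS:
        if not (root / phase / name).is_dir():
            problems.append(f"missing directory: {phase}/{name}")

    pairs_path = root / pairs_name
    if not pairs_path.is_file():
        problems.append(f"missing pairs file: {pairs_name}")
        return problems

    lines = pairs_path.read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 2:
            problems.append(f"line {line_no}: expected 2 filenames, got {len(fields)}")
            continue
        image_name, cloth_name = fields
        # Assumption: column 1 lives in image/, column 2 in cloth/.
        if not (root / phase / "image" / image_name).is_file():
            problems.append(f"line {line_no}: image not found: {image_name}")
        if not (root / phase / "cloth" / cloth_name).is_file():
            problems.append(f"line {line_no}: cloth not found: {cloth_name}")
    return problems
```

Calling `check_dataset("[data_root_dir]")` with the real path returns an empty list when every directory exists and every listed file is present.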

	## Inference

	```bash
	python inference_hf.py \
	--repo_id takesukeDS/SIFT-VTON \
	--data_root_dir [data_root_dir] \
	--save_dir [output_dir] \
	--phase test \
	--batch_size 4 \
	--start_from_noised_agn \
	--cfg_scale 1.5 \
	--repaint
	```

	On the first run, the model and config are downloaded automatically from this Hub repository and cached locally (by default under `~/.cache/huggingface/hub/`).
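
For readers who want to fetch the files ahead of time, or inspect them without running inference, the download can be approximated with `hf_hub_download` from `huggingface_hub`. This is a sketch of the general mechanism, not the code inside `inference_hf.py`, whose internals are not documented here. The names `REPO_ID`, `FILENAMES`, and `fetch_model_files` are hypothetical, as is the injectable `download` parameter, which exists only to make the helper easy to test offline.

```python
REPO_ID = "takesukeDS/SIFT-VTON"
FILENAMES = ("model.ckpt", "config.yaml")


def fetch_model_files(repo_id=REPO_ID, filenames=FILENAMES, download=None):
    """Return {filename: local path} for each file in the Hub repo.

    By default this calls huggingface_hub.hf_hub_download, which stores
    files in the local Hugging Face cache and reuses them on later calls.
    """
    if download is None:
        # Imported lazily so this module loads without the dependency.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return {name: download(repo_id=repo_id, filename=name) for name in filenames}


if __name__ == "__main__":
    # Requires network access on the first call.
    for name, path in fetch_model_files().items():
        print(f"{name} -> {path}")
```

Pre-fetching this way can help on machines where the inference job itself has no network access, provided the cache directory is shared.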

	### Key inference arguments

	| Argument | Default | Description |
	|---|---|---|
	| `--repo_id` | (none) | This Hub repo (`takesukeDS/SIFT-VTON`) |
	| `--phase` | `test` | `test` for the test split, `train` for the training split |
	| `--cfg_scale` | `1.0` | Classifier-free guidance scale |
	| `--denoise_steps` | `50` | Number of PLMS denoising steps |
	| `--start_from_noised_agn` | off | Start denoising from noised agnostic image instead of pure noise (recommended) |
	| `--repaint` | off | Paste back the unmasked region from the original image after generation (recommended) |
	| `--unpair` | off | Run unpaired inference (person and garment from different samples) |
	| `--batch_size` | `16` | Batch size |
	| `--seed` | `1235` | Random seed |
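
The table above can be read as an `argparse` specification. The sketch below mirrors only the listed arguments and their defaults; it is not the parser from `inference_hf.py`, and it omits `--data_root_dir` and `--save_dir`, which the table does not describe. Treating `--repo_id` as required and restricting `--phase` to two choices are assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a parser mirroring the key inference arguments in the table."""
    parser = argparse.ArgumentParser(description="Sketch of the key inference arguments")
    # Assumption: no default is listed, so the argument is treated as required.
    parser.add_argument("--repo_id", required=True, help="Hub repo ID")
    parser.add_argument("--phase", default="test", choices=["test", "train"])
    parser.add_argument("--cfg_scale", type=float, default=1.0,
                        help="Classifier-free guidance scale")
    parser.add_argument("--denoise_steps", type=int, default=50,
                        help="Number of PLMS denoising steps")
    # Boolean switches listed as "off" by default.
    parser.add_argument("--start_from_noised_agn", action="store_true")
    parser.add_argument("--repaint", action="store_true")
    parser.add_argument("--unpair", action="store_true")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--seed", type=int, default=1235)
    return parser


# Parse a subset of the flags from the example command above.
args = build_parser().parse_args(
    ["--repo_id", "takesukeDS/SIFT-VTON", "--cfg_scale", "1.5", "--repaint"]
)
print(args.cfg_scale, args.repaint, args.batch_size)
```

Note that the example command overrides two defaults (`--batch_size 4` and `--cfg_scale 1.5`) and enables both recommended switches.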

	## Citation

	```bibtex
	@misc{takemoto2026siftvton,
	title = {{SIFT-VTON}: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On},
	author = {Takemoto, Kosuke and Koshinaka, Takafumi},
	year = {2026},
	eprint = {2605.01296},
	archivePrefix = {arXiv},
	primaryClass = {cs.CV},
	url = {https://arxiv.org/abs/2605.01296}
	}
	```

	## License

	Licensed under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode) license.