Add `library_name` metadata tag
This PR enhances the model card by adding the `library_name: diffusers` metadata tag. Since the model is compatible with the `diffusers` library, this tag enables the automatic "How to use" widget on the Hugging Face Hub, giving users a ready-to-run code snippet for the model.
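
For context, a minimal sketch of the kind of snippet such a widget could surface, assuming the Flux-Kontext backbone and the LoRA filename listed in the card (the base-model id and loading pattern here are assumptions, not documented usage):

```python
import torch
from diffusers import FluxKontextPipeline

# Load the Flux-Kontext backbone, then attach the EVTAR LoRA weights
# (filename taken from the model card's checkpoint list).
pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "qihoo360/EVTAR",
    weight_name="1024_768_pytorch_lora_weights.safetensors",
)
```
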
README.md
CHANGED
---
language:
- en
license: apache-2.0
pipeline_tag: image-to-image
library_name: diffusers
---

# End2End Virtual Tryon with Visual Reference

[](https://huggingface.co/qihoo360/EVTAR) [](https://arxiv.org/abs/2511.00956)



We propose **EVTAR**, an End-to-End Virtual Try-on model with Additional Visual Reference, which directly fits the target garment onto the person image while incorporating reference images to enhance the model's ability to preserve and accurately depict clothing details.

## 💡 GitHub
[EVTAR](https://github.com/360CVGroup/EVTAR)

## 💡 Pretrained Models

We provide pretrained backbone networks and LoRA weights for testing and deployment. Please download the `.safetensors` files from [here] and place them in the `checkpoints` directory.

- `1024_768_pytorch_lora_weights.safetensors`: 1024x768 resolution, high-quality virtual fitting model (✅ **Available**)

## 💡 Update

- [x] [2025.10.11] Release the virtual try-on inference code and LoRA weights.
- [x] [2025.10.13] Release the technical report on arXiv.

## 💪 Highlight Feature

- **An End-to-End virtual try-on model:** It can function either as an inpainting model that places the target clothing into masked areas, or as a direct garment transfer onto the human body.
- **Reference images to enhance try-on performance:** To emulate how shoppers attend to the overall wearing effect rather than the garment itself, our model can take images of a model wearing the target clothing as additional input, thereby better preserving its material texture and design details.
- **Improved performance:** Our model achieves state-of-the-art performance on public benchmarks and demonstrates strong generalization to in-the-wild inputs.

## 🧩 Environment Setup

```
conda create -n EVTAR python=3.12 -y
conda activate EVTAR
pip install -r requirements.txt
```

## 📂 Preparation of Dataset and Pretrained Models

### Dataset

Currently, we provide a small test set with additional reference images (a different person wearing the target cloth) for trying our model. We plan to release the reference data generation code, along with our proposed full dataset containing model reference images, in the future.

Nevertheless, inference can still be performed in a reference-free setting on public benchmarks, including [VITON-HD](https://github.com/shadow2496/VITON-HD) and [DressCode](https://github.com/aimagelab/dress-code).

### Reference Data Preparation

One key feature of our method is the use of _reference data_, where an image of a different person wearing the target garment is provided to help the model imagine how the target person would look in that garment. In most online shopping applications, such additional reference images are commonly used by customers to better visualize the clothing. However, publicly available datasets such as VITON-HD and DressCode do not include such reference data, so we generate them ourselves.

Please prepare the pretrained weights of the Flux-Kontext model and the Qwen2.5-VL-32B model. You can then generate the additional reference images using the following command:

```
accelerate launch --num_processes 8 --main_process_port 29500 generate_reference.py \
    --instance_data_dir "path_to_your_datasets" \
    ... \
    --desc_path "desc.json"
```
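
`generate_reference.py` is only partially shown in the diff. As a rough sketch of the editing step it presumably wraps, here is how the Flux-Kontext pipeline in `diffusers` can turn a garment image into a reference image of a person wearing it; in the repository the prompt would come from a Qwen2.5-VL garment description, and the repo id, paths, and prompt below are illustrative assumptions:

```python
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Edit the garment image so that a different person is shown wearing it.
garment = load_image("datasets/cloth/00001_cloth.jpg")  # hypothetical path
reference = pipe(
    image=garment,
    prompt="A full-body photo of a person wearing this garment",  # illustrative
    guidance_scale=2.5,
).images[0]
reference.save("datasets/reference/00001_ref.jpg")
```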

### Pretrained Models

We provide pretrained backbone networks and LoRA weights for testing and deployment. Please download the `.safetensors` files from [here] and place them in the `checkpoints` directory.
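
The `[here]` link is an unresolved placeholder in the card. Assuming the weights are hosted in the `qihoo360/EVTAR` repository referenced by the badge above, a minimal sketch of fetching them with `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Download only the .safetensors files into the local `checkpoints` directory.
snapshot_download(
    repo_id="qihoo360/EVTAR",           # assumed from the model-card badge
    allow_patterns=["*.safetensors"],
    local_dir="checkpoints",
)
```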

## ⏳ Inference Pipeline

Here we provide the inference command for EVTAR; its flags are described below.

```
accelerate launch --num_processes 8 --main_process_port 29500 inference.py \
    ... \
    --use_person
```

- `pretrained_model_name_or_path`: Path to the downloaded Flux-Kontext model weights.
- `instance_data_dir`: Path to your dataset. For inference on VITON-HD or DressCode, ensure that the words "viton" or "DressCode" appear in the path.
- `output_dir`: Path to the downloaded or trained LoRA weights.
- `cond_scale`: Resize scale of the reference image during training. Defaults to `1.0` for $512\times384$ and `2.0` for $1024\times768$ resolution.
- `use_reference`: Whether to use an additional reference image as input.
- `use_different`: **Only applicable for VITON/DressCode inference.** Whether to use different cloth-person pairs.
- `use_person`: **Only applicable for VITON/DressCode inference.** Whether to use the unmasked person image instead of the agnostic masked image as input for the virtual try-on task.

## 📊 Evaluation

We quantitatively evaluate the quality of virtual try-on results using FID, KID, SSIM, and LPIPS. Here, we provide the evaluation code for the VITON-HD and DressCode datasets.

```
# Evaluation on VITON-HD dataset
CUDA_VISIBLE_DEVICES=0 python eval_dresscode.py \
    ... \
    --paired
```

```
# Evaluation on DressCode dataset
CUDA_VISIBLE_DEVICES=0 python eval.py \
    ... \
    --pred_folder_base [[path_to_your_generated_image_folder]]
```

- `paired`: If you perform unpaired generation, where different garments are fitted onto the target person, you should enable this flag during evaluation.
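
The evaluation scripts are only partially shown in the diff. As a self-contained illustration of the four metrics (not the repository's code; random tensors stand in for real and generated try-on batches), a sketch using `torchmetrics`:

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Dummy uint8 batches in (N, 3, H, W) layout standing in for ground-truth
# and generated try-on images.
real = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

# FID and KID compare feature distributions of the two image sets.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)

kid = KernelInceptionDistance(subset_size=4)
kid.update(real, real=True)
kid.update(fake, real=False)
kid_mean, kid_std = kid.compute()

# SSIM and LPIPS compare aligned image pairs and expect float inputs.
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
real_f, fake_f = real.float() / 255.0, fake.float() / 255.0

print(f"FID:   {fid.compute():.3f}")
print(f"KID:   {kid_mean:.4f} +/- {kid_std:.4f}")
print(f"SSIM:  {ssim(fake_f, real_f):.4f}")
print(f"LPIPS: {lpips(fake_f, real_f):.4f}")
```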

Evaluation results on the VITON-HD dataset:



Evaluation results on the DressCode dataset:



## 🌸 Acknowledgement

This code is mainly built upon the [Diffusers](https://github.com/huggingface/diffusers/tree/main), [Flux](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/flux), and [CatVTON](https://github.com/Zheng-Chong/CatVTON/) repositories. Thanks so much for their solid work!

## 💖 Citation

If you find this repository useful, please consider citing our paper:
```
@misc{li2025evtarendtoendtryadditional,