Image-to-Image · English

liushanyuan18 committed · Commit 5402766 · verified · 1 parent: f6c80ee

**Update README.md**

Files changed (1): README.md (+79 −125)
# End2End Virtual Tryon with Visual Reference

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-EVTAR-ff9900?style=flat)](https://huggingface.co/qihoo360/EVTAR) [![arXiv](https://img.shields.io/badge/arXiv-2511.00956-B31B1B?style=flat)](https://arxiv.org/abs/2511.00956)

![examples](examples.png)

We propose **EVTAR**, an End-to-End Virtual Try-on model with Additional Visual Reference, which directly fits the target garment onto the person image while incorporating reference images to enhance the model's ability to preserve and accurately depict clothing details.

## 💡 GitHub
[EVTAR](https://github.com/360CVGroup/EVTAR)

## 💡 Pretrained Models

We provide pretrained backbone networks and LoRA weights for testing and deployment. Please download the `.safetensors` files from [here] and place them in the `checkpoints` directory.
`1024_768_pytorch_lora_weights.safetensors`: 1024x768 resolution high-quality virtual fitting model

🔒 **Coming Soon**
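Once the weights are published, one way to fetch them is the Hugging Face CLI; a minimal sketch (the exact filenames in the `qihoo360/EVTAR` repo are not listed in this README, so the include pattern below is an assumption):

```
# requires: pip install -U "huggingface_hub[cli]"
mkdir -p checkpoints
# pull the published .safetensors weights from the EVTAR model repo into checkpoints/
huggingface-cli download qihoo360/EVTAR \
    --include "*.safetensors" \
    --local-dir checkpoints
```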
 
- [x] [2025.10.11] Release the virtual try-on inference code and LoRA weights.

- [x] [2025.10.13] Release the technical report on arXiv.
## 💪 Highlight Features

- **An end-to-end virtual try-on model:** EVTAR can function either as an inpainting model that places the target clothing into masked regions, or as a direct garment transfer onto the human body.

- **Using a reference image to enhance try-on performance:** To emulate how shoppers attend to the overall wearing effect rather than the garment itself when shopping online, our model can take images of a model wearing the target clothing as input, thereby better preserving its material texture and design details.

- **Improved performance:** Our model achieves state-of-the-art performance on public benchmarks and demonstrates strong generalization to in-the-wild inputs.
## 🧩 Environment Setup

```
conda create -n EVTAR python=3.12 -y
conda activate EVTAR
pip install -r requirements.txt
```
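Optionally, a quick sanity check of the environment (this assumes the requirements pull in PyTorch and `accelerate`, which the launch commands below rely on):

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
accelerate env
```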
 
### Dataset

Currently, we provide a small test set with additional reference images (a different person wearing the target garment) for trying our model. We plan to release the reference data generation code, along with our proposed full dataset containing model reference images, in the future. Nevertheless, inference can still be performed in a reference-free setting on public datasets.
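A minimal sketch of laying out the data on disk; the internal folder structure follows the respective dataset releases, and only the path-naming requirement (stated in the Inference Pipeline section below) comes from this README:

```
mkdir -p datasets
# keep "viton" or "DressCode" in the dataset path when running on these
# public datasets, e.g.:
#   datasets/viton_hd/     (VITON-HD)
#   datasets/DressCode/    (DressCode)
```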
### Reference Data Preparation

One key feature of our method is the use of _reference data_: an image of a different person wearing the target garment is provided to help the model imagine how the target person would look in that garment. In most online shopping applications, such reference images are commonly used by customers to better visualize the clothing. However, publicly available datasets such as VITON-HD and DressCode do not include such reference data, so we generate it ourselves.

Please prepare the pretrained weights of the Flux-Kontext model and the Qwen2.5-VL-32B model. You can then generate the reference images with the following command:

```
accelerate launch --num_processes 8 --main_process_port 29500 generate_reference.py \
    --instance_data_dir "path_to_your_datasets" \
    --inference_batch_size 1 \
    --split "train" \
    --desc_path "desc.json"
```
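If you do not already have these two base models locally, they can be fetched from the Hugging Face Hub; a hedged sketch assuming the standard public repo IDs (not specified in this README; FLUX.1 Kontext dev is gated and requires accepting its license on the Hub first):

```
# download the base models used for reference generation
huggingface-cli download black-forest-labs/FLUX.1-Kontext-dev --local-dir ./FLUX.1-Kontext-dev
huggingface-cli download Qwen/Qwen2.5-VL-32B-Instruct --local-dir ./Qwen2.5-VL-32B-Instruct
```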
We provide pretrained backbone networks and LoRA weights for testing and deployment. Please download the `.safetensors` files from [here] and place them in the `checkpoints` directory.

## ⏳ Inference Pipeline

Here we provide the inference code for EVTAR.

```
accelerate launch --num_processes 8 --main_process_port 29500 inference.py \
    --pretrained_model_name_or_path="[path_to_your_Flux_model]" \
    --instance_data_dir="[your_data_directory]" \
    --output_dir="[Path_to_LoRA_weights]" \
    --mixed_precision="bf16" \
    --split="test" \
    --height=1024 \
    --width=768 \
    --inference_batch_size=1 \
    --cond_scale=2 \
    --seed="0" \
    --use_reference \
    --use_different \
    --use_person
```
- `pretrained_model_name_or_path`: Path to the downloaded Flux-Kontext model weights.
- `instance_data_dir`: Path to your dataset. For inference on VITON-HD or DressCode, ensure that the word "viton" or "DressCode" appears in the path.
- `output_dir`: Path to the downloaded or trained LoRA weights.
- `cond_scale`: Resize scale of the reference image during training. Defaults to `1.0` for $512\times384$ and `2.0` for $1024\times768$ resolution.
- `use_reference`: Whether to use a reference model image (a reference-free invocation is sketched below).
- `use_different`: **Only applicable for VITON/DressCode inference.** Whether to use different cloth-person pairs.
- `use_person`: **Only applicable for VITON/DressCode inference.** Whether to use the unmasked person image instead of the agnostic masked image as input for the virtual try-on task.
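For the reference-free setting mentioned in the Dataset section, a minimal single-GPU sketch; the assumption here is that simply omitting `--use_reference` (and the VITON/DressCode-only flags) runs the model without a reference image, with all other flags taken from the command above:

```
accelerate launch --num_processes 1 inference.py \
    --pretrained_model_name_or_path="[path_to_your_Flux_model]" \
    --instance_data_dir="[your_data_directory]" \
    --output_dir="[Path_to_LoRA_weights]" \
    --mixed_precision="bf16" \
    --split="test" \
    --height=1024 \
    --width=768 \
    --inference_batch_size=1 \
    --cond_scale=2 \
    --seed="0"
```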
## 🚀 Training Pipeline

After preparing the datasets, you can train the virtual try-on model with the following command:

```
accelerate launch --num_processes 8 --main_process_port 29501 \
    train_lora_flux_kontext_1st_stage.py \
    --pretrained_model_name_or_path="[path_to_your_Flux_model]" \
    --instance_data_dir="[path_to_your_datasets]" \
    --split="train" \
    --output_dir="[path_to_save_your_LoRA_weights]" \
    --mixed_precision="bf16" \
    --height=1024 \
    --width=768 \
    --train_batch_size=8 \
    --guidance_scale=1 \
    --gradient_checkpointing \
    --optimizer="adamw" \
    --rank=64 \
    --lora_alpha=128 \
    --use_8bit_adam \
    --learning_rate=1e-4 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --num_train_epochs=64 \
    --cond_scale=2 \
    --seed="0" \
    --dropout_reference=0.5 \
    --person_prob=0.5
```

- `cond_scale`: Scaling factor for resizing the reference image during training. Defaults to `1.0` for $512\times384$ resolution and `2.0` for $1024\times768$.
- `dropout_reference`: Probability of using reference images in each training iteration. When a reference is not selected, the iteration proceeds **without** reference images.
- `person_prob`: Probability of using unmasked person images in each training iteration. Otherwise, the iteration uses **agnostic images**, where the target clothing region is masked out.
326
## 📊 Evaluation

We quantitatively evaluate the quality of virtual try-on results using the FID, KID, SSIM, and LPIPS metrics. Here, we provide the evaluation code for the VITON-HD and DressCode datasets.

```
# Evaluation on VITON-HD dataset
CUDA_VISIBLE_DEVICES=0 python eval_dresscode.py \
    --gt_folder_base [path_to_your_ground_truth_image_folder] \
    --pred_folder_base [path_to_your_generated_image_folder] \
    --paired
```
```
# Evaluation on DressCode dataset
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --gt_folder_base [path_to_your_ground_truth_image_folder] \
    --pred_folder_base [path_to_your_generated_image_folder]
```
- `paired`: If you perform unpaired generation, where different garments are fitted onto the target person, you should enable this flag during evaluation.

Evaluation results on the VITON-HD dataset:

![examples](VITON_results.png)

Evaluation results on the DressCode dataset:

![examples](DressCode_results.png)
## 🌸 Acknowledgement

This code is mainly built upon the [Diffusers](https://github.com/huggingface/diffusers/tree/main), [Flux](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/flux), and [CatVTON](https://github.com/Zheng-Chong/CatVTON/) repositories. Thanks so much for their solid work!

## 💖 Citation

If you find this repository useful, please consider citing our paper:
```