---
license: apache-2.0
language:
- en
pipeline_tag: image-to-image
---

  

# Person-to-Person Try-On with Additional Unpaired Visual Reference

  

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-RefVTON-ff9900?style=flat)](https://huggingface.co/qihoo360/RefVTON) [![arXiv](https://img.shields.io/badge/arXiv-2511.00956-B31B1B?style=flat)](https://arxiv.org/abs/2511.00956)

  

![examples](examples.png)

  

  

We propose **REFVTON**, an end-to-end virtual try-on model with an additional visual reference. It fits the target garment directly onto the person image while incorporating reference images, enhancing the model's ability to preserve and accurately depict clothing details.

  


## 💡 GitHub
[REFVTON](https://github.com/360CVGroup/REFVTON)




## 💡 Pretrained Models

We provide pretrained backbone networks and LoRA weights for testing and deployment. Please download the `.safetensors` files from [here] and place them in the `checkpoints` directory.

- `512_384_pytorch_lora_weights.safetensors`: 512×384-resolution high-quality virtual try-on model ✅ **Available**
- `1024_768_pytorch_lora_weights.safetensors`: 1024×768-resolution high-quality virtual try-on model ✅ **Available**
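
If helpful, here is a minimal Python sketch for fetching the LoRA weights with `huggingface_hub`. It assumes the files are hosted in the `qihoo360/RefVTON` Hugging Face repo (linked in the badge above) under the filenames listed here; adjust the repo ID or filenames if they differ.

```
from huggingface_hub import hf_hub_download

# Assumption: the LoRA weights are hosted in the qihoo360/RefVTON repo under these names.
for filename in [
    "512_384_pytorch_lora_weights.safetensors",
    "1024_768_pytorch_lora_weights.safetensors",
]:
    path = hf_hub_download(
        repo_id="qihoo360/RefVTON",
        filename=filename,
        local_dir="checkpoints",  # place the weights in the checkpoints directory
    )
    print("Downloaded:", path)
```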
  

  

## 💡 Update

  

  

- [x] [2025.10.11] Released the virtual try-on inference code and LoRA weights.

  

  

- [x] [2025.10.13] Released the technical report on arXiv.

  

  

  

## 💪 Highlight Features

  

  

  

-  **An end-to-end virtual try-on model:** It can function either as an inpainting model that places the target garment into masked regions, or as a direct garment-transfer model that works on the unmasked person image.

  

  

-  **Reference images that enhance try-on performance:** To mimic how online shoppers attend to the overall wearing effect rather than the garment alone, our model can take images of a model wearing the target clothing as additional input, which better preserves material texture and design details.

  

  

-  **Improved performance:** Our model achieves state-of-the-art performance on public benchmarks and demonstrates strong generalization to in-the-wild inputs.

  

  

  

## 🧩 Environment Setup

  

  

  

```
conda create -n REFVTON python=3.12 -y
conda activate REFVTON
pip install -r requirements.txt
```

  

  

  

## 📂 Preparation of Dataset and Pretrained Models

  

  

  

### Dataset

  

  

  

Currently, we provide a small test set with additional reference images (a different person wearing the target garment) for trying our model. We plan to release the reference data generation code, along with our full dataset containing model reference images, in the future.

  

  

Nevertheless, inference can still be performed in a reference-free setting on public benchmarks, including [VITON-HD](https://github.com/shadow2496/VITON-HD) and [DressCode](https://github.com/aimagelab/dress-code).

  

  

### Reference Data Preparation

  

  

One key feature of our method is the use of _reference data_, where an image of a different person wearing the target garment is provided to help the model imagine how the target person would look in that garment. In most online shopping applications, such additional reference images are commonly used by customers to better visualize the clothing. However, publicly available datasets such as VITON-HD and DressCode do not include such reference data, so we generate it ourselves.

  

  

  

Please prepare the pretrained weights of the Flux-Kontext and Qwen2.5-VL-32B models. You can then generate the additional reference images using the following command:

  

  

```
accelerate launch --num_processes 8 --main_process_port 29500 generate_reference.py \
  --instance_data_dir "path_to_your_datasets" \
  --inference_batch_size 1 \
  --split "train" \
  --desc_path "desc.json"
```
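
If the base model weights are not yet available locally, one option is to fetch them with `huggingface_hub`. The repo IDs below are our assumption of the intended checkpoints, so verify them against the official model cards before use (the Flux-Kontext weights may also require accepting a license on Hugging Face).

```
from huggingface_hub import snapshot_download

# Assumed repo IDs for the two base models; adjust if your setup uses different checkpoints.
flux_dir = snapshot_download("black-forest-labs/FLUX.1-Kontext-dev")  # Flux-Kontext
qwen_dir = snapshot_download("Qwen/Qwen2.5-VL-32B-Instruct")          # Qwen2.5-VL-32B
print(flux_dir, qwen_dir)
```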

  

  

  

### Pretrained Models

  

  

We provide pretrained backbone networks and LoRA weights for testing and deployment. Please download the `.safetensors` files from [here] and place them in the `checkpoints` directory.

  

  

  

## ⏳ Inference Pipeline

  

  

  

Here we provide the inference command for REFVTON.

  

  

```
accelerate launch --num_processes 8 --main_process_port 29500 inference.py \
  --pretrained_model_name_or_path="[path_to_your_Flux_model]" \
  --instance_data_dir="[your_data_directory]" \
  --output_dir="[Path_to_LoRA_weights]" \
  --mixed_precision="bf16" \
  --split="test" \
  --height=1024 \
  --width=768 \
  --inference_batch_size=1 \
  --cond_scale=2 \
  --seed="0" \
  --use_reference \
  --use_different \
  --use_person
```

  

  

-  `pretrained_model_name_or_path`: Path to the downloaded Flux-Kontext model weights.

  

  

-  `instance_data_dir`: Path to your dataset. For inference on VITON-HD or DressCode, ensure that the words "viton" or "DressCode" appear in the path.

  

  

-  `output_dir`: Path to the downloaded or trained LoRA weights.

  

  

-  `cond_scale`: Resize scale of the reference image during training. Defaults to `1.0` for $512\times384$ and `2.0` for $1024\times768$ resolution.

  

  

-  `use_reference`: Whether to use an additional reference image as input.

  

  

-  `use_different`: **Only applicable for VITON/DressCode inference.** Whether to use different cloth-person pairs.

  

  

-  `use_person`: **Only applicable for VITON/DressCode inference.** Whether to use the unmasked person image instead of the agnostic masked image as input for the virtual try-on task.

  

  

## 📊 Evaluation

  

  

We quantitatively evaluate the quality of virtual try-on results using FID, KID, SSIM, and LPIPS. Here, we provide the evaluation commands for the VITON-HD and DressCode datasets; a standalone Python sketch of the same metrics is given after the flag description below.

  

```
# Evaluation on VITON-HD dataset
CUDA_VISIBLE_DEVICES=0 python eval_dresscode.py \
  --gt_folder_base [path_to_your_ground_truth_image_folder] \
  --pred_folder_base [path_to_your_generated_image_folder] \
  --paired
```

  

  

  

```
# Evaluation on DressCode dataset
CUDA_VISIBLE_DEVICES=0 python eval.py \
  --gt_folder_base [path_to_your_ground_truth_image_folder] \
  --pred_folder_base [path_to_your_generated_image_folder]
```

  

  

-  `paired`: If you perform unpaired generation, where different garments are fitted onto the target person, you should enable this flag during evaluation.
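
Here is the standalone metric sketch mentioned above, using `torchmetrics`. It is an illustration only, not the repository's evaluation code: it assumes the ground-truth and generated folders contain same-named, same-sized RGB images, and the folder paths are placeholders.

```
# Illustrative metric sketch with torchmetrics (not the repo's eval scripts).
# Assumes gt_dir and pred_dir hold aligned, same-named RGB images of equal size.
import os
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor
from torchmetrics.image import (
    FrechetInceptionDistance,
    KernelInceptionDistance,
    StructuralSimilarityIndexMeasure,
    LearnedPerceptualImagePatchSimilarity,
)

gt_dir, pred_dir = "path/to/ground_truth", "path/to/generated"  # placeholder paths

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)  # subset_size must not exceed the image count
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

def load(folder, name):
    # uint8 tensor of shape (1, 3, H, W)
    return pil_to_tensor(Image.open(os.path.join(folder, name)).convert("RGB")).unsqueeze(0)

for name in sorted(os.listdir(gt_dir)):
    real, fake = load(gt_dir, name), load(pred_dir, name)
    fid.update(real, real=True)
    fid.update(fake, real=False)
    kid.update(real, real=True)
    kid.update(fake, real=False)
    real_f, fake_f = real.float() / 255.0, fake.float() / 255.0  # float images in [0, 1]
    ssim.update(fake_f, real_f)
    lpips.update(fake_f, real_f)

kid_mean, kid_std = kid.compute()
print(f"FID:   {fid.compute():.4f}")
print(f"KID:   {kid_mean:.4f} +/- {kid_std:.4f}")
print(f"SSIM:  {ssim.compute():.4f}")
print(f"LPIPS: {lpips.compute():.4f}")
```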

  
  

Evaluation results on the VITON-HD dataset:

![examples](VITON_results.png)

  
  

Evaluation results on the DressCode dataset:

![examples](DressCode_results.png)

  

## 🌸 Acknowledgement

  

  

This code is mainly built upon the [Diffusers](https://github.com/huggingface/diffusers/tree/main), [Flux](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/flux), and [CatVTON](https://github.com/Zheng-Chong/CatVTON/) repositories. Thanks so much for their solid work!

  

  

  

## 💖 Citation


If you find this repository useful, please consider citing our paper:
```
@misc{li2025evtarendtoendtryadditional,
title={EVTAR: End-to-End Try on with Additional Unpaired Visual Reference},
author={Liuzhuozheng Li and Yue Gong and Shanyuan Liu and Bo Cheng and Yuhang Ma and Liebucha Wu and Dengyang Jiang and Zanyi Wang and Dawei Leng and Yuhui Yin},
year={2025},
eprint={2511.00956},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.00956},
}
```