JianhaoZeng committed on
Commit babf4a5 · verified · 1 Parent(s): 7785d0d

Update README.md

Files changed (1):
  1. README.md +163 -6

README.md CHANGED
@@ -29,33 +29,190 @@
- [![Paper](https://img.shields.io/badge/arXiv-2507.19946-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2511.18957)
- </br>
- from modelscope.hub.snapshot_download import snapshot_download
- model_dir = snapshot_download('JianhaoZeng/Eevee', cache_dir='./data')
<br/>

[![arXiv](https://img.shields.io/badge/arXiv-2511.18957-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2511.18957)
[![Hugging Face Datasets](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue)](https://huggingface.co/JianhaoZeng/Eevee)
## Abstract

> Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business's demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions. Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.

## Dataset Access

1. Set the `HF_ENDPOINT` environment variable to a mirror site for faster and more stable Hugging Face downloads:
```bash
export HF_ENDPOINT=https://hf-mirror.com
```

2. Download the dataset snapshot from Hugging Face into the local `./data` directory:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="JianhaoZeng/Eevee",
    local_dir="./data",
    repo_type="model",
)
```

3. Merge the split archive parts into a single zip file and extract the contents:
```bash
cd ./data
cat Eevee.zip.part* > Eevee.zip
unzip Eevee.zip -d ./Eevee
```
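On systems without `cat` and `unzip` (e.g. Windows), the same merge-and-extract step can be done with the Python standard library. A minimal sketch; the helper name is ours, not part of the dataset tooling:

```python
import glob
import zipfile

def merge_and_extract(pattern="Eevee.zip.part*", archive="Eevee.zip", dest="./Eevee"):
    """Concatenate the split parts in sorted order, then extract the archive."""
    parts = sorted(glob.glob(pattern))
    with open(archive, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                out.write(f.read())  # append this part's bytes to the merged zip
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)  # mirrors `unzip Eevee.zip -d ./Eevee`
    return parts
```

Sorting matters: `cat Eevee.zip.part*` relies on the shell expanding the glob in lexicographic order, and `sorted()` reproduces that here.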

## Data Organization

After extraction, the `./Eevee` directory is organized as follows:

```
|-- dresses
|   |-- 00030
|   |   |-- garment_caption.txt
|   |   |-- garment_detail.png
|   |   |-- garment_line.png
|   |   |-- garment_mask.png
|   |   |-- garment.png
|   |   |-- person_agnostic.png
|   |   |-- person_mask.png
|   |   |-- person.png
|   |   |-- video_0_agnostic_sam.mp4
|   |   |-- video_0_agnostic.mp4
|   |   |-- video_0_densepose.mp4
|   |   |-- video_0_mask.mp4
|   |   |-- video_0.mp4
|   |   |-- video_1_agnostic_sam.mp4
|   |   |-- video_1_agnostic.mp4
|   |   |-- video_1_densepose.mp4
|   |   |-- video_1_mask.mp4
|   |   |-- video_1.mp4
|   |-- 00032
|   ...
|-- lower_body
|   |-- 00003
|   |-- 00006
|   ...
|-- upper_body
|   |-- 00000
|   |-- 00001
|   ...
|-- dresses_test.csv
|-- dresses_train.csv
|-- lower_test.csv
|-- lower_train.csv
|-- upper_test.csv
|-- upper_train.csv
```
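A quick sanity check after extraction is to count the sample folders in each category directory. This helper is illustrative (not part of the dataset release), assuming the category names shown in the tree above:

```python
from pathlib import Path

def count_samples(root="./Eevee"):
    """Return the number of sample subdirectories per garment category."""
    counts = {}
    for category in ("dresses", "lower_body", "upper_body"):
        cat_dir = Path(root) / category
        # Count only subdirectories (each numbered folder is one sample).
        counts[category] = (
            sum(1 for p in cat_dir.iterdir() if p.is_dir()) if cat_dir.is_dir() else 0
        )
    return counts
```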

<table>
<thead>
<tr>
<th>File Name</th>
<th>Source</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="3"><strong>--- Garment Data ---</strong></td>
</tr>
<tr>
<td>garment.png</td><td>Raw data</td>
<td>In-shop garment image</td>
</tr>
<tr>
<td>garment_detail.png</td><td>Raw data</td>
<td>Detailed garment image</td>
</tr>
<tr>
<td>garment_caption.txt</td><td>Qwen-VL-Max</td>
<td>Detailed text description of the garment image, generated by Qwen-VL-Max</td>
</tr>
<tr>
<td>garment_line.png</td><td>AniLines</td>
<td>Line art of the garment image, generated by AniLines</td>
</tr>
<tr>
<td>garment_mask.png</td><td>Grounded SAM-2</td>
<td>Binary mask of the garment image, generated by Grounded SAM-2</td>
</tr>
<tr>
<td colspan="3"><strong>--- Person Data ---</strong></td>
</tr>
<tr>
<td>person.png</td><td>Raw data</td>
<td>Image of a person wearing the corresponding garment</td>
</tr>
<tr>
<td>person_mask.png</td><td>Grounded SAM-2</td>
<td>Binary mask of the garment area on the person image, generated by Grounded SAM-2</td>
</tr>
<tr>
<td>person_agnostic.png</td><td>Multiplication</td>
<td>Person image with the garment area masked out, generated by pixel-wise multiplication</td>
</tr>
<tr>
<td colspan="3"><strong>--- Full-shot Person Video Data ---</strong></td>
</tr>
<tr>
<td>video_0.mp4</td><td>Raw data</td>
<td>Full-shot person video</td>
</tr>
<tr>
<td>video_0_mask.mp4</td><td>OpenPose</td>
<td>Binary mask of the garment area on the full-shot person video, generated by OpenPose</td>
</tr>
<tr>
<td>video_0_agnostic.mp4</td><td>Multiplication</td>
<td>Full-shot person video with the garment area masked out, generated by pixel-wise multiplication</td>
</tr>
<tr>
<td>video_0_agnostic_sam.mp4</td><td>Grounded SAM-2</td>
<td>Full-shot person video with the garment area masked out, generated by Grounded SAM-2</td>
</tr>
<tr>
<td>video_0_densepose.mp4</td><td>Detectron2</td>
<td>DensePose UV coordinates for the human body in the full-shot person video, generated by Detectron2</td>
</tr>
<tr>
<td colspan="3"><strong>--- Close-up Person Video Data ---</strong></td>
</tr>
<tr>
<td>video_1.mp4</td><td>Raw data</td>
<td>Close-up person video</td>
</tr>
<tr>
<td>video_1_mask.mp4</td><td>Grounded SAM-2</td>
<td>Binary mask of the garment area on the close-up person video, generated by Grounded SAM-2</td>
</tr>
<tr>
<td>video_1_agnostic.mp4</td><td>Multiplication</td>
<td>Close-up person video with the garment area masked out, generated by pixel-wise multiplication</td>
</tr>
<tr>
<td>video_1_agnostic_sam.mp4</td><td>Grounded SAM-2</td>
<td>Close-up person video with the garment area masked out, generated by Grounded SAM-2</td>
</tr>
<tr>
<td>video_1_densepose.mp4</td><td>Detectron2</td>
<td>DensePose UV coordinates for the human body in the close-up person video, generated by Detectron2</td>
</tr>
</tbody>
</table>
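Putting the tree and the table together, one sample's assets can be gathered into a dict of paths. A sketch only; the key names are ours and can be adapted:

```python
from pathlib import Path

def load_sample_paths(sample_dir):
    """Map the per-sample files of one folder (e.g. dresses/00030) to keys."""
    d = Path(sample_dir)
    paths = {
        "garment": d / "garment.png",
        "garment_detail": d / "garment_detail.png",
        "garment_mask": d / "garment_mask.png",
        "person": d / "person.png",
        "full_shot_video": d / "video_0.mp4",   # full-shot try-on video
        "close_up_video": d / "video_1.mp4",    # close-up try-on video
    }
    caption_file = d / "garment_caption.txt"
    paths["caption"] = caption_file.read_text() if caption_file.exists() else ""
    return paths
```

For example, `load_sample_paths("./Eevee/dresses/00030")` returns the paths for the sample shown in the tree above.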

## Contact

If you have any questions, please reach out via email at jh_zeng@tju.edu.cn.

## Citation

If you find this work useful for your research, please cite our paper:

```
@article{zeng2025eevee,
  title={Eevee: Towards Close-up High-resolution Video-based Virtual Try-on},