---
license: apache-2.0
datasets:
- CaiYuanhao/OmniVCus-Train
- CaiYuanhao/OmniVCus-Test
language:
- en
base_model:
- Wan-AI/Wan2.1-T2V-1.3B
- Wan-AI/Wan2.2-T2V-A14B
- Wan-AI/Wan2.1-T2V-14B
pipeline_tag: video-to-video
modalities:
- video
- image
- text
arxiv: 2506.23361
---

# [NeurIPS 2025] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

## Model Description

These three models support multimodal-control video customization tasks, including reference-to-video, reference-mask-to-video, reference-depth-to-video, and reference-instruction-to-video generation. Our models are based on Wan2.1-1.3B, Wan2.1-14B, Wan2.2-14B, and VACE. Below are some comparisons with the state-of-the-art method VACE on video customization:

· (a) 2.1-1.3B model

<p align="center">
<table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">

<!-- ===== Row 1 Prompt ===== -->
<tr>
<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
(a1) a woman rolling up a fitted sheet
</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/1.3B/26.png" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/1.3B/26_depth.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Depth Video</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/1.3B/26.gif" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/1.3B/26_our.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.1-1.3B</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.1-1.3B (Ours)</td>
</tr>

<!-- ===== Row 2 Prompt ===== -->
<tr>
<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
(a2) a church in the winter
</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/1.3B/32.png" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/1.3B/32_mask.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Mask Video</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/1.3B/32.gif" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/1.3B/32_our.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.1-1.3B</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.1-1.3B (Ours)</td>
</tr>
</table>
</p>

· (b) 2.1-14B model

<p align="center">
<table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">

<!-- ===== Row 1 Prompt ===== -->
<tr>
<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
(b1) a man holding a piece of paper in his hands
</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/33.png" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/33_depth.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Depth Video</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/33.gif" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/33_our.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.1-14B</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.1-14B (Ours)</td>
</tr>

<!-- ===== Row 2 Prompt ===== -->
<tr>
<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
(b2) a boy in a medical gown and hairnet in a hospital room
</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65.png" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65_mask.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Mask Video</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65.gif" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65_our.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.1-14B</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.1-14B (Ours)</td>
</tr>
</table>
</p>

· (c) 2.2-14B model

<p align="center">
<table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">

<!-- ===== Row 1 Prompt ===== -->
<tr>
<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
(c1) a boy looking into an open refrigerator, with tomatoes and a bottle of water on the floor
</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27.png" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27_depth.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Depth Video</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27.gif" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27_our.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.2-14B</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.2-14B (Ours)</td>
</tr>

<!-- ===== Row 2 Prompt ===== -->
<tr>
<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
(c2) a woman standing in a room
</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54.png" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54_mask.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">Mask Video</td>
</tr>
<tr>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54.gif" width="400">
</td>
<td style="border:0;padding:10px;">
<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54_our.gif" width="400">
</td>
</tr>
<tr>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.2-14B</td>
<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.2-14B (Ours)</td>
</tr>
</table>
</p>

## GitHub Code Link

Please refer to our GitHub repo for more detailed instructions on using our code and models:

https://github.com/caiyuanhao1998/Open-OmniVCus

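If you prefer to fetch the code programmatically, here is a minimal sketch (not part of the official instructions) that clones the repository with Python's standard library. It assumes `git` is available on PATH, and the checkout path is a hypothetical placeholder; environment setup and inference commands are documented in the repository itself.

```python
# Minimal sketch: clone the OmniVCus code repository.
# Assumes `git` is installed and on PATH; see the repository's own README
# for environment setup and inference instructions.
import subprocess
from pathlib import Path

checkout_dir = Path("Open-OmniVCus")  # hypothetical checkout location
if not checkout_dir.exists():
    subprocess.run(
        ["git", "clone", "https://github.com/caiyuanhao1998/Open-OmniVCus.git", str(checkout_dir)],
        check=True,
    )
print(f"Code available at {checkout_dir.resolve()}")
```
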
## Training Data Link

Our models are trained on our curated dataset:

https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Train

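As a convenience, the hedged sketch below shows one way to pull the training set locally with the `huggingface_hub` client; it is not an official loader, and the `local_dir` path is a hypothetical placeholder.

```python
# Minimal sketch: download the OmniVCus training dataset from the Hugging Face Hub.
# Assumes `pip install huggingface_hub`; `local_dir` is a hypothetical path.
from huggingface_hub import snapshot_download

train_dir = snapshot_download(
    repo_id="CaiYuanhao/OmniVCus-Train",
    repo_type="dataset",
    local_dir="data/OmniVCus-Train",  # hypothetical destination
)
print(f"Training data downloaded to {train_dir}")
```
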
## Testing Data Link

We provide 648 data samples for testing our models:

https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Test

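A similar hedged sketch for the test split: it first lists the files in the dataset repo as a quick sanity check and then downloads them. The exact file layout of the 648 samples is not described here, and the `local_dir` path is again a placeholder.

```python
# Minimal sketch: inspect and download the OmniVCus test dataset.
# Assumes `pip install huggingface_hub`; `local_dir` is a hypothetical path.
from huggingface_hub import list_repo_files, snapshot_download

# Quick sanity check: how many files are in the test repo.
files = list_repo_files("CaiYuanhao/OmniVCus-Test", repo_type="dataset")
print(f"{len(files)} files found in CaiYuanhao/OmniVCus-Test")

test_dir = snapshot_download(
    repo_id="CaiYuanhao/OmniVCus-Test",
    repo_type="dataset",
    local_dir="data/OmniVCus-Test",  # hypothetical destination
)
print(f"Test data downloaded to {test_dir}")
```
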
## Project Page Link

For more video customization results, please refer to our project page:

https://caiyuanhao1998.github.io/project/OmniVCus/


## arXiv Paper Link

For more technical details, please refer to our NeurIPS 2025 paper:

https://arxiv.org/abs/2506.23361

## Citation

If you find our code, data, and models useful, please consider citing our paper:

```bibtex
@inproceedings{omnivcus,
  title={OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions},
  author={Yuanhao Cai and He Zhang and Xi Chen and Jinbo Xing and Kai Zhang and Yiwei Hu and Yuqian Zhou and Zhifei Zhang and Soo Ye Kim and Tianyu Wang and Yulun Zhang and Xiaokang Yang and Zhe Lin and Alan Yuille},
  booktitle={NeurIPS},
  year={2025}
}
```