nielsr (HF Staff) committed
Commit fb9f289 · verified · 1 Parent(s): a0fe5ab

Update pipeline tag and improve model card layout

This PR improves the model card by:
- Updating the `pipeline_tag` to `image-to-video` to better reflect the model's primary task of subject-driven video customization (the resulting metadata block is sketched below).
- Adding direct links to the paper, project page, and GitHub repository for easier access.
- Refining the layout to clearly present the model description, data links, and visual comparisons.
- Maintaining the qualitative comparison tables provided in the original README.
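
For reference, the metadata block after this change reads as follows (reconstructed from the diff below; fields outside the changed hunks are not shown):

```yaml
---
base_model:
- Wan-AI/Wan2.1-T2V-1.3B
- Wan-AI/Wan2.2-T2V-A14B
- Wan-AI/Wan2.1-T2V-14B
datasets:
- CaiYuanhao/OmniVCus-Train
- CaiYuanhao/OmniVCus-Test
language:
- en
license: apache-2.0
pipeline_tag: image-to-video
modalities:
- video
- image
# (further fields, e.g. the arxiv entry, are unchanged and not shown in the diff)
```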

Files changed (1)
  1. README.md +35 -152
README.md CHANGED
@@ -1,15 +1,15 @@
  ---
- license: apache-2.0
+ base_model:
+ - Wan-AI/Wan2.1-T2V-1.3B
+ - Wan-AI/Wan2.2-T2V-A14B
+ - Wan-AI/Wan2.1-T2V-14B
  datasets:
  - CaiYuanhao/OmniVCus-Train
  - CaiYuanhao/OmniVCus-Test
  language:
  - en
- base_model:
- - Wan-AI/Wan2.1-T2V-1.3B
- - Wan-AI/Wan2.2-T2V-A14B
- - Wan-AI/Wan2.1-T2V-14B
- pipeline_tag: video-to-video
+ license: apache-2.0
+ pipeline_tag: image-to-video
  modalities:
  - video
  - image
@@ -19,13 +19,27 @@ arxiv: 2506.23361

  # [NeurIPS 2025] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

+ This repository contains the weights for **OmniVCus**, a diffusion Transformer framework presented at NeurIPS 2025 for subject-driven video customization with multimodal control conditions.
+
+ [[Paper](https://arxiv.org/abs/2506.23361)] [[Project Page](https://caiyuanhao1998.github.io/project/OmniVCus/)] [[GitHub Code](https://github.com/caiyuanhao1998/Open-OmniVCus)]
+
+ ## Introduction
+
+ OmniVCus is a feedforward subject-driven video customization framework that supports multimodal control conditions. It allows users to control and edit subjects in customized videos using signals such as depth maps, masks, camera motion, and text prompts.
+
+ Key innovations include:
+ - **VideoCus-Factory**: A data construction pipeline to produce training pairs for multi-subject customization from raw videos.
+ - **Diffusion Transformer Framework**: Incorporates *Lottery Embedding (LE)* for better subject generalization and *Temporally Aligned Embedding (TAE)* for guidance from control signals like depth and masks.
+
  ## Model Description

- These three models support multi-modal control video customization tasks, including reference-to-video, reference-mask-to-video,
- reference-depth-to-video, and reference-instruction-to-video generation. Our models are based on Wan2.1-1.3B, Wan2.1-14B, Wan2.2-14B,
- and VACE. Here are some comparisons with the state-of-the-art method VACE on video customization:
+ The released models support multi-modal control video customization tasks, including reference-to-video, reference-mask-to-video, reference-depth-to-video, and reference-instruction-to-video generation. These models are built upon **Wan2.1** (1.3B and 14B) and **Wan2.2** (14B) architectures.
+
+ ### Qualitative Comparisons

- · (a) 2.1-1.3B model
+ Below are some comparisons with the state-of-the-art method VACE on video customization:
+
+ #### (a) 2.1-1.3B model

  <p align="center">
  <table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">
@@ -95,9 +109,7 @@ and VACE. Here are some comparisons with the state-of-the-art method VACE on vid
  </table>
  </p>

-
-
- · (b) 2.1-14B model
+ #### (b) 2.1-14B model

  <p align="center">
  <table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">
@@ -132,157 +144,28 @@ and VACE. Here are some comparisons with the state-of-the-art method VACE on vid
  <td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.1-14B</td>
  <td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.1-14B (Ours)</td>
  </tr>
-
-
- <!-- ===== Row 2 Prompt ===== -->
- <tr>
- <td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
- (b2) a boy in a medical gown and hairnet in a hospital room
- </td>
- </tr>
- <tr>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65.png" width="400">
- </td>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65_mask.gif" width="400">
- </td>
- </tr>
- <tr>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">Mask Video</td>
- </tr>
- <tr>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65.gif" width="400">
- </td>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65_our.gif" width="400">
- </td>
- </tr>
- <tr>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.1-14B</td>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.1-14B (Ours)</td>
- </tr>
- </table>
- </p>
-
-
-
- · (c) 2.2-14B model
-
- <p align="center">
- <table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">
-
- <!-- ===== Row 1 Prompt ===== -->
- <tr>
- <td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
- (c1) a boy looking into an open refrigerator, with tomatoes and a bottle of water on the floor
- </td>
- </tr>
- <tr>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27.png" width="400">
- </td>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27_depth.gif" width="400">
- </td>
- </tr>
- <tr>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">Depth Video</td>
- </tr>
- <tr>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27.gif" width="400">
- </td>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27_our.gif" width="400">
- </td>
- </tr>
- <tr>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.2-14B</td>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.2-14B (Ours)</td>
- </tr>
-
-
- <!-- ===== Row 2 Prompt ===== -->
- <tr>
- <td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
- (c2) a woman standing in a room
- </td>
- </tr>
- <tr>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54.png" width="400">
- </td>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54_mask.gif" width="400">
- </td>
- </tr>
- <tr>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">Mask Video</td>
- </tr>
- <tr>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54.gif" width="400">
- </td>
- <td style="border:0;padding:10px;">
- <img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54_our.gif" width="400">
- </td>
- </tr>
- <tr>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.2-14B</td>
- <td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.2-14B (Ours)</td>
- </tr>
  </table>
  </p>

- ## Github Code Link
-
- Please refer to our GitHub repo for more detailed instructions on using our code and models.
-
- https://github.com/caiyuanhao1998/Open-OmniVCus
-
-
- ## Training Data Link
-
- Our models are trained on our curated dataset:
-
- https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Train
-
-
- ## Testing Data Link
-
- We provide 648 data samples to test our models
-
- https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Test
-
-
- ## Project Page Link
-
- For more video customization results, please refer to our project page:
-
- https://caiyuanhao1998.github.io/project/OmniVCus/
-
-
- ## Arxiv Paper Link
-
- For more technical details, please refer to our NeurIPS 2025 paper:
-
- https://arxiv.org/abs/2506.23361
+ ## Links

+ - **GitHub Repository**: [Open-OmniVCus](https://github.com/caiyuanhao1998/Open-OmniVCus)
+ - **Training Data**: [OmniVCus-Train](https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Train)
+ - **Testing Data**: [OmniVCus-Test](https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Test)
+ - **Project Page**: [OmniVCus Project](https://caiyuanhao1998.github.io/project/OmniVCus/)
+ - **Arxiv Paper**: [2506.23361](https://arxiv.org/abs/2506.23361)

  ## Citation

  If you find our code, data, and models useful, please consider citing our paper:

- ```sh
+ ```bibtex
  @inproceedings{omnivcus,
  title={OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions},
  author={Yuanhao Cai and He Zhang and Xi Chen and Jinbo Xing and Kai Zhang and Yiwei Hu and Yuqian Zhou and Zhifei Zhang and Soo Ye Kim and Tianyu Wang and Yulun Zhang and Xiaokang Yang and Zhe Lin and Alan Yuille},
  booktitle={NeurIPS},
  year={2025}
  }
- ```
+ ```
+
+ `Acknowledgments:` Our code is built upon and inspired by [Wan2.1](https://github.com/Wan-Video/Wan2.1), [Wan2.2](https://github.com/Wan-Video/Wan2.2), [VACE](https://github.com/ali-vilab/VACE), [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio), [SAM2](https://github.com/facebookresearch/sam2), [Depth-Anything-V2](https://github.com/DepthAnything/Depth-Anything-V2), [Video-Depth-Anything](https://github.com/DepthAnything/Video-Depth-Anything), and [CoTracker3](https://github.com/facebookresearch/co-tracker).
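
For convenience, here is a minimal sketch (not part of the commit above) of fetching the datasets linked in the card with the standard `snapshot_download` API from `huggingface_hub`; the local paths are illustrative placeholders.

```python
# Download the OmniVCus training and testing data referenced in the card.
# Assumes `huggingface_hub` is installed (pip install huggingface_hub).
from huggingface_hub import snapshot_download

for repo_id in ("CaiYuanhao/OmniVCus-Train", "CaiYuanhao/OmniVCus-Test"):
    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",  # these repos are datasets, not model weights
        local_dir=f"data/{repo_id.split('/')[-1]}",  # placeholder target path
    )
```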