Update pipeline tag and improve model card layout
This PR improves the model card by:
- Updating the `pipeline_tag` to `image-to-video` to better reflect the model's primary task of subject-driven video customization; a quick sanity check of the new metadata follows below.
- Adding direct links to the paper, project page, and GitHub repository for easier access.
- Refining the layout to clearly present the model description, data links, and visual comparisons.
- Maintaining the qualitative comparison tables provided in the original README.
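
As a quick sanity check of the metadata change, the updated frontmatter (reproduced from the diff below) parses and carries the new tag. A minimal sketch, assuming only that PyYAML is installed:

```python
import yaml  # PyYAML

# Frontmatter exactly as it appears in the updated README.md (see the diff below).
FRONT_MATTER = """\
base_model:
- Wan-AI/Wan2.1-T2V-1.3B
- Wan-AI/Wan2.2-T2V-A14B
- Wan-AI/Wan2.1-T2V-14B
datasets:
- CaiYuanhao/OmniVCus-Train
- CaiYuanhao/OmniVCus-Test
language:
- en
license: apache-2.0
pipeline_tag: image-to-video
modalities:
- video
- image
"""

meta = yaml.safe_load(FRONT_MATTER)
assert meta["pipeline_tag"] == "image-to-video"  # the tag this PR updates
assert len(meta["base_model"]) == 3              # 1.3B base plus two 14B bases
print(meta["license"], meta["pipeline_tag"])
```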
README.md CHANGED
````diff
@@ -1,15 +1,15 @@
 ---
-
+base_model:
+- Wan-AI/Wan2.1-T2V-1.3B
+- Wan-AI/Wan2.2-T2V-A14B
+- Wan-AI/Wan2.1-T2V-14B
 datasets:
 - CaiYuanhao/OmniVCus-Train
 - CaiYuanhao/OmniVCus-Test
 language:
 - en
-
-base_model:
-- Wan-AI/Wan2.2-T2V-A14B
-- Wan-AI/Wan2.1-T2V-14B
-pipeline_tag: video-to-video
+license: apache-2.0
+pipeline_tag: image-to-video
 modalities:
 - video
 - image
@@ -19,13 +19,27 @@ arxiv: 2506.23361
 
 # [NeurIPS 2025] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
 
+This repository contains the weights for **OmniVCus**, a diffusion Transformer framework presented at NeurIPS 2025 for subject-driven video customization with multimodal control conditions.
+
+[[Paper](https://arxiv.org/abs/2506.23361)] [[Project Page](https://caiyuanhao1998.github.io/project/OmniVCus/)] [[GitHub Code](https://github.com/caiyuanhao1998/Open-OmniVCus)]
+
+## Introduction
+
+OmniVCus is a feedforward subject-driven video customization framework that supports multimodal control conditions. It allows users to control and edit subjects in customized videos using signals such as depth maps, masks, camera motion, and text prompts.
+
+Key innovations include:
+- **VideoCus-Factory**: A data construction pipeline to produce training pairs for multi-subject customization from raw videos.
+- **Diffusion Transformer Framework**: Incorporates *Lottery Embedding (LE)* for better subject generalization and *Temporally Aligned Embedding (TAE)* for guidance from control signals like depth and masks.
+
 ## Model Description
 
-
-
-
+The released models support multi-modal control video customization tasks, including reference-to-video, reference-mask-to-video, reference-depth-to-video, and reference-instruction-to-video generation. These models are built upon **Wan2.1** (1.3B and 14B) and **Wan2.2** (14B) architectures.
+
+### Qualitative Comparisons
 
-
+Below are some comparisons with the state-of-the-art method VACE on video customization:
+
+#### (a) 2.1-1.3B model
 
 <p align="center">
 <table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">
@@ -95,9 +109,7 @@ and VACE. Here are some comparisons with the state-of-the-art method VACE on vid
 </table>
 </p>
 
-
-
-· (b) 2.1-14B model
+#### (b) 2.1-14B model
 
 <p align="center">
 <table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">
@@ -132,157 +144,28 @@ and VACE. Here are some comparisons with the state-of-the-art method VACE on vid
 <td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.1-14B</td>
 <td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.1-14B (Ours)</td>
 </tr>
-
-
-<!-- ===== Row 2 Prompt ===== -->
-<tr>
-<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
-(b2) a boy in a medical gown and hairnet in a hospital room
-</td>
-</tr>
-<tr>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65.png" width="400">
-</td>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65_mask.gif" width="400">
-</td>
-</tr>
-<tr>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">Mask Video</td>
-</tr>
-<tr>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65.gif" width="400">
-</td>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.1/65_our.gif" width="400">
-</td>
-</tr>
-<tr>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.1-14B</td>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.1-14B (Ours)</td>
-</tr>
-</table>
-</p>
-
-
-
-· (c) 2.2-14B model
-
-<p align="center">
-<table border="0" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:0;">
-
-<!-- ===== Row 1 Prompt ===== -->
-<tr>
-<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
-(c1) a boy looking into an open refrigerator, with tomatoes and a bottle of water on the floor
-</td>
-</tr>
-<tr>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27.png" width="400">
-</td>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27_depth.gif" width="400">
-</td>
-</tr>
-<tr>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">Depth Video</td>
-</tr>
-<tr>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27.gif" width="400">
-</td>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/27_our.gif" width="400">
-</td>
-</tr>
-<tr>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.2-14B</td>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.2-14B (Ours)</td>
-</tr>
-
-
-<!-- ===== Row 2 Prompt ===== -->
-<tr>
-<td colspan="2" align="center" style="border:0;padding:6px 10px;font-style:italic;">
-(c2) a woman standing in a room
-</td>
-</tr>
-<tr>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54.png" width="400">
-</td>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54_mask.gif" width="400">
-</td>
-</tr>
-<tr>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">Reference Image</td>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">Mask Video</td>
-</tr>
-<tr>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54.gif" width="400">
-</td>
-<td style="border:0;padding:10px;">
-<img src="https://raw.githubusercontent.com/caiyuanhao1998/Open-OmniVCus/master/DiffSynth-Studio/gif_demo/14B_2.2/54_our.gif" width="400">
-</td>
-</tr>
-<tr>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">VACE-2.2-14B</td>
-<td align="center" style="border:0;padding-top:6px;font-weight:700;">OmniVCus-2.2-14B (Ours)</td>
-</tr>
 </table>
 </p>
 
-##
-
-Please refer to our GitHub repo for more detailed instructions on using our code and models.
-
-https://github.com/caiyuanhao1998/Open-OmniVCus
-
-
-## Training Data Link
-
-Our models are trained on our curated dataset:
-
-https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Train
-
-
-## Testing Data Link
-
-We provide 648 data samples to test our models
-
-https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Test
-
-
-## Project Page Link
-
-For more video customization results, please refer to our project page:
-
-https://caiyuanhao1998.github.io/project/OmniVCus/
-
-
-## Arxiv Paper Link
-
-For more technical details, please refer to our NeurIPS 2025 paper:
-
-https://arxiv.org/abs/2506.23361
+## Links
 
+- **GitHub Repository**: [Open-OmniVCus](https://github.com/caiyuanhao1998/Open-OmniVCus)
+- **Training Data**: [OmniVCus-Train](https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Train)
+- **Testing Data**: [OmniVCus-Test](https://huggingface.co/datasets/CaiYuanhao/OmniVCus-Test)
+- **Project Page**: [OmniVCus Project](https://caiyuanhao1998.github.io/project/OmniVCus/)
+- **Arxiv Paper**: [2506.23361](https://arxiv.org/abs/2506.23361)
 
 ## Citation
 
 If you find our code, data, and models useful, please consider citing our paper:
 
-```
+```bibtex
 @inproceedings{omnivcus,
 title={OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions},
 author={Yuanhao Cai and He Zhang and Xi Chen and Jinbo Xing and Kai Zhang and Yiwei Hu and Yuqian Zhou and Zhifei Zhang and Soo Ye Kim and Tianyu Wang and Yulun Zhang and Xiaokang Yang and Zhe Lin and Alan Yuille},
 booktitle={NeurIPS},
 year={2025}
 }
-```
+```
+
+`Acknowledgments:` Our code is built upon and inspired by [Wan2.1](https://github.com/Wan-Video/Wan2.1), [Wan2.2](https://github.com/Wan-Video/Wan2.2), [VACE](https://github.com/ali-vilab/VACE), [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio), [SAM2](https://github.com/facebookresearch/sam2), [Depth-Anything-V2](https://github.com/DepthAnything/Depth-Anything-V2), [Video-Depth-Anything](https://github.com/DepthAnything/Video-Depth-Anything), and [CoTracker3](https://github.com/facebookresearch/co-tracker).
````
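
To try the checkpoints behind this card, the weights can be fetched with `huggingface_hub`. A minimal sketch; note that the repo id below is a placeholder, since the card itself never names the model repository it lives in:

```python
from huggingface_hub import snapshot_download

# Placeholder id -- the card does not state its own repository; substitute the
# actual model repo that hosts these weights.
MODEL_REPO = "CaiYuanhao/OmniVCus"  # hypothetical repo id

local_dir = snapshot_download(repo_id=MODEL_REPO)
print(f"Weights downloaded to {local_dir}")
```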
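
Similarly, the training and test sets linked under `## Links` in the updated card are ordinary dataset repositories, so they can be fetched the same way; a sketch, assuming the datasets are not gated:

```python
from huggingface_hub import snapshot_download

# Dataset repos linked in the updated card.
train_dir = snapshot_download(repo_id="CaiYuanhao/OmniVCus-Train", repo_type="dataset")
test_dir = snapshot_download(repo_id="CaiYuanhao/OmniVCus-Test", repo_type="dataset")  # 648 test samples
print(train_dir, test_dir)
```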
|