---
license: apache-2.0
datasets:
- BianYx/VAP-Data
language:
- en
base_model:
- zai-org/CogVideoX-5b-I2V
pipeline_tag: image-to-video
library_name: diffusers
---

<div align="center">

# Video-As-Prompt: Unified Semantic Control for Video Generation

</div>

<div align="center">
<a href=https://bytedance.github.io/Video-As-Prompt target="_blank"><img src=https://img.shields.io/badge/Project%20Page-333399.svg?logo=homepage height=22px></a>
<a href=https://huggingface.co/collections/ByteDance/video-as-prompt target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Models-d96902.svg height=22px></a>
<a href=https://huggingface.co/datasets/BianYx/VAP-Data target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-276cb4.svg height=22px></a>
<a href=https://github.com/bytedance/Video-As-Prompt target="_blank"><img src=https://img.shields.io/badge/Code-black.svg?logo=github height=22px></a>
<a href=https://yxbian23.github.io/ target="_blank"><img src=https://img.shields.io/badge/Arxiv-b5212f.svg?logo=arxiv height=22px></a>
<!-- <a href=https://yxbian23.github.io/ target="_blank"><img src=https://img.shields.io/badge/Twitter-grey.svg?logo=x height=22px></a> -->
<!-- <a href="https://opensource.org/licenses/Apache">
  <img src="https://img.shields.io/badge/License-Apache%202.0-lightgray">
</a> -->
<a href="https://yxbian23.github.io/" target="_blank">
  <img src="https://img.shields.io/badge/%E2%96%B6%20YouTube%20Demo-FF0000.svg?logo=youtube&logoColor=white" height="24px">
</a>
</div>

<br>

## 🔥 News

- Oct 24, 2025: 🎉 We release the first unified semantic video generation model, [Video-As-Prompt (VAP)](https://github.com/bytedance/Video-As-Prompt)!
- Oct 24, 2025: 🤗 We release [VAP-Data](https://huggingface.co/datasets/BianYx/VAP-Data), the largest semantic-controlled video generation dataset, with more than 100K samples!
- Oct 24, 2025: 📃 We present the [technical report](https://yxbian23.github.io/) of Video-As-Prompt. Please check out the details and join the discussion!

## 🎞️ **Video-As-Prompt**

> **Core idea:** Given a reference video carrying the desired semantics as a video prompt, Video-As-Prompt animates a reference image with the same semantics as the reference video.

<p align="center">
  <video
    controls
    autoplay
    playsinline
    muted
    loop
    src="https://github.com/user-attachments/assets/2e440927-5b16-4761-ad1f-46ac93de2d8e"
    width="60%"
  >
    Your browser does not support HTML5 video. Here is a <a href="https://github.com/user-attachments/assets/2e440927-5b16-4761-ad1f-46ac93de2d8e">link to the video</a> instead.
  </video>
  <br>
  <em>E.g., Different Reference Videos + Same Reference Image → New Videos with Different Semantics</em>
</p>

> **See our [project page](https://bytedance.github.io/Video-As-Prompt) for more interesting results!**

## 📦 Models Zoo

To demonstrate cross-architecture generality, **Video-As-Prompt** provides two variants, each with distinct trade-offs:

* **`CogVideoX-I2V-5B`**

  * **Strengths:** Fewer backbone parameters let us train more steps under limited resources, yielding strong stability on most semantic conditions.
  * **Limitations:** Due to the limited capability of the backbone, it is weaker on human-centric generation and on concepts underrepresented in pretraining (e.g., *ladudu*, *Squid Game*, *Minecraft*).

* **`Wan2.1-I2V-14B`**

  * **Strengths:** Strong performance on human actions and novel concepts, thanks to a more capable base model.
  * **Limitations:** The larger model size reduces the number of feasible training steps given our resources, lowering stability on some semantic conditions.

> 🙌🙌🙌 Contributions and further optimization from the community are welcome.

| Model | Date | Size | Hugging Face |
|----------------------------|------------|------|-------------------------------------------------------------------------------------------|
| Video-As-Prompt (CogVideoX-I2V-5B) | 2025-10-15 | 5B (Pretrained DiT) + 5B (VAP) | [Download](https://huggingface.co/ByteDance/Video-As-Prompt-CogVideoX-5B) |
| Video-As-Prompt (Wan2.1-I2V-14B) | 2025-10-15 | 14B (Pretrained DiT) + 5B (VAP) | [Download](https://huggingface.co/ByteDance/Video-As-Prompt-Wan2.1-14B) |

Please download the pre-trained video DiTs and our corresponding Video-As-Prompt models, and structure them as follows:
```
ckpts/
├── Video-As-Prompt-CogVideoX-5B/
│   ├── scheduler
│   ├── vae
│   ├── transformer
│   └── ...
└── Video-As-Prompt-Wan2.1-14B/
    ├── scheduler
    ├── vae
    ├── transformer
    └── ...
```
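
A minimal download sketch using the Hugging Face CLI (assuming `huggingface_hub[cli]` is installed; adjust the target paths if your layout differs from the tree above):

```bash
# Download both Video-As-Prompt variants into ckpts/
huggingface-cli download ByteDance/Video-As-Prompt-CogVideoX-5B --local-dir ckpts/Video-As-Prompt-CogVideoX-5B
huggingface-cli download ByteDance/Video-As-Prompt-Wan2.1-14B --local-dir ckpts/Video-As-Prompt-Wan2.1-14B
```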

## 🤗 Get Started with Video-As-Prompt

Video-As-Prompt supports macOS, Windows, and Linux. Follow the steps below to get started:

### Install Requirements

We test our model with Python 3.10 and PyTorch 2.7.1+cu124.

```bash
conda create -n video_as_prompt python=3.10 -y
conda activate video_as_prompt
pip install -r requirements.txt
pip install -e ./diffusers
conda install -c conda-forge ffmpeg -y
```
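
Optionally, a quick sanity check (a sketch, not part of the official setup) that PyTorch sees your GPU and the bundled diffusers fork imports correctly:

```bash
python -c "import torch, diffusers; print(torch.__version__, torch.cuda.is_available())"
```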

### Data

We have published the VAP-Data dataset used in our paper at [VAP-Data](https://huggingface.co/datasets/BianYx/VAP-Data). Please download it and put it in the `data` folder. The structure should look like:
```
data/
└── VAP-Data/
    ├── vfx_videos/
    ├── vfx_videos_hq/
    ├── vfx_videos_hq_camera/
    ├── benchmark/benchmark.csv
    └── vap_data.csv
```
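
As with the model weights, the dataset can be fetched with the Hugging Face CLI. A minimal sketch matching the layout above (note the dataset repo type):

```bash
huggingface-cli download BianYx/VAP-Data --repo-type dataset --local-dir data/VAP-Data
```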

### Code Usage

We implement our code mainly on top of [diffusers](https://github.com/huggingface/diffusers) and [finetrainers](https://github.com/huggingface/finetrainers), thanks to their modular design.

#### Minimal Demo

Below is a minimal demo of our CogVideoX-I2V-5B variant. The full code can be found in [infer/cog_vap.py](infer/cog_vap.py). The Wan2.1-I2V-14B variant is similar and can be found in [infer/wan_vap.py](infer/wan_vap.py).
```python
import torch
from diffusers import (
    AutoencoderKLCogVideoX,
    CogVideoXImageToVideoMOTPipeline,
    CogVideoXTransformer3DMOTModel,
)
from diffusers.utils import export_to_video, load_video
from PIL import Image

# Load the VAE, the Video-As-Prompt transformer, and the full pipeline in bf16.
vae = AutoencoderKLCogVideoX.from_pretrained("ByteDance/Video-As-Prompt-CogVideoX-5B", subfolder="vae", torch_dtype=torch.bfloat16)
transformer = CogVideoXTransformer3DMOTModel.from_pretrained("ByteDance/Video-As-Prompt-CogVideoX-5B", torch_dtype=torch.bfloat16)
pipe = CogVideoXImageToVideoMOTPipeline.from_pretrained(
    "ByteDance/Video-As-Prompt-CogVideoX-5B", vae=vae, transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")

# Reference video (the semantic prompt) and the reference image to animate.
ref_video = load_video("assets/videos/demo/object-725.mp4")
image = Image.open("assets/images/demo/animal-2.jpg").convert("RGB")
# Evenly sample 49 frames from the reference video to match num_frames.
idx = torch.linspace(0, len(ref_video) - 1, 49).long().tolist()
ref_frames = [ref_video[i] for i in idx]

output_frames = pipe(
    image=image,
    ref_videos=[ref_frames],
    prompt="A chestnut-colored horse stands on a grassy hill against a backdrop of distant, snow-dusted mountains. The horse begins to inflate, its defined, muscular body swelling and rounding into a smooth, balloon-like form while retaining its rich, brown hide color. Without changing its orientation, the now-buoyant horse lifts silently from the ground. It begins a steady vertical ascent, rising straight up and eventually floating out of the top of the frame. The camera remains completely static throughout the entire sequence, holding a fixed shot on the landscape as the horse transforms and departs, ensuring the verdant hill and mountain range in the background stay perfectly still.",
    prompt_mot_ref=[
        "A hand holds up a single beige sneaker decorated with gold calligraphy and floral illustrations, with small green plants tucked inside. The sneaker immediately begins to inflate like a balloon, its shape distorting as the decorative details stretch and warp across the expanding surface. It rapidly transforms into a perfectly smooth, matte beige sphere, inheriting the primary color from the original shoe. Once the transformation is complete, the new balloon-like object quickly ascends, moving straight up and exiting the top of the frame. The camera remains completely static and the plain white background is unchanged throughout the entire sequence."
    ],
    height=480,
    width=720,
    num_frames=49,
    frames_selection="evenly",
    use_dynamic_cfg=True,
).frames[0]

# Save the generated frames to disk.
export_to_video(output_frames, "output.mp4", fps=8)
```
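
If GPU memory is tight, the standard diffusers offloading hooks should also apply here, assuming the MOT pipeline follows the usual `DiffusionPipeline` API (an assumption on our part; call these in place of `.to("cuda")`):

```python
# Trade speed for memory: offload submodules to CPU between forward passes
# and decode the VAE in tiles (both standard diffusers options).
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
```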

#### Benchmark Inference

You can also run the following scripts for benchmark inference, then use [VBench](https://github.com/Vchitect/VBench) to evaluate the results.

```bash
python infer/cog_vap_bench.py
python infer/wan_vap_bench.py
```
> Feel free to modify the scripts to explore more results on our VAP-Data dataset, or even on in-the-wild reference videos and images.

#### Training

Pick a recipe, then run the corresponding script. Each script sets sensible defaults; override as needed.

**Recipes: CogVideoX-I2V-5B**

| Goal | Nodes | Objective | References / sample | Script |
| ----------------------- | ----- | --------- | ------------------- | ------------------------------------------------------------------- |
| Standard SFT | 1 | SFT | 1 | `examples/training/sft/cogvideox/vap_mot/train_single_node.sh` |
| Standard SFT | ≥2 | SFT | 1 | `examples/training/sft/cogvideox/vap_mot/train_multi_node.sh` |
| Preference optimization | 1 | DPO | 1 | `examples/training/sft/cogvideox/vap_mot/train_single_node_dpo.sh` |
| Preference optimization | ≥2 | DPO | 1 | `examples/training/sft/cogvideox/vap_mot/train_multi_node_dpo.sh` |
| Multi-reference SFT | 1 | SFT | ≤3 | `examples/training/sft/cogvideox/vap_mot/train_single_node_3ref.sh` |

> DPO and multi-reference SFT are exploratory; we provide the code to support community research.

**Recipes: Wan2.1-I2V-14B (SFT only)**

| Goal | Nodes | Objective | References / sample | Script |
| ------------ | ----- | --------- | ------------------- | -------------------------------------------------------- |
| Standard SFT | 1 | SFT | 1 | `examples/training/sft/wan/vap_mot/train_single_node.sh` |
| Standard SFT | ≥2 | SFT | 1 | `examples/training/sft/wan/vap_mot/train_multi_node.sh` |

**Quick start (CogVideoX-5B, single-node SFT)**

```bash
bash examples/training/sft/cogvideox/vap_mot/train_single_node.sh
```

**Quick start (Wan2.1-14B, single-node SFT)**

```bash
bash examples/training/sft/wan/vap_mot/train_single_node.sh
```

**Multi-node launch (example)**

```bash
# 6 nodes: pass the master address and this node's rank (0..5) on each node
bash examples/training/sft/cogvideox/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 0
bash examples/training/sft/cogvideox/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 1
...
bash examples/training/sft/cogvideox/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 5
# or for Wan:
# bash examples/training/sft/wan/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 0
# bash examples/training/sft/wan/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 1
# ...
# bash examples/training/sft/wan/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 5
```

**Notes**

* CogVideoX supports SFT, DPO, and a ≤3-reference SFT variant; Wan currently supports **standard SFT only**.
* All scripts read a shared config (datasets, output dir, batch size, etc.); edit the script to override.
* Please edit `train_multi_node*.sh` based on your environment if you want to change the distributed settings (e.g., GPU count, node count, master addr/port).

<!--
## 📚 BibTeX

If you found this repository helpful, please cite our report:

```bibtex

``` -->

## Acknowledgements

We would like to thank the contributors to the [Finetrainers](https://github.com/huggingface/finetrainers), [Diffusers](https://github.com/huggingface/diffusers), [CogVideoX](https://github.com/zai-org/CogVideo), and [Wan](https://github.com/Wan-Video/Wan2.1) repositories for their open research and exploration.

<!-- ## Star History

<a href="https://star-history.com/#bytedance/Video-As-Prompt&Date">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=bytedance/Video-As-Prompt&type=Date&theme=dark" />
    <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=bytedance/Video-As-Prompt&type=Date" />
    <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=bytedance/Video-As-Prompt&type=Date" />
  </picture>
</a> -->