# Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation (CVPR 2025)

<div align="center">
<a href='http://arxiv.org/abs/2503.19881'><img src='https://img.shields.io/badge/arXiv-2503.19881-b31b1b.svg'></a>
<a href='https://tianhao-qi.github.io/Mask2DiTProject/'><img src='https://img.shields.io/badge/Project%20Page-Mask²DiT-Green'></a>
<a href='https://huggingface.co/qth/Mask2DiT'>
<img src='https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=yellow'>
</a>

_**[Tianhao Qi*](https://tianhao-qi.github.io/), [Jianlong Yuan✝](https://scholar.google.com.tw/citations?user=vYe1uCQAAAAJ&hl=zh-CN), [Wanquan Feng](https://wanquanf.github.io/), [Shancheng Fang✉](https://scholar.google.com/citations?user=8Efply8AAAAJ&hl=zh-CN), [Jiawei Liu](https://scholar.google.com/citations?user=X21Fz-EAAAAJ&hl=en&authuser=1), <br>[SiYu Zhou](https://openreview.net/profile?id=~SiYu_Zhou3), [Qian He](https://scholar.google.com/citations?view_op=list_works&hl=zh-CN&authuser=1&user=9rWWCgUAAAAJ), [Hongtao Xie](https://imcc.ustc.edu.cn/_upload/tpl/0d/13/3347/template3347/xiehongtao.html), [Yongdong Zhang](https://scholar.google.com.hk/citations?user=hxGs4ukAAAAJ&hl=zh-CN)**_
<br><br>
(*Work done during an internship at ByteDance Intelligent Creation, ✝Project lead, ✉Corresponding author)

From the University of Science and Technology of China, ByteDance Intelligent Creation, and Yuanshi Inc.

</div>
## 🔆 Introduction

**TL;DR:** We present **Mask²DiT**, a novel dual-mask-based diffusion transformer designed for multi-scene long video generation. It enables both **synthesizing a fixed number of scenes** and **auto-regressively expanding new scenes**, advancing the scalability and continuity of long video synthesis. <br>

### ⭐⭐ Fixed-Scene Video Generation

<div align="center">
<video src="https://huggingface.co/qth/Mask2DiT/resolve/main/asset/fixed_scene_generation.mp4" width="640" controls autoplay loop muted></video>
<p> Videos generated with a <b>fixed number of scenes</b> using Mask²DiT.<br> Each scene maintains coherent appearance and motion across temporal boundaries. </p> </div>

### ⭐⭐ Auto-Regressive Scene Expansion

<div align="center">
<video src="https://huggingface.co/qth/Mask2DiT/resolve/main/asset/autoregressive_scene_expansion.mp4" width="640" controls autoplay loop muted></video>
<p> Mask²DiT <b>extends multi-scene narratives</b> auto-regressively,<br> producing long and coherent videos with evolving context. </p> </div>

## 📝 Changelog
- __[2025.10.15]__: 🔥🔥 Release the code and checkpoint.
- __[2025.03.26]__: 🔥🔥 Release the arXiv paper and project page.
## 🧩 Inference

We provide two inference pipelines for long video generation:
- 🎬 Fixed-Scene Generation — generate videos with a fixed number of scenes.
- 🔄 Auto-Regressive Scene Expansion — expand scenes continuously based on previous context.

---

### 1️⃣ Prepare Pretrained Model

Download the pretrained model from [Hugging Face](https://huggingface.co/qth/Mask2DiT/tree/main) and place it under:
```
./models/
```
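
If you prefer to script the download, the `huggingface_hub` client can mirror the checkpoint repository into place (a minimal sketch; the `local_dir` value is our assumption about where the scripts expect the weights):
```python
# Download the Mask2DiT checkpoint repo into ./models/ (adjust local_dir as needed).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="qth/Mask2DiT", local_dir="./models")
```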

### 2️⃣ Environment Setup

We recommend installing the dependencies inside a virtual environment. You can create one with `conda` as follows:
```bash
conda create -n mask2dit python=3.11.2
conda activate mask2dit
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```
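
After installation, a quick sanity check (our suggestion, not part of the repository) confirms that the pinned CUDA build of PyTorch is active:
```python
# Verify the CUDA-enabled PyTorch build can see the GPU.
import torch

print(torch.__version__)          # expect 2.4.1+cu124
print(torch.cuda.is_available())  # should print True on a CUDA machine
```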

---

### 🎬 Fixed-Scene Video Generation

Use this script to synthesize videos with a fixed number of scenes:
```bash
python examples/cogvideox_fun/predict_multi_scene_t2v_mask2dit.py
```

#### 📦 Output:

The generated multi-scene video is saved under `samples/mask2dit-cogvideox-5b-multi-scene-t2v`.
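
To grab the most recent result programmatically (a small sketch, assuming the script writes standard video files such as `.mp4` into that folder):
```python
# List generated videos in the output folder, newest first (assumes .mp4 outputs).
from pathlib import Path

out_dir = Path("samples/mask2dit-cogvideox-5b-multi-scene-t2v")
videos = sorted(out_dir.glob("*.mp4"), key=lambda p: p.stat().st_mtime, reverse=True)
print(*videos, sep="\n")
```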

---

### 🔄 Auto-Regressive Scene Expansion

Use this script to expand video scenes sequentially based on the given context:
```bash
python examples/cogvideox_fun/predict_autoregressive_scene_expansion_mask2dit.py
```

#### 📦 Output:

This mode auto-regressively extends the video while maintaining global temporal consistency, saving the expanded video under `samples/mask2dit-cogvideox-5b-autoregressive-scene-expansion`.
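
Conceptually, each new scene is generated conditioned on all frames produced so far, which is what preserves appearance across scene boundaries. A pseudocode-level sketch of that loop, where `generate_scene` is a hypothetical stand-in and not this repository's API (see the script above for the real entry point):
```python
# Hypothetical sketch of the auto-regressive loop; generate_scene is a stand-in,
# not a function from this repository.
from typing import List

def generate_scene(prompt: str, context: List[str]) -> List[str]:
    # Stand-in for the model call; returns placeholder frame identifiers.
    return [f"{prompt} / frame {i}" for i in range(3)]

context_frames: List[str] = []
for prompt in ["Scene 1: a chef preps vegetables", "Scene 2: the chef plates the dish"]:
    # Conditioning on the accumulated context keeps the long video coherent.
    context_frames.extend(generate_scene(prompt, context_frames))
print(len(context_frames))
```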

## 🧑🏫 Training

### 1️⃣ Prepare Training Data

Please prepare your datasets following the provided examples:
- `datasets/pretrain.csv` → used for pretraining
- `datasets/sft.json` → used for supervised fine-tuning (SFT)

💡 You can modify these template files to fit your own dataset paths and captions.
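
To see exactly which fields the templates define before substituting your own data, you can inspect them directly (a suggested helper, assuming `pandas` is available in your environment):
```python
# Print the structure of the provided dataset templates.
import json

import pandas as pd

pretrain = pd.read_csv("datasets/pretrain.csv")
print(pretrain.columns.tolist())  # columns the pretraining script reads
print(pretrain.head())

with open("datasets/sft.json") as f:
    sft = json.load(f)
print(type(sft).__name__, len(sft))  # top-level structure of the SFT template
```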

### 2️⃣ Pretraining

We pretrain Mask²DiT using the provided `datasets/pretrain.csv`. Use the following script to start pretraining:
```bash
bash scripts/cogvideox_fun/train_mask2dit_pretrain.sh
```

### 3️⃣ Supervised Fine-Tuning (SFT)

After pretraining, we fine-tune Mask²DiT using `datasets/sft.json`. Use the following script to start SFT:
```bash
bash scripts/cogvideox_fun/train_mask2dit_sft.sh
```

## 📚 BibTeX

If you find our work useful for your research, please consider citing it with the following BibTeX entry:
```bibtex
@inproceedings{qi2025mask,
  title={Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation},
  author={Qi, Tianhao and Yuan, Jianlong and Feng, Wanquan and Fang, Shancheng and Liu, Jiawei and Zhou, SiYu and He, Qian and Xie, Hongtao and Zhang, Yongdong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18837--18846},
  year={2025}
}
```
|