---
license: apache-2.0
tags:
- diffusion
- video-generation
- multi-scene
- autoregressive
- transformer
- computer-vision
- cvpr2025
model-index:
- name: Mask²DiT
  results: []
---

# Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation (CVPR 2025)

<div align="center">
 <a href='http://arxiv.org/abs/2503.19881'><img src='https://img.shields.io/badge/arXiv-2503.19881-b31b1b.svg'></a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 <a href='https://tianhao-qi.github.io/Mask2DiTProject/'><img src='https://img.shields.io/badge/Project%20Page-Mask²DiT-Green'></a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 <a href='https://huggingface.co/qth/Mask2DiT'><img src='https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=yellow'></a>

_**[Tianhao Qi*](https://tianhao-qi.github.io/), [Jianlong Yuan✝](https://scholar.google.com.tw/citations?user=vYe1uCQAAAAJ&hl=zh-CN), [Wanquan Feng](https://wanquanf.github.io/), [Shancheng Fang✉](https://scholar.google.com/citations?user=8Efply8AAAAJ&hl=zh-CN), [Jiawei Liu](https://scholar.google.com/citations?user=X21Fz-EAAAAJ&hl=en&authuser=1), <br>[SiYu Zhou](https://openreview.net/profile?id=~SiYu_Zhou3), [Qian He](https://scholar.google.com/citations?view_op=list_works&hl=zh-CN&authuser=1&user=9rWWCgUAAAAJ), [Hongtao Xie](https://imcc.ustc.edu.cn/_upload/tpl/0d/13/3347/template3347/xiehongtao.html), [Yongdong Zhang](https://scholar.google.com.hk/citations?user=hxGs4ukAAAAJ&hl=zh-CN)**_
<br><br>
(*Work done during an internship at ByteDance Intelligent Creation, ✝Project lead, ✉Corresponding author)

From the University of Science and Technology of China, ByteDance Intelligent Creation, and Yuanshi Inc.

</div>

## 🔆 Introduction

**TL;DR:** We present **Mask²DiT**, a novel dual-mask-based diffusion transformer designed for multi-scene long video generation. It supports both **synthesizing a fixed number of scenes** and **auto-regressively appending new scenes**, advancing the scalability and continuity of long video synthesis. <br>

### ⭐⭐ Fixed-Scene Video Generation

<div align="center">
<video src="https://huggingface.co/qth/Mask2DiT/resolve/main/asset/fixed_scene_generation.mp4" width="640" controls autoplay loop muted></video>
 <p> Videos generated with a <b>fixed number of scenes</b> using Mask²DiT.<br> Each scene maintains coherent appearance and motion across temporal boundaries. </p> </div>

### ⭐⭐ Auto-Regressive Scene Expansion

<div align="center">
<video src="https://huggingface.co/qth/Mask2DiT/resolve/main/asset/autoregressive_scene_expansion.mp4" width="640" controls autoplay loop muted></video>
<p> Mask²DiT <b>extends multi-scene narratives</b> auto-regressively,<br> producing long and coherent videos with evolving context. </p> </div>

## 📝 Changelog
- __[2025.10.15]__: 🔥🔥 Release the code and checkpoint.
- __[2025.03.26]__: 🔥🔥 Release the arXiv paper and project page.

## 🧩 Inference

We provide two inference pipelines for long video generation:
- 🎬 Fixed-Scene Generation — generate videos with a fixed number of scenes.
- 🔄 Auto-Regressive Scene Expansion — expand scenes continuously based on previous context.

---

### 1️⃣ Prepare Pretrained Model

Download the pretrained model from [Hugging Face](https://huggingface.co/qth/Mask2DiT/tree/main) and place it under:
```
./models/
```
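
If you prefer to script the download, one option is the `huggingface_hub` Python API (a minimal sketch, assuming `huggingface_hub` is installed; `pip install huggingface_hub` if not):
```python
from huggingface_hub import snapshot_download

# Fetch the full Mask2DiT checkpoint repository into ./models/.
snapshot_download(repo_id="qth/Mask2DiT", local_dir="./models")
```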

### 2️⃣ Environment Setup

We recommend installing the required dependencies inside a virtual environment, which you can create with `conda` as follows:
```bash
conda create -n mask2dit python=3.11.2
conda activate mask2dit
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```
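
Before running inference, a quick sanity check confirms that the expected PyTorch build is active and a CUDA device is visible (generic PyTorch calls, nothing Mask²DiT-specific):
```python
import torch

# Confirm the installed version and GPU visibility.
print("torch:", torch.__version__)            # expect 2.4.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```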

---

### 🎬 Fixed-Scene Video Generation

Use this script to synthesize videos with a fixed number of scenes:
```bash
python examples/cogvideox_fun/predict_multi_scene_t2v_mask2dit.py
```

#### 📦 Output:

The generated multi-scene video will be saved under `samples/mask2dit-cogvideox-5b-multi-scene-t2v`.

---

### 🔄 Auto-Regressive Scene Expansion

Use this script to expand video scenes sequentially based on the given context:
```bash
python examples/cogvideox_fun/predict_autoregressive_scene_expansion_mask2dit.py
```

#### 📦 Output:

This mode auto-regressively extends the video while maintaining global temporal consistency, storing the expanded video under `samples/mask2dit-cogvideox-5b-autoregressive-scene-expansion`.
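
Once a run finishes, you can list what was produced with a few lines of Python (a sketch that assumes the pipeline writes `.mp4` files directly into the sample directory):
```python
from pathlib import Path

# List generated videos and their sizes; swap in
# samples/mask2dit-cogvideox-5b-multi-scene-t2v for the fixed-scene pipeline.
out_dir = Path("samples/mask2dit-cogvideox-5b-autoregressive-scene-expansion")
for video in sorted(out_dir.glob("*.mp4")):
    print(video.name, f"{video.stat().st_size / 1e6:.1f} MB")
```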

## 🧑‍🏫 Training

### 1️⃣ Prepare Training Data

Please prepare your datasets following the provided examples:
- `datasets/pretrain.csv` → used for pretraining
- `datasets/sft.json` → used for supervised fine-tuning (SFT)

💡 You can modify these template files to fit your own dataset paths and captions.
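
Because the expected column and field names are defined by the shipped templates, it can help to print their structure before substituting your own data (a sketch that assumes `datasets/sft.json` holds a top-level JSON list of records; adapt it if the file is structured differently):
```python
import csv
import json

# Print the header row of the pretraining CSV template.
with open("datasets/pretrain.csv", newline="") as f:
    print("pretrain.csv columns:", next(csv.reader(f)))

# Print the keys of the first SFT record (assumes a JSON list).
with open("datasets/sft.json") as f:
    records = json.load(f)
print("sft.json record keys:", list(records[0].keys()))
```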

### 2️⃣ Pretraining

We pretrain Mask²DiT using the provided `datasets/pretrain.csv`. Use the following script to start pretraining:
```bash
bash scripts/cogvideox_fun/train_mask2dit_pretrain.sh
```

### 3️⃣ Supervised Fine-Tuning (SFT)

After pretraining, we fine-tune Mask²DiT using `datasets/sft.json`. Use the following script to start SFT:
```bash
bash scripts/cogvideox_fun/train_mask2dit_sft.sh
```

## 🙏 Acknowledgement

This project is built upon the open-source repository 
[VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun).  
We sincerely thank the original authors for their excellent work and open-source contributions.

## Bibtex
If you find our work useful for your research, please consider citing it using the following BibTeX entry:
```bibtex
@inproceedings{qi2025mask,
  title={Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation},
  author={Qi, Tianhao and Yuan, Jianlong and Feng, Wanquan and Fang, Shancheng and Liu, Jiawei and Zhou, SiYu and He, Qian and Xie, Hongtao and Zhang, Yongdong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18837--18846},
  year={2025}
}
```