---
license: apache-2.0
tags:
- diffusion
- video-generation
- multi-scene
- autoregressive
- transformer
- computer-vision
- cvpr2025
model-index:
- name: Mask²DiT
results: []
---
# Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation (CVPR 2025)
<div align="center">
<a href='http://arxiv.org/abs/2503.19881'><img src='https://img.shields.io/badge/arXiv-2503.19881-b31b1b.svg'></a>
<a href='https://tianhao-qi.github.io/Mask2DiTProject/'><img src='https://img.shields.io/badge/Project%20Page-Mask²DiT-Green'></a>
<a href='https://huggingface.co/qth/Mask2DiT'>
<img src='https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=yellow'>
</a>
_**[Tianhao Qi*](https://tianhao-qi.github.io/), [Jianlong Yuan✝](https://scholar.google.com.tw/citations?user=vYe1uCQAAAAJ&hl=zh-CN), [Wanquan Feng](https://wanquanf.github.io/), [Shancheng Fang✉](https://scholar.google.com/citations?user=8Efply8AAAAJ&hl=zh-CN), [Jiawei Liu](https://scholar.google.com/citations?user=X21Fz-EAAAAJ&hl=en&authuser=1), <br>[SiYu Zhou](https://openreview.net/profile?id=~SiYu_Zhou3), [Qian He](https://scholar.google.com/citations?view_op=list_works&hl=zh-CN&authuser=1&user=9rWWCgUAAAAJ), [Hongtao Xie](https://imcc.ustc.edu.cn/_upload/tpl/0d/13/3347/template3347/xiehongtao.html), [Yongdong Zhang](https://scholar.google.com.hk/citations?user=hxGs4ukAAAAJ&hl=zh-CN)**_
<br><br>
(*Work done during an internship at ByteDance Intelligent Creation, ✝Project lead, ✉Corresponding author)
From the University of Science and Technology of China, ByteDance Intelligent Creation, and Yuanshi Inc.
</div>
## 🔆 Introduction
**TL;DR:** We present **Mask²DiT**, a novel dual-mask-based diffusion transformer for multi-scene long video generation. It supports both **synthesizing a fixed number of scenes** and **auto-regressively appending new scenes**, improving the scalability and continuity of long video synthesis. <br>
### ⭐⭐ Fixed-Scene Video Generation.
<div align="center">
<video src="https://huggingface.co/qth/Mask2DiT/resolve/main/asset/fixed_scene_generation.mp4" width="640" controls autoplay loop muted></video>
<p> Videos generated with a <b>fixed number of scenes</b> using Mask²DiT.<br> Each scene maintains coherent appearance and motion across temporal boundaries. </p> </div>
### ⭐⭐ Auto-Regressive Scene Expansion.
<div align="center">
<video src="https://huggingface.co/qth/Mask2DiT/resolve/main/asset/autoregressive_scene_expansion.mp4" width="640" controls autoplay loop muted></video>
<p> Mask²DiT <b>extends multi-scene narratives</b> auto-regressively,<br> producing long and coherent videos with evolving context. </p> </div>
## 📝 Changelog
- __[2025.10.15]__: 🔥🔥 Release the code and checkpoint.
- __[2025.03.26]__: 🔥🔥 Release the arXiv paper and project page.
## 🧩 Inference
We provide two inference pipelines for long video generation:
- 🎬 Fixed-Scene Generation — generate videos with a fixed number of scenes.
- 🔄 Auto-Regressive Scene Expansion — expand scenes continuously based on previous context.
---
### 1️⃣ Prepare Pretrained Model
Download the pretrained model from [Hugging Face](https://huggingface.co/qth/Mask2DiT/tree/main) and place it under:
```
./models/
```
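If you prefer scripting the download, here is a minimal sketch using `huggingface_hub`'s `snapshot_download` (the repo id is real; the wrapper name `fetch_checkpoint` is only illustrative and not part of this repo):

```python
# Sketch: fetch the Mask2DiT checkpoint into ./models/ via huggingface_hub.
# `fetch_checkpoint` is an illustrative helper, not a function from this repo.
def fetch_checkpoint(local_dir: str = "./models") -> str:
    # Imported lazily so the sketch reads even without huggingface_hub installed;
    # install it first with: pip install -U huggingface_hub
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id="qth/Mask2DiT", local_dir=local_dir)

if __name__ == "__main__":
    # Prints the local directory containing the downloaded files.
    print(fetch_checkpoint())
```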
### 2️⃣ Environment Setup
We recommend using a virtual environment to install the required dependencies. You can create a virtual environment using `conda` as follows:
```bash
conda create -n mask2dit python=3.11.2
conda activate mask2dit
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```
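A quick way to confirm the dependencies above landed in the active environment, using only the standard library (the helper name `check_packages` is ours):

```python
# Sketch: verify that the key packages are importable in the current environment.
# `check_packages` is an illustrative helper, not part of this repo.
import importlib.util

def check_packages(pkgs=("torch", "torchvision", "torchaudio")):
    """Return a dict mapping each package name to whether it is importable."""
    return {p: importlib.util.find_spec(p) is not None for p in pkgs}

if __name__ == "__main__":
    for name, ok in check_packages().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```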
---
### 🎬 Fixed-Scene Video Generation
Use this script to synthesize videos with a fixed number of scenes:
```bash
python examples/cogvideox_fun/predict_multi_scene_t2v_mask2dit.py
```
#### 📦 Output:
The generated multi-scene video will be saved under `samples/mask2dit-cogvideox-5b-multi-scene-t2v`.
---
### 🔄 Auto-Regressive Scene Expansion
Use this script to expand video scenes sequentially based on the given context.
```bash
python examples/cogvideox_fun/predict_autoregressive_scene_expansion_mask2dit.py
```
#### 📦 Output:
This mode auto-regressively extends the video while maintaining global temporal consistency, saving the expanded video under `samples/mask2dit-cogvideox-5b-autoregressive-scene-expansion`.
## 🧑🏫 Training
### 1️⃣ Prepare Training Data
Please prepare your datasets following the provided examples:
- `datasets/pretrain.csv` → used for pretraining
- `datasets/sft.json` → used for supervised fine-tuning (SFT)
💡 You can modify these template files to fit your own dataset paths and captions.
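To illustrate the general shape of a video-caption dataset file, here is a hypothetical sketch built with the standard `csv` module. The column names `video_path` and `caption` are our assumption, not the repo's actual schema; follow the template files shipped with the repo for the real format.

```python
# Hypothetical illustration of a video-caption CSV in the spirit of
# datasets/pretrain.csv. The column names below are assumed for illustration;
# the actual schema may differ -- check the template shipped with the repo.
import csv
import io

rows = [
    {"video_path": "videos/clip_0001.mp4",
     "caption": "A chef plates a dessert in a bright kitchen."},
    {"video_path": "videos/clip_0002.mp4",
     "caption": "Waves crash against a rocky shoreline at dusk."},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["video_path", "caption"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```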
### 2️⃣ Pretraining
We pretrain Mask²DiT using the provided `datasets/pretrain.csv`. Use the following script to start pretraining:
```bash
bash scripts/cogvideox_fun/train_mask2dit_pretrain.sh
```
### 3️⃣ Supervised Fine-Tuning (SFT)
After pretraining, we fine-tune Mask²DiT on `datasets/sft.json`. Use the following script to start SFT:
```bash
bash scripts/cogvideox_fun/train_mask2dit_sft.sh
```
## 🙏 Acknowledgement
This project is built upon the open-source repository
[VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun).
We sincerely thank the original authors for their excellent work and open-source contributions.
## Bibtex
If you find our work useful for your research, please cite it using the following BibTeX:
```bibtex
@inproceedings{qi2025mask,
title={Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation},
author={Qi, Tianhao and Yuan, Jianlong and Feng, Wanquan and Fang, Shancheng and Liu, Jiawei and Zhou, SiYu and He, Qian and Xie, Hongtao and Zhang, Yongdong},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18837--18846},
year={2025}
}
```