# Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation (CVPR 2025)

<div align="center">
<a href='http://arxiv.org/abs/2503.19881'><img src='https://img.shields.io/badge/arXiv-2503.19881-b31b1b.svg'></a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<a href='https://tianhao-qi.github.io/Mask2DiTProject/'><img src='https://img.shields.io/badge/Project%20Page-Mask²DiT-Green'></a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<a href='https://huggingface.co/qth/Mask2DiT'>
<img src='https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=yellow'>
</a>

_**[Tianhao Qi*](https://tianhao-qi.github.io/), [Jianlong Yuan✝](https://scholar.google.com.tw/citations?user=vYe1uCQAAAAJ&hl=zh-CN), [Wanquan Feng](https://wanquanf.github.io/), [Shancheng Fang✉](https://scholar.google.com/citations?user=8Efply8AAAAJ&hl=zh-CN), [Jiawei Liu](https://scholar.google.com/citations?user=X21Fz-EAAAAJ&hl=en&authuser=1), <br>[SiYu Zhou](https://openreview.net/profile?id=~SiYu_Zhou3), [Qian He](https://scholar.google.com/citations?view_op=list_works&hl=zh-CN&authuser=1&user=9rWWCgUAAAAJ), [Hongtao Xie](https://imcc.ustc.edu.cn/_upload/tpl/0d/13/3347/template3347/xiehongtao.html), [Yongdong Zhang](https://scholar.google.com.hk/citations?user=hxGs4ukAAAAJ&hl=zh-CN)**_
<br><br>
(*Work done during an internship at ByteDance Intelligent Creation, ✝Project lead, ✉Corresponding author)

From the University of Science and Technology of China, ByteDance Intelligent Creation, and Yuanshi Inc.

</div>

## 🔆 Introduction

**TL;DR:** We present **Mask²DiT**, a novel dual-mask-based diffusion transformer designed for multi-scene long video generation. It enables both **synthesizing a fixed number of scenes** and **auto-regressively expanding new scenes**, advancing the scalability and continuity of long video synthesis.

### ⭐⭐ Fixed-Scene Video Generation

<div align="center">
<video src="https://huggingface.co/qth/Mask2DiT/resolve/main/asset/fixed_scene_generation.mp4" width="640" controls autoplay loop muted></video>
<p> Videos generated with a <b>fixed number of scenes</b> using Mask²DiT.<br> Each scene maintains coherent appearance and motion across temporal boundaries. </p> </div>

### ⭐⭐ Auto-Regressive Scene Expansion

<div align="center">
<video src="https://huggingface.co/qth/Mask2DiT/resolve/main/asset/autoregressive_scene_expansion.mp4" width="640" controls autoplay loop muted></video>
<p> Mask²DiT <b>extends multi-scene narratives</b> auto-regressively,<br> producing long and coherent videos with evolving context. </p> </div>

## 📝 Changelog
- __[2025.10.15]__: 🔥🔥 Released the code and checkpoint.
- __[2025.03.26]__: 🔥🔥 Released the arXiv paper and project page.

## 🧩 Inference

We provide two inference pipelines for long video generation:
- 🎬 Fixed-Scene Generation — generate videos with a fixed number of scenes.
- 🔄 Auto-Regressive Scene Expansion — expand scenes continuously based on previous context.

---

### 1️⃣ Prepare Pretrained Model

Download the pretrained model from [Hugging Face](https://huggingface.co/qth/Mask2DiT/tree/main) and place it under:
```
./models/
```

### 2️⃣ Environment Setup

We recommend installing the dependencies inside a virtual environment. You can create one with `conda` as follows:
```bash
conda create -n mask2dit python=3.11.2
conda activate mask2dit
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```

---

### 🎬 Fixed-Scene Video Generation

Use this script to synthesize videos with a fixed number of scenes:
```bash
python examples/cogvideox_fun/predict_multi_scene_t2v_mask2dit.py
```

#### 📦 Output

The generated multi-scene video will be saved under `samples/mask2dit-cogvideox-5b-multi-scene-t2v`.

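For intuition: Mask²DiT binds each scene's text tokens to that scene's video tokens with a symmetric binary attention mask, while video tokens still attend to each other across scenes to preserve temporal coherence. The toy NumPy sketch below illustrates that idea under an assumed token layout (all video tokens first, then each scene's text tokens); it is a simplified illustration, not the repository's actual implementation.

```python
import numpy as np

def build_scene_mask(num_scenes: int, text_len: int, vid_len: int) -> np.ndarray:
    """Toy scene-level binary attention mask (illustrative layout only).

    Video tokens attend to all video tokens; each scene's text tokens
    attend only to that scene's video tokens and to themselves.
    """
    n_vid = num_scenes * vid_len
    n = n_vid + num_scenes * text_len
    mask = np.zeros((n, n), dtype=bool)
    # Video-video: full attention across every scene (temporal coherence).
    mask[:n_vid, :n_vid] = True
    for s in range(num_scenes):
        v = slice(s * vid_len, (s + 1) * vid_len)
        t = slice(n_vid + s * text_len, n_vid + (s + 1) * text_len)
        mask[t, v] = True  # text_s attends to video_s only
        mask[v, t] = True  # symmetric: video_s attends to text_s
        mask[t, t] = True  # text_s attends to itself
    return mask
```

Such a boolean mask could then be passed to an attention layer so that cross-scene text-to-video attention scores are masked out.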
---

### 🔄 Auto-Regressive Scene Expansion

Use this script to expand video scenes sequentially based on the given context:
```bash
python examples/cogvideox_fun/predict_autoregressive_scene_expansion_mask2dit.py
```

#### 📦 Output

This mode auto-regressively extends the video while maintaining global temporal consistency, storing the expanded video under `samples/mask2dit-cogvideox-5b-autoregressive-scene-expansion`.

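Conceptually, auto-regressive expansion conditions each new scene on the scene(s) generated just before it. A minimal sketch of such a loop, where `generate_scene` is a hypothetical placeholder for the actual diffusion pipeline call:

```python
from typing import Callable, List

def expand_scenes(prompts: List[str],
                  generate_scene: Callable,
                  context_scenes: int = 1) -> list:
    """Illustrative auto-regressive expansion loop (not the repository's code)."""
    video = []  # generated scene clips, in order
    for prompt in prompts:
        # Condition on the tail of what has been generated so far;
        # empty context for the very first scene.
        context = video[-context_scenes:]
        video.append(generate_scene(context, prompt))
    return video
```

The key design point is that each step reuses previously generated frames as conditioning, which is what keeps appearance and motion consistent across scene boundaries.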
## 🧑‍🏫 Training

### 1️⃣ Prepare Training Data

Please prepare your datasets following the provided examples:
- `datasets/pretrain.csv` → used for pretraining
- `datasets/sft.json` → used for supervised fine-tuning (SFT)

💡 You can modify these template files to fit your own dataset paths and captions.

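The exact schemas are defined by the template files in the repository; the snippet below only illustrates, with entirely hypothetical field names, the kind of per-scene caption records such CSV/JSON files typically carry.

```python
import csv
import io
import json

# Hypothetical multi-scene training record; the real field names come
# from the repository's template files, not from this sketch.
record = {
    "video_path": "videos/0001.mp4",
    "scene_prompts": [
        "A chef slices vegetables in a sunlit kitchen.",
        "The chef plates the finished dish on the counter.",
    ],
}

# sft.json style: one JSON object per training sample.
json_line = json.dumps(record)

# pretrain.csv style: flatten the per-scene prompts into a single column.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["video_path", "captions"])
writer.writeheader()
writer.writerow({
    "video_path": record["video_path"],
    "captions": " | ".join(record["scene_prompts"]),
})
```

Whatever schema you use, keep it consistent with the template files so the training scripts can parse it.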
### 2️⃣ Pretraining

We pretrain Mask²DiT using the provided `datasets/pretrain.csv`. Use the following script to start pretraining:
```bash
bash scripts/cogvideox_fun/train_mask2dit_pretrain.sh
```

### 3️⃣ Supervised Fine-Tuning (SFT)

After pretraining, we fine-tune Mask²DiT using `datasets/sft.json`. Use the following script to start SFT:
```bash
bash scripts/cogvideox_fun/train_mask2dit_sft.sh
```

## BibTeX

If you find our work useful for your research, please consider citing it with the following BibTeX entry:
```bibtex
@inproceedings{qi2025mask,
  title={Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation},
  author={Qi, Tianhao and Yuan, Jianlong and Feng, Wanquan and Fang, Shancheng and Liu, Jiawei and Zhou, SiYu and He, Qian and Xie, Hongtao and Zhang, Yongdong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18837--18846},
  year={2025}
}
```