Text-to-Video
zhengmingwu committed · Commit 9cbabc9 · 1 Parent(s): aa00bd0

Add base checkpoint

Files changed (3)
  1. README.md +190 -3
  2. base.pt +3 -0
  3. lora.pt +3 -0
README.md CHANGED
@@ -1,3 +1,190 @@
- ---
- license: apache-2.0
- ---
<p align="center" >
<img src="assets/logo.png" width="30%" >
</p>

# <div align="center">MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives</div>
6
+
7
+ <!-- ### <div align="center"> SIGGRAPH Asia 2025 </div> -->
8
+ <div align="center">
9
+ <p>
10
+ <a href="https://sihuiji.github.io/">Sihui Ji</a><sup>1</sup>
11
+ <a href="https://xavierchen34.github.io/">Xi Chen</a><sup>1</sup>
12
+ <a href="https://andysonys.github.io/">Shuai Yang</a><sup>3</sup>
13
+ <a href="https://www.xtao.website/">Xin Tao</a><sup>2</sup>
14
+ <a href="https://magicwpf.github.io/">Pengfei Wan</a><sup>2</sup><br>
15
+ <!-- <a href="https://openreview.net/profile?id=~Di_ZHANG3">Di Zhang</a><sup>3</sup>
16
+ <a href="https://openreview.net/profile?id=~Kun_Gai1">Kun Gai</a><sup>3</sup> -->
17
+ <a href="https://hszhao.github.io/">Hengshuang Zhao</a><sup>1✉</sup>
18
+ </p>
19
+ <p>
20
+ <sup>1</sup>The University of Hong Kong &nbsp;&nbsp;
21
+ <sup>2</sup>Kling Team, Kuaishou Technology<br>
22
+ <sup>3</sup>Hong Kong University of Science and Technology (Guangzhou) &nbsp;&nbsp;
23
+ <!-- <sup>3</sup>HKUST(GZ) &nbsp;&nbsp; -->
24
+ <sup>✉</sup>Corresponding author
25
+ </p>
26
+ </div>
27
+ <p align="center">
28
+ <a href='https://sihuiji.github.io/MemFlow.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
29
+ &nbsp;
30
+ <a href='https://www.youtube.com/watch?v=7l7-WlIrgHg'><img src='https://img.shields.io/static/v1?label=Youtube&message=DemoVideo&color=yellow&logo=youtube'></a>
31
+ &nbsp;
32
+ <a href=""><img src="https://img.shields.io/static/v1?label=Arxiv&message=MemFlow&color=red&logo=arxiv"></a>
33
+ &nbsp;
34
+ <a href='https://huggingface.co/KlingTeam/MemFlow'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-orange'></a>
35
+ </p>
36
+
37
+ <!-- **Note:** This open-source repository is intended to provide a reference implementation. Due to the difference in the underlying I2V model's performance, the open-source version may not achieve the same performance as the model in our paper. -->
38
+
39
+ ## 🔥 Updates
40
+ - __[2025.12.14]__: Training and inference code, [model checkpoints](https://huggingface.co/KlingTeam/MemFlow) are available.
41
+ <!-- - __[2025.09.25]__: [CamCloneMaster](https://arxiv.org/abs/2506.03140) has been accepted by SIGGRAPH Aisa 2025. -->
42
+ <!-- - __[2025.09.08]__: [CameraClone Dataset](https://huggingface.co/datasets/KwaiVGI/CameraClone-Dataset/) is avaliable. -->
43
+ - __[2025.12.14]__: Release the [project page](https://sihuiji.github.io/MemFlow.github.io/) and the [Arxiv](https://arxiv.org/abs/2506.03140) version.
44
+

## 📷 Introduction
**TL;DR:**
We propose MemFlow to address the core challenge of long-context consistency and narrative coherence in streaming video generation.
Before generating the upcoming chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to that chunk's text prompt.
During generation, each query in the attention layers attends only to the most relevant tokens in the memory bank, which keeps generation efficient.
As a result, MemFlow achieves outstanding long-context consistency with negligible computational overhead, and it remains compatible with any streaming video generation model that uses a KV cache.
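The prompt-driven retrieval step can be sketched as follows. This is a minimal illustration, not the paper's exact design: the embeddings, similarity measure, and `k` are assumptions, and the real system operates on latent-frame features rather than toy vectors.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve_memory(prompt_emb, frame_embs, k):
    """Indices of the k stored frames whose embeddings best match the chunk's prompt."""
    sims = [(cosine(prompt_emb, f), i) for i, f in enumerate(frame_embs)]
    sims.sort(key=lambda t: (-t[0], t[1]))  # most relevant first
    return [i for _, i in sims[:k]]
```

For example, a prompt embedding aligned with the first and third stored frames would select those two indices for the updated memory bank.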


<div align="center">

[![Watch the video](assets/cover.png)](https://www.youtube.com/watch?v=7l7-WlIrgHg)


</div>

## &#x1F4CC; Highlights

1. **Long-Context Memory with Limited Capacity**: MemFlow maintains long-range memory for visual consistency within a tightly constrained capacity, guaranteeing lightweight computation and storage.

2. **Adaptive Retrieval for Narrative Coherence**:
MemFlow dynamically retrieves from memory the historical frames most relevant to the text prompt of the upcoming chunk to ensure narrative coherence.

3. **Efficient and Real-time Inference**:
MemFlow supports real-time generation at 18.7 FPS on a single H100 GPU, sacrificing only 7.9% speed compared to the memory-free baseline.
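The per-query token activation behind highlight 3 can be illustrated with a small sketch. This is not MemFlow's actual attention kernel: the dot-product scoring, top-k cutoff, and plain lists standing in for KV-cache tensors are all simplifying assumptions.

```python
from math import exp

def sparse_attend(query, keys, values, k):
    """Attend only to the k memory tokens with the highest query-key scores."""
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: -scores[i])[:k]  # activated tokens
    # Softmax over the activated subset only; all other tokens are skipped entirely,
    # so the attention cost scales with k rather than with the full memory length.
    m = max(scores[i] for i in top)
    weights = {i: exp(scores[i] - m) for i in top}
    z = sum(weights.values())
    dim = len(values[0])
    return [sum(weights[i] * values[i][d] for i in top) / z for d in range(dim)]
```

Because only `k` of the stored tokens participate in the softmax, the memory bank can grow without proportionally growing the attention cost, which is what keeps the overhead over the memory-free baseline small.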


<!-- ## &#x1F304; Gallery -->



<!-- ## 📑 Open-source Plan

- [x] Inference code
- [x] Model checkpoints
- [x] Training code -->

## 🛠️ Installation
**Requirements**

We tested this repo on the following setup:
* Nvidia GPU with 80 GB of memory (A100 and A800 are tested).
* Linux operating system.

Other hardware setups may also work but have not been tested.

**Environment**

Create a conda environment and install dependencies:
``` sh
git clone https://github.com/KlingTeam/MemFlow
cd MemFlow
conda create -n memflow python=3.10 -y
conda activate memflow
conda install nvidia/label/cuda-12.4.1::cuda
conda install -c nvidia/label/cuda-12.4.1 cudatoolkit
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
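After installation, a quick sanity check can confirm that the environment matches the tested setup. The helper below is a hypothetical convenience, not part of the repo; the pinned versions come from the commands above.

```python
import importlib.metadata

# Pinned versions from the installation commands above.
EXPECTED = {"torch": "2.8.0", "torchvision": "0.23.0"}

def check_pins(expected, installed):
    """Return the packages whose installed version does not match the pin."""
    return {name: (want, installed.get(name)) for name, want in expected.items()
            if installed.get(name) != want}

def installed_versions(names):
    """Look up actual installed versions; None for packages that are missing."""
    out = {}
    for name in names:
        try:
            out[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            out[name] = None
    return out
```

Running `check_pins(EXPECTED, installed_versions(EXPECTED))` in the activated `memflow` environment should return an empty dict when everything matches.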

## 🧱 Download Checkpoints

Download models using huggingface-cli:
``` sh
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download KlingTeam/MemFlow --local-dir checkpoints
```
or using git:
``` sh
git lfs install
git clone https://huggingface.co/KlingTeam/MemFlow
```

## 🔑 Inference
<!-- **Download checkpoints** -->

**Single Prompt Video Generation**
``` sh
bash inference.sh
```
**Interactive Long Video Generation**
``` sh
bash interactive_inference.sh
```
**Hints for video prompts**

1. For each subject and background appearing in a video, keeping its description consistent across the different prompts within the same video greatly improves global coherence during prompt switches. See the demo page for the exact prompt sets we used to produce some of our videos.

2. MemFlow supports diverse interactions: action changes, introducing/removing objects, background shifts, and more. While large-scale continuous camera motions can be achieved through appropriate cinematic language (see [`prompts/interactive_example.jsonl`](prompts/interactive_example.jsonl)), rapid shot-to-shot transitions or fast cutscene-style edits are not supported.
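Hint 1 above can be made concrete: the snippet below builds an interactive prompt list in the JSONL spirit of `prompts/interactive_example.jsonl`, repeating one fixed subject description verbatim in every chunk. The field names and prompts are illustrative, not the repo's actual schema.

```python
import json

# A fixed subject description, repeated verbatim in every chunk's prompt (hint 1).
SUBJECT = "a woman in a red coat with short black hair"

chunks = [
    {"chunk": 0, "prompt": f"{SUBJECT} walks along a rainy street at night"},
    {"chunk": 1, "prompt": f"{SUBJECT} stops and opens a black umbrella"},
    {"chunk": 2, "prompt": f"{SUBJECT} looks up as the rain slowly stops"},
]

# One JSON object per line, as in a .jsonl prompt file.
jsonl = "\n".join(json.dumps(c) for c in chunks)
```

Because the subject phrase never varies, the retrieval step has a stable textual anchor across prompt switches, which is what preserves the subject's appearance between chunks.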
137
+
138
+ ## ⚙️ Training
139
+ **Download checkpoints**
140
+
141
+ Please follow [Self-Forcing](https://github.com/guandeh17/Self-Forcing) to download text prompts and ODE initialized checkpoint.
142
+
143
+ Download Wan2.1-T2V-14B as the teacher model.
144
+
145
+ ```
146
+ huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir wan_models/Wan2.1-T2V-14B
147
+ ```
148
+
149
+ **Stage 1: Self-Forcing Initialization for Memory Mechanism**
150
+ ```
151
+ bash train_init.sh
152
+ ```
153
+ **Stage 2: Streaming Long Tuning**
154
+ ```
155
+ bash train_long.sh
156
+ ```
157
+
158
+ **Hints for two stage training**
159
+
160
+ The `bank_size` is a tunable hyperparameter specified in [`configs/train_init.yaml`](configs/train_init.yaml) and [`configs/train_long.yaml`](configs/train_long.yaml). It controls the number of latent frames stored in the memory bank. When `bank_size` matches the number of latent frames of frame sink in [LongLive](https://github.com/NVlabs/LongLive) (as in our default setting), training can optionally start directly from Stage 2 (Streaming Long Tuning). Specifically, we initialize from the checkpoint [`longlive_base.pt`](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B/blob/main/models/longlive_base.pt) obtained in Stage 1 of [LongLive](https://github.com/NVlabs/LongLive) and fine-tune only the LoRA parameters, which significantly improves training efficiency.
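The capacity constraint that `bank_size` imposes can be pictured as a bounded buffer. This is a deliberate simplification: MemFlow updates its bank by relevance-based retrieval, not plain FIFO eviction, and the value 4 below is illustrative rather than the config's default.

```python
from collections import deque

bank_size = 4  # illustrative value; the real setting lives in configs/train_init.yaml

# A bounded memory bank: once full, adding a new latent frame evicts the oldest.
memory_bank = deque(maxlen=bank_size)
for frame_id in range(7):  # stream 7 latent frames through the bank
    memory_bank.append(frame_id)

# Only the most recent `bank_size` frames remain resident.
```

However the bank is populated, its footprint stays fixed at `bank_size` latent frames, which is why memory and attention cost do not grow with video length.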


<!-- ## How to contribute
- Make sure to have git installed.
- Create your own [fork](https://github.com/NVlabs/LongLive/fork) of the project.
- Clone the repository on your local machine, using git clone and pasting the url of this project.
- Read both the `Requirements` and `Installation and Quick Guide` sections below.
- Commit and push your changes.
- Make a pull request when finished modifying the project. -->

## 🤗 Acknowledgement
- [LongLive](https://github.com/NVlabs/LongLive): the codebase we built upon. Thanks for their wonderful work.
- [Self-Forcing](https://github.com/guandeh17/Self-Forcing): the algorithm we built upon. Thanks for their wonderful work.
- [Wan](https://github.com/Wan-Video/Wan2.1): the base model we built upon. Thanks for their wonderful work.


## 🌟 Citation
Please leave us a star 🌟 and cite our paper if you find our work helpful.

```
@misc{ji2025memflow,
  title={MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives},
  author={Ji, Sihui and Chen, Xi and Yang, Shuai and Tao, Xin and Wan, Pengfei and Zhao, Hengshuang},
  year={2025},
  eprint={2512.xxxxx},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.xxxxx},
}
```
base.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10a2aa8fcf89c77d9033f4c117405412a690e289625766619d293f0c5a208ee7
+ size 5676334208
lora.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:114cace42a4bd47aff594906e049750b47ea23268b1cf76eb381860663bda865
+ size 2800056690