---
pipeline_tag: text-to-video
---
<p align="center" >
<img src="assets/logo.png" width="30%" >
</p>
# <div align="center">Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives</div>
<!-- ### <div align="center"> SIGGRAPH Asia 2025 </div> -->
<div align="center">
<p>
<a href="https://sihuiji.github.io/">Sihui Ji</a><sup>1</sup>
<a href="https://xavierchen34.github.io/">Xi Chen</a><sup>1</sup>
<a href="https://andysonys.github.io/">Shuai Yang</a><sup>3</sup>
<a href="https://www.xtao.website/">Xin Tao</a><sup>2</sup>
<a href="https://magicwpf.github.io/">Pengfei Wan</a><sup>2</sup><br>
<!-- <a href="https://openreview.net/profile?id=~Di_ZHANG3">Di Zhang</a><sup>3</sup>
<a href="https://openreview.net/profile?id=~Kun_Gai1">Kun Gai</a><sup>3</sup> -->
<a href="https://hszhao.github.io/">Hengshuang Zhao</a><sup>1✉</sup>
</p>
<p>
<sup>1</sup>The University of Hong Kong
<sup>2</sup>Kling Team, Kuaishou Technology<br>
<sup>3</sup>Hong Kong University of Science and Technology (Guangzhou)
<!-- <sup>3</sup>HKUST(GZ) -->
<sup>✉</sup>Corresponding author
</p>
</div>
<p align="center">
<a href='https://sihuiji.github.io/MemFlow.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href='https://www.youtube.com/watch?v=7l7-WlIrgHg'><img src='https://img.shields.io/static/v1?label=Youtube&message=DemoVideo&color=yellow&logo=youtube'></a>
<a href="https://huggingface.co/papers/2512.14699"><img src="https://img.shields.io/badge/Paper-MemFlow-red?logo=huggingface"></a>
<a href='https://github.com/KlingTeam/MemFlow'><img src='https://img.shields.io/badge/GitHub-Code-blue?logo=github'></a>
<a href='https://huggingface.co/KlingTeam/MemFlow'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-orange'></a>
</p>
<!-- **Note:** This open-source repository is intended to provide a reference implementation. Due to the difference in the underlying I2V model's performance, the open-source version may not achieve the same performance as the model in our paper. -->
## 🔥 Updates
- __[2025.12.14]__: Training and inference code and [model checkpoints](https://huggingface.co/KlingTeam/MemFlow) are available.
<!-- - __[2025.09.25]__: [CamCloneMaster](https://arxiv.org/abs/2506.03140) has been accepted by SIGGRAPH Aisa 2025. -->
<!-- - __[2025.09.08]__: [CameraClone Dataset](https://huggingface.co/datasets/KwaiVGI/CameraClone-Dataset/) is avaliable. -->
- __[2025.12.14]__: Released the [project page](https://sihuiji.github.io/MemFlow.github.io/) and the [paper](https://huggingface.co/papers/2512.14699).
## 📷 Introduction
**TL;DR:**
We propose MemFlow to address the core challenge of long-context consistency and narrative coherence in streaming video generation.
Before generating each chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to that chunk's text prompt.
During generation, each query in the attention layers attends only to the most relevant tokens in the memory bank, which keeps generation efficient.
As a result, MemFlow achieves strong long-context consistency with negligible computational overhead (a 7.9% speed reduction relative to the memory-free baseline) and remains compatible with any streaming video generation model that uses a KV cache.
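As a rough illustration of the retrieval step described above, prompt-conditioned memory selection can be sketched as follows. The function name, the use of cosine similarity, and the tensor shapes are our assumptions for this sketch, not the repo's actual implementation:

```python
import numpy as np

def retrieve_memory(history_feats, prompt_emb, bank_size):
    """Keep the `bank_size` historical frame features most similar to the
    embedding of the upcoming chunk's text prompt (cosine similarity)."""
    frames = history_feats / np.linalg.norm(history_feats, axis=-1, keepdims=True)
    query = prompt_emb / np.linalg.norm(prompt_emb)
    scores = frames @ query                 # one relevance score per frame
    k = min(bank_size, len(history_feats))
    top = np.argsort(-scores)[:k]           # indices of the k best frames
    return history_feats[np.sort(top)]      # preserve temporal order
```

During attention, each query then attends only to tokens drawn from this small bank rather than the full history, which is what bounds the extra cost as the video grows.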
<div align="center">
[Demo Video](https://www.youtube.com/watch?v=7l7-WlIrgHg)
</div>
## 📌 Highlights
1. **Long-Context Memory with Limited Capacity**: MemFlow maintains long-range memory for visual consistency within a tightly constrained capacity, keeping computation and storage lightweight.
2. **Adaptive Retrieval for Narrative Coherence**:
MemFlow dynamically retrieves from memory the historical frames most relevant to the upcoming chunk's text prompt, ensuring narrative coherence.
3. **Efficient, Real-Time Inference**:
MemFlow supports real-time generation at 18.7 FPS on a single H100 GPU, sacrificing only 7.9% of the speed of the memory-free baseline.
<!-- ## 🌄 Gallery -->
<!-- ## 📑 Open-source Plan
- [x] Inference code
- [x] Model checkpoints
- [x] Training code -->
## 🛠️ Installation
**Requirements**
We tested this repo on the following setup:
* NVIDIA GPU with 80 GB of memory (A100 and A800 tested).
* Linux operating system.

Other hardware setups may also work but have not been tested.
**Environment**
Create a conda environment and install dependencies:
``` sh
git clone https://github.com/KlingTeam/MemFlow
cd MemFlow
conda create -n memflow python=3.10 -y
conda activate memflow
conda install nvidia/label/cuda-12.4.1::cuda
conda install -c nvidia/label/cuda-12.4.1 cudatoolkit
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
## 🧱 Download Checkpoints
Download models using huggingface-cli:
``` sh
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download KlingTeam/MemFlow --local-dir checkpoints
```
or using git:
``` sh
git lfs install
git clone https://huggingface.co/KlingTeam/MemFlow
```
## 🔑 Inference
<!-- **Download checkpoints** -->
**Single Prompt Video Generation**
``` sh
bash inference.sh
```
**Interactive Long Video Generation**
``` sh
bash interactive_inference.sh
```
**Hints for video prompts**
1. For each subject and background appearing in a video, keeping its description consistent across the different prompts within that video greatly improves global coherence during prompt switches. See the examples for the exact prompt sets we used to produce some of the videos on the demo page.
2. MemFlow supports diverse interactions: action changes, introducing or removing objects, background shifts, and more. While large-scale continuous camera motion can be achieved through appropriate cinematic language (see [`prompts/interactive_example.jsonl`](https://github.com/KlingTeam/MemFlow/blob/main/prompts/interactive_example.jsonl)), rapid shot-to-shot transitions and fast cutscene-style edits are not supported.
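Concretely, interactive generation boils down to a loop that refreshes the memory bank each time the prompt changes. The sketch below is a hypothetical driver: the JSONL schema (one object with a `prompt` field per line) and the two helper callables are assumptions for illustration, not the repo's actual API:

```python
import json

def run_interactive(jsonl_path, generate_chunk, update_memory):
    """Read one prompt per line from a JSONL file and generate the video
    segment by segment, refreshing the memory bank with each new prompt
    before its segment is generated."""
    memory = []        # retrieved memory bank
    video = []         # generated chunks so far
    with open(jsonl_path) as f:
        prompts = [json.loads(line)["prompt"] for line in f if line.strip()]
    for prompt in prompts:
        memory = update_memory(memory, video, prompt)  # retrieval step
        video.append(generate_chunk(prompt, memory))   # streaming generation
    return video
```

The key point is ordering: the memory bank is updated against the *new* prompt before that prompt's chunk is generated, so the retrieved context already reflects the upcoming narrative.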
## ⚙️ Training
**Download checkpoints**
Please follow [Self-Forcing](https://github.com/guandeh17/Self-Forcing) to download the text prompts and the ODE-initialized checkpoint.
Then download Wan2.1-T2V-14B as the teacher model:
``` sh
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir wan_models/Wan2.1-T2V-14B
```
**Stage 1: Self-Forcing Initialization for Memory Mechanism**
``` sh
bash train_init.sh
```
**Stage 2: Streaming Long Tuning**
``` sh
bash train_long.sh
```
**Hints for two-stage training**
The `bank_size` is a tunable hyperparameter specified in [`configs/train_init.yaml`](https://github.com/KlingTeam/MemFlow/blob/main/configs/train_init.yaml) and [`configs/train_long.yaml`](https://github.com/KlingTeam/MemFlow/blob/main/configs/train_long.yaml). It controls the number of latent frames stored in the memory bank. When `bank_size` matches the number of latent frames in the frame sink of [LongLive](https://github.com/NVlabs/LongLive) (as in our default setting), training can optionally start directly from Stage 2 (Streaming Long Tuning). Specifically, we initialize from the checkpoint [`longlive_base.pt`](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B/blob/main/models/longlive_base.pt) produced by Stage 1 of [LongLive](https://github.com/NVlabs/LongLive) and fine-tune only the LoRA parameters, which significantly improves training efficiency.
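For orientation, the capacity knob might sit in the training config roughly like this. Only `bank_size` is documented above; the other keys and all values here are hypothetical placeholders, not the repo's actual config:

```yaml
# Illustrative excerpt of configs/train_long.yaml (values are placeholders)
bank_size: 3                 # latent frames kept in the memory bank
init_checkpoint: checkpoints/longlive_base.pt  # hypothetical key: Stage-2 start point
```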
<!-- ## How to contribute
- Make sure to have git installed.
- Create your own [fork](https://github.com/NVlabs/LongLive/fork) of the project.
- Clone the repository on your local machine, using git clone and pasting the url of this project.
- Read both the `Requirements` and `Installation and Quick Guide` sections below.
- Commit and push your changes.
- Make a pull request when finished modifying the project. -->
## 🤗 Acknowledgement
- [LongLive](https://github.com/NVlabs/LongLive): the codebase we built upon. Thanks for their wonderful work.
- [Self-Forcing](https://github.com/guandeh17/Self-Forcing): the algorithm we built upon. Thanks for their wonderful work.
- [Wan](https://github.com/Wan-Video/Wan2.1): the base model we built upon. Thanks for their wonderful work.
## 🌟 Citation
Please leave us a star 🌟 and cite our paper if you find our work helpful.
``` bibtex
@misc{ji2025memflow,
title={MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives},
author={Ji, Sihui and Chen, Xi and Yang, Shuai and Tao, Xin and Wan, Pengfei and Zhao, Hengshuang},
year={2025},
eprint={2512.14699},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.14699},
}
```