Improve model card: Add pipeline tag, update paper & GitHub links

#1
by nielsr (HF Staff), opened
Files changed (1)
  1. README.md +13 -7
README.md CHANGED
@@ -1,3 +1,7 @@
+ ---
+ pipeline_tag: text-to-video
+ ---
+
 <p align="center" >
 <img src="assets/logo.png" width="30%" >
 </p>
@@ -29,7 +33,9 @@
 &nbsp;
 <a href='https://www.youtube.com/watch?v=7l7-WlIrgHg'><img src='https://img.shields.io/static/v1?label=Youtube&message=DemoVideo&color=yellow&logo=youtube'></a>
 &nbsp;
- <a href=""><img src="https://img.shields.io/static/v1?label=Arxiv&message=MemFlow&color=red&logo=arxiv"></a>
+ <a href="https://huggingface.co/papers/2512.14699"><img src="https://img.shields.io/badge/Paper-MemFlow-red?logo=huggingface"></a>
+ &nbsp;
+ <a href='https://github.com/KlingTeam/MemFlow'><img src='https://img.shields.io/badge/GitHub-Code-blue?logo=github'></a>
 &nbsp;
 <a href='https://huggingface.co/KlingTeam/MemFlow'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-orange'></a>
 </p>
@@ -40,14 +46,14 @@
 - __[2025.12.14]__: Training and inference code and [model checkpoints](https://huggingface.co/KlingTeam/MemFlow) are available.
 <!-- - __[2025.09.25]__: [CamCloneMaster](https://arxiv.org/abs/2506.03140) has been accepted by SIGGRAPH Asia 2025. -->
 <!-- - __[2025.09.08]__: The [CameraClone Dataset](https://huggingface.co/datasets/KwaiVGI/CameraClone-Dataset/) is available. -->
- - __[2025.12.14]__: Release the [project page](https://sihuiji.github.io/MemFlow.github.io/) and the [Arxiv](https://arxiv.org/abs/2506.03140) version.
+ - __[2025.12.14]__: Released the [project page](https://sihuiji.github.io/MemFlow.github.io/) and the [paper](https://huggingface.co/papers/2512.14699).
 
 ## 📷 Introduction
 **TL;DR:**
 We propose MemFlow to address the core challenge of long-context consistency and narrative coherence in streaming video generation.
 Specifically, before generating the next chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to the chunk's text prompt.
 In addition, during generation, we activate only the most relevant memory-bank tokens for each query in the attention layers, which preserves generation efficiency.
- In this way, MemFlow achieves outstanding long-context consistency with negligible computation burden and keeps the compatibility with any streaming video generation model with KV cache.
+ In this way, MemFlow achieves outstanding long-context consistency with negligible computational overhead (a 7.9% speed reduction relative to the memory-free baseline) and remains compatible with any streaming video generation model that uses a KV cache.
 
 
 <div align="center">
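
For readers skimming the diff, the TL;DR above compresses the method into two steps: prompt-conditioned retrieval into a fixed-size memory bank, then sparse attention over that bank. Below is a minimal sketch of those two steps. The tensor shapes, the mean-pooled key used as a frame summary, cosine similarity as the retrieval score, and all function names are assumptions for illustration, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def update_memory_bank(history_k, history_v, prompt_emb, bank_size):
    """Keep the `bank_size` cached frames whose keys best match the
    upcoming chunk's prompt embedding (illustrative, not the repo's API)."""
    # history_k / history_v: [num_frames, tokens_per_frame, dim]; prompt_emb: [dim]
    frame_repr = history_k.mean(dim=1)  # one summary vector per cached frame
    scores = F.cosine_similarity(frame_repr, prompt_emb.unsqueeze(0), dim=-1)
    keep = scores.topk(min(bank_size, scores.numel())).indices
    return history_k[keep], history_v[keep]

def sparse_memory_attention(q, bank_k, bank_v, k_active=64):
    """Attend each query to only its top-`k_active` memory tokens."""
    k = bank_k.flatten(0, 1)  # [mem_tokens, dim]
    v = bank_v.flatten(0, 1)
    logits = (q @ k.T) / k.shape[-1] ** 0.5             # [num_queries, mem_tokens]
    vals, idx = logits.topk(min(k_active, k.shape[0]), dim=-1)
    weights = vals.softmax(dim=-1)                      # softmax over active tokens only
    return torch.einsum("qk,qkd->qd", weights, v[idx])  # [num_queries, dim]
```

Because each query attends to at most `k_active` memory tokens from a fixed-size bank, the attention cost over history stays bounded no matter how long the video grows, which is how an arbitrarily long history can add only a small constant per-chunk overhead.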
@@ -133,7 +139,7 @@ bash interactive_inference.sh
 
 1. For each subject and background appearing in a video, maintaining consistent descriptions across different prompts within the same video greatly improves global coherence during prompt switches. See the example for the exact prompt set we used to produce some of the videos on the demo page.
 
- 2. MemFlow supports diverse interaction—action changes, introducing/removing objects, background shifts, and more. While large-scale continuous camera motions can be achieved through appropriate cinematic language (see [`prompts/interactive_example.jsonl`](prompts/interactive_example.jsonl)), rapid shot-to-shot transitions or fast cutscene-style edits are not supported.
+ 2. MemFlow supports diverse interaction: action changes, introducing/removing objects, background shifts, and more. While large-scale continuous camera motions can be achieved through appropriate cinematic language (see [`prompts/interactive_example.jsonl`](https://github.com/KlingTeam/MemFlow/blob/main/prompts/interactive_example.jsonl)), rapid shot-to-shot transitions and fast cutscene-style edits are not supported.
 
 ## ⚙️ Training
 **Download checkpoints**
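
The interaction described in hint 2 is driven by a per-chunk prompt schedule. A hypothetical reader for a JSONL schedule in the spirit of `prompts/interactive_example.jsonl` might look like the following; the `chunk` and `prompt` field names are assumptions, not the file's actual schema:

```python
import json

# Read the per-chunk prompt schedule; each line steers the next generated chunk.
with open("prompts/interactive_example.jsonl") as f:
    schedule = [json.loads(line) for line in f if line.strip()]

for step in schedule:
    # e.g. {"chunk": 4, "prompt": "The gray cat turns as the camera slowly dollies in."}
    print(step["chunk"], step["prompt"])
```

Per hint 1, a subject should keep the same wording ("the gray cat") in every prompt of the schedule so retrieval and attention can line up across prompt switches.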
@@ -157,7 +163,7 @@ bash train_long.sh
 
 **Hints for two-stage training**
 
- The `bank_size` is a tunable hyperparameter specified in [`configs/train_init.yaml`](configs/train_init.yaml) and [`configs/train_long.yaml`](configs/train_long.yaml). It controls the number of latent frames stored in the memory bank. When `bank_size` matches the number of latent frames of frame sink in [LongLive](https://github.com/NVlabs/LongLive) (as in our default setting), training can optionally start directly from Stage 2 (Streaming Long Tuning). Specifically, we initialize from the checkpoint [`longlive_base.pt`](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B/blob/main/models/longlive_base.pt) obtained in Stage 1 of [LongLive](https://github.com/NVlabs/LongLive) and fine-tune only the LoRA parameters, which significantly improves training efficiency.
+ `bank_size` is a tunable hyperparameter specified in [`configs/train_init.yaml`](https://github.com/KlingTeam/MemFlow/blob/main/configs/train_init.yaml) and [`configs/train_long.yaml`](https://github.com/KlingTeam/MemFlow/blob/main/configs/train_long.yaml). It controls the number of latent frames stored in the memory bank. When `bank_size` matches the number of latent frames in the frame sink of [LongLive](https://github.com/NVlabs/LongLive) (as in our default setting), training can optionally start directly from Stage 2 (Streaming Long Tuning). Specifically, we initialize from the checkpoint [`longlive_base.pt`](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B/blob/main/models/longlive_base.pt) obtained in Stage 1 of [LongLive](https://github.com/NVlabs/LongLive) and fine-tune only the LoRA parameters, which significantly improves training efficiency.
 
 
 <!-- ## How to contribute
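
A minimal sketch of the Stage-2 shortcut described in the added hint: when the bank holds exactly as many latent frames as LongLive's frame sink, start from LongLive's Stage-1 checkpoint and leave only the LoRA parameters trainable. The `bank_size` value, checkpoint path, model constructor, and the `lora` substring used to pick out LoRA parameters are all illustrative assumptions, not the repository's actual training code:

```python
import torch

def build_model() -> torch.nn.Module:
    """Stand-in for the repository's model constructor."""
    return torch.nn.Module()

bank_size = 3           # as set in configs/train_long.yaml (illustrative value)
frame_sink_latents = 3  # LongLive's frame-sink size (illustrative value)

model = build_model()
if bank_size == frame_sink_latents:
    # Skip Stage 1: initialize from LongLive's Stage-1 checkpoint ...
    state = torch.load("models/longlive_base.pt", map_location="cpu")
    model.load_state_dict(state, strict=False)  # strict=False: memory modules are new
    # ... and fine-tune only the LoRA parameters.
    for name, param in model.named_parameters():
        param.requires_grad = "lora" in name.lower()
```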
@@ -182,9 +188,9 @@ Please leave us a star 🌟 and cite our paper if you find our work helpful.
 title={MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives},
 author={Ji, Sihui and Chen, Xi and Yang, Shuai and Tao, Xin and Wan, Pengfei and Zhao, Hengshuang},
 year={2025},
- eprint={2512.xxxxx},
+ eprint={2512.14699},
 archivePrefix={arXiv},
 primaryClass={cs.CV},
- url={https://arxiv.org/abs/2512.xxxxx},
+ url={https://arxiv.org/abs/2512.14699},
 }
 ```
 