Yukang committed on
Commit 919edac · verified · 1 Parent(s): 5e3d59c

Update README.md

Files changed (1): README.md (+51, -20)

README.md CHANGED
@@ -1,6 +1,17 @@
  ---
  license: cc-by-nc-sa-4.0
+ language:
+ - en
+ base_model:
+ - Wan-AI/Wan2.1-T2V-1.3B
+ tags:
+ - Real-time
+ - LongVideoGeneration
+ - Interactive
  ---
+ <p align="center" style="border-radius: 10px">
+ <img src="assets/LongLive-logo.png" width="100%" alt="LongLive logo"/>
+ </p>

  # 🎬 LongLive: Real-time Interactive Long Video Generation

@@ -10,6 +21,12 @@ license: cc-by-nc-sa-4.0
  [![Video](https://img.shields.io/badge/YouTube-Video-red)](https://www.youtube.com/watch?v=CO1QC7BNvig)
  [![Demo](https://img.shields.io/badge/Demo-Page-bron)](https://nvlabs.github.io/LongLive)

+ <div align="center">
+
+ [![Watch the video](assets/video-first-frame.png)](https://www.youtube.com/watch?v=CO1QC7BNvig)
+
+ </div>
+
  ## 💡 TLDR: Turn interactive prompts into long videos—instantly, as you type!

  **LongLive: Real-time Interactive Long Video Generation [[Paper](https://arxiv.org/abs/xxx)]** <br />
@@ -18,7 +35,6 @@ license: cc-by-nc-sa-4.0
  We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases the complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with the new prompt for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long–test-long); and short window attention paired with a frame-level attention sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short- and long-video settings. LongLive supports up to 240-second videos on a single H100 GPU.
  With FP8 quantization, LongLive boosts inference to 24.8 FPS with marginal quality loss.

-
  ## News
  - [x] [2025.9.25] We release the [Paper](https://arxiv.org/abs/xxx), the GitHub repo [LongLive](https://github.com/NVlabs/LongLive) with all training and inference code, the model weight [LongLive-1.3B](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B), and the demo page [Website](https://nvlabs.github.io/LongLive).

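The KV re-cache mechanism named in the abstract is the piece that makes prompt switching work: instead of clearing the cache (losing visual history) or reusing a cache built under the old prompt, the frames kept so far are re-encoded under the new prompt so every block's cached keys and values stay consistent with it. The sketch below is a minimal, self-contained illustration of that idea, not the LongLive implementation; `ToyCausalBlock` and `recache` are hypothetical names.

```
import torch
import torch.nn as nn

class ToyCausalBlock(nn.Module):
    """One causal self-attention block with cross-attention to the prompt."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.kv_proj = nn.Linear(dim, 2 * dim)  # produces the cached K and V

    def forward(self, x, prompt):
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        x = x + self.self_attn(x, x, x, attn_mask=causal)[0]
        x = x + self.cross_attn(x, prompt, prompt)[0]  # condition on the prompt
        k, v = self.kv_proj(x).chunk(2, dim=-1)
        return x, (k, v)

def recache(blocks, kept_frames, new_prompt_emb):
    """At a prompt switch, re-encode the frames generated so far under the
    NEW prompt, so the per-block KV cache reflects it (no cold restart)."""
    cache, h = [], kept_frames
    for blk in blocks:
        h, kv = blk(h, new_prompt_emb)
        cache.append(kv)
    return cache

blocks = nn.ModuleList([ToyCausalBlock(64) for _ in range(2)])
kept_frames = torch.randn(1, 12, 64)   # latents of the 12 frames kept so far
new_prompt = torch.randn(1, 7, 64)     # embedding of the newly typed prompt
cache = recache(blocks, kept_frames, new_prompt)
print([tuple(k.shape) for k, _ in cache])   # [(1, 12, 64), (1, 12, 64)]
```

Generation then resumes from the refreshed cache, so the transition inherits the visual history while adhering to the new instruction.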
@@ -27,6 +43,32 @@ With FP8 quantization, LongLive boosts inference to 24.8 FPS with marginal quality loss.
  2. **Real-time Inference**: LongLive supports a 20.7 FPS generation speed on a single H100 GPU, and 24.8 FPS with FP8 quantization at marginal quality loss.
  3. **Efficient Fine-tuning**: LongLive extends a short-clip model to minute-long generation in 32 H100 GPU-days.

+ ## Introduction
+ <p align="center" style="border-radius: 10px">
+ <img src="assets/pipeline.jpg" width="100%" alt="pipeline"/>
+ <strong>LongLive accepts sequential user prompts and generates corresponding videos in real time, enabling user-guided long video generation.</strong>
+ </p>
+ <p align="center" style="border-radius: 10px">
+ <img src="assets/framework.png" width="100%" alt="framework"/>
+ <strong>The framework of LongLive. (Left) Frame Sink + short window attention. (Right) KV re-cache.</strong>
+ </p>
+ <p align="center" style="border-radius: 10px">
+ <img src="assets/algo.png" width="100%" alt="algorithm"/>
+ <strong>The algorithm of streaming long tuning and interactive inference.</strong>
+ </p>
+ <p align="center" style="border-radius: 10px">
+ <img src="assets/frame_sink.png" width="100%" alt="frame sink"/>
+ <strong>The effectiveness of Frame Sink.</strong>
+ </p>
+ <p align="center" style="border-radius: 10px">
+ <img src="assets/effects-KV-recache.png" width="100%" alt="KV re-cache"/>
+ <strong>The effectiveness of KV re-cache: consistent transitions with new-prompt compliance.</strong>
+ </p>
+ <p align="center" style="border-radius: 10px">
+ <img src="assets/demo.png" width="100%" alt="demo"/>
+ <strong>Interactive 60s videos with 6 prompts. See our demo <a href="https://nvlabs.github.io/LongLive">Website</a> for video examples.</strong>
+ </p>
+

  ## Installation
  **Requirements**
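The framework figure above pairs short window attention with a frame-level attention sink. One way to picture that pattern is as a boolean attention mask: each frame attends causally to a short window of recent frames, while the first few frames stay visible to every later frame as a sink. The snippet below is an illustrative reconstruction with made-up sizes, not LongLive's actual masking code.

```
import torch

def sink_window_mask(num_frames: int, window: int, sink: int) -> torch.Tensor:
    """Boolean mask (True = blocked) over frame indices: causal attention
    restricted to the last `window` frames, plus always-visible sink frames."""
    q = torch.arange(num_frames).unsqueeze(1)   # query frame index (rows)
    k = torch.arange(num_frames).unsqueeze(0)   # key frame index (columns)
    causal = k <= q                 # never attend to future frames
    local = (q - k) < window        # short window of recent frames
    is_sink = k < sink              # first `sink` frames stay visible
    return ~(causal & (local | is_sink))

mask = sink_window_mask(num_frames=10, window=3, sink=1)
# Frame 8 attends to {0, 6, 7, 8}: the sink frame plus its short window.
print((~mask[8]).nonzero().flatten().tolist())   # -> [0, 6, 7, 8]
```

Even with a window of 3, every later frame still attends to frame 0, which is how a short window can keep long-range appearance anchored while cutting attention cost.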
@@ -67,25 +109,15 @@ bash inference.sh
  ```
  bash interactive_inference.sh
  ```
- ## Training
- **Download checkpoints**
-
- Please follow [Self-Forcing](https://github.com/guandeh17/Self-Forcing) to download the text prompts and the ODE-initialized checkpoint.

- Download Wan2.1-T2V-14B as the teacher model.

- ```
- huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir wan_models/Wan2.1-T2V-14B
- ```
-
- **Step 1: Self-Forcing Initialization for Short Window and Frame Sink**
- ```
- bash train_init.sh
- ```
- **Step 2: Streaming Long Tuning**
- ```
- bash train_long.sh
- ```
+ ## How to contribute
+ - Make sure you have git installed.
+ - Create your own [fork](https://github.com/NVlabs/LongLive/fork) of the project.
+ - Clone the repository to your local machine, using `git clone` with the URL of this project.
+ - Read both the `Requirements` and `Installation and Quick Guide` sections above.
+ - Commit and push your changes.
+ - Open a pull request when you have finished modifying the project.

  ## Citation
  Please consider citing our paper and this framework if they are helpful in your research.
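This hunk also drops the training recipe from the README (`bash train_init.sh` for the Self-Forcing initialization, then `bash train_long.sh` for streaming long tuning). As a rough mental model of the streaming long tuning the abstract describes: the model rolls the video out chunk by chunk with a carried state, the loss is applied per chunk, and the state is detached between chunks so memory does not grow with video length, matching how inference runs (train-long–test-long). The toy below only illustrates that loop shape; every name is hypothetical, a stand-in recurrent model rather than LongLive's trainer, and the real recipe lives in `train_long.sh`.

```
import torch
import torch.nn as nn

class ToyStudent(nn.Module):
    """Stand-in AR generator: maps a carried state plus the prompt to the next chunk."""
    def __init__(self, dim):
        super().__init__()
        self.prompt_proj = nn.Linear(dim, dim)
        self.step = nn.GRU(dim, dim, batch_first=True)

    def generate_chunk(self, prompt_emb, state, chunk_frames):
        # Drive the recurrence with the prompt embedding for each new frame.
        inp = self.prompt_proj(prompt_emb).repeat(1, chunk_frames, 1)
        return self.step(inp, state)

def streaming_long_tuning(student, optimizer, prompt_emb, num_chunks, chunk_frames):
    state = None  # carried across chunks, playing the role of the KV cache
    for _ in range(num_chunks):
        chunk, state = student.generate_chunk(prompt_emb, state, chunk_frames)
        loss = chunk.pow(2).mean()  # stand-in per-chunk objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = state.detach()  # no backprop across chunks: memory stays flat

student = ToyStudent(32)
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
prompt_emb = torch.randn(1, 1, 32)
streaming_long_tuning(student, opt, prompt_emb, num_chunks=6, chunk_frames=4)
```

The per-chunk objective here is a placeholder; an actual recipe would score each chunk against a teacher such as the Wan2.1-T2V-14B model mentioned above, but the carry-state, detach-per-chunk structure is the point.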
@@ -101,9 +133,8 @@ Please consider citing our paper and this framework if they are helpful in your research.
  ```

  ## License
- - LongLive code is under the CC-BY-NC-SA 4.0 license.
  - The LongLive-1.3B model weight is under the CC-BY-NC 4.0 license.

  ## Acknowledgement
  - [Self-Forcing](https://github.com/guandeh17/Self-Forcing): the codebase and algorithm we built upon. Thanks for their wonderful work.
- - [Wan](https://github.com/Wan-Video/Wan2.1): the base model we built upon. Thanks for their wonderful work.
+ - [Wan](https://github.com/Wan-Video/Wan2.1): the base model we built upon. Thanks for their wonderful work.