Add pipeline tag and Hugging Face paper link

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +11 -7
README.md CHANGED
@@ -1,20 +1,24 @@
---
- license: cc-by-nc-sa-4.0
- language:
- - en
base_model:
- Wan-AI/Wan2.1-T2V-1.3B
+ language:
+ - en
+ license: cc-by-nc-sa-4.0
tags:
- Real-time
- LongVideoGeneration
- Interactive
+ pipeline_tag: text-to-video
---
+
<p align="center" style="border-radius: 10px">
<img src="assets/LongLive-logo.png" width="100%" alt="logo"/>
</p>

# 🎬 LongLive: Real-time Interactive Long Video Generation

+ This model is presented in the paper [LongLive: Real-time Interactive Long Video Generation](https://huggingface.co/papers/2509.22622).
+
[![Paper](https://img.shields.io/badge/ArXiv-Paper-brown)](https://arxiv.org/abs/2509.22622)
[![Code](https://img.shields.io/badge/GitHub-LongLive-blue)](https://github.com/NVlabs/LongLive)
[![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B)
@@ -32,7 +36,7 @@ tags:
**LongLive: Real-time Interactive Long Video Generation [[Paper](https://arxiv.org/abs/2509.22622)]** <br />
[Shuai Yang](https://andysonys.github.io/), [Wei Huang](https://aaron-weihuang.com/), [Ruihang Chu](https://ruihang-chu.github.io/), [Yicheng Xiao](https://easonxiao-888.github.io/), [Yuyang Zhao](https://yuyangzhao.com/), [Xianbang Wang](https://peppaking8.github.io/), [Muyang Li](https://lmxyy.me/), [Enze Xie](https://xieenze.github.io/), [Yingcong Chen](https://www.yingcong.me/), [Yao Lu](https://scholar.google.com/citations?user=OI7zFmwAAAAJ&hl=en), [Song Han](http://songhan.mit.edu/), [Yukang Chen](https://yukangchen.com/) <br />

- We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases the complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with the new prompt for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long–test-long); and short window attention paired with a frame-level attention sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short- and long-video settings. LongLive supports up to 240-second videos on a single H100 GPU.
+ We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long–test-long); and short window attention paired with a frame-level attention sink (frame sink for short), preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench on both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU.
With FP8 quantization, LongLive boosts inference to 24.8 FPS with marginal quality loss.

## News
@@ -41,7 +45,7 @@ With FP8 quantization, LongLive boosts inference to 24.8 FPS with marginal quali
## Highlights
1. **Long Video Gen**: LongLive supports up to 240s video generation, with visual consistency.
2. **Real-time Inference**: LongLive supports 20.7 FPS generation speed on a single H100 GPU, and 24.8 FPS with FP8 quantization with marginal quality loss.
- 3. **Efficient Fine-tuning**: LongLive extends a short-clip model to minute-long generation in 32 H100 GPU-days.
+ 3. **Efficient Fine-tuning**: LongLive extends a short-clip model to minute-long generation in just 32 H100 GPU-days.

## Introduction
<p align="center" style="border-radius: 10px">
@@ -138,5 +142,5 @@ Please consider to cite our paper and this framework, if they are helpful in you
- LongLive-1.3B model weight is under CC-BY-NC 4.0 license.

## Acknowledgement
- - [Self-Forcing](https://github.com/hiyouga/EasyR1): the codebase and algorithm we built upon. Thanks for their wonderful work.
- - [Wan](https://github.com/volcengine/verl): the base model we built upon. Thanks for their wonderful work.
+ - [Self-Forcing](https://github.com/guandeh17/Self-Forcing): the codebase and algorithm we built upon. Thanks for their wonderful work.
+ - [Wan](https://github.com/Wan-Video/Wan2.1): the base model we built upon. Thanks for their wonderful work.
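
For context on the "short window attention paired with a frame-level attention sink" mentioned in the abstract above, here is a minimal sketch of how such an attention pattern could be built. It is an illustration only, not the LongLive code: the function name `build_frame_attention_mask` and the `sink_frames` / `window_frames` parameters are hypothetical, and the real model applies this pattern over per-frame token blocks inside its attention layers rather than over single positions.

```python
import torch

def build_frame_attention_mask(num_frames: int, sink_frames: int, window_frames: int) -> torch.Tensor:
    """Boolean [num_frames, num_frames] mask; True means query frame i may attend to key frame j."""
    i = torch.arange(num_frames).unsqueeze(1)  # query frame index
    j = torch.arange(num_frames).unsqueeze(0)  # key frame index
    causal = j <= i                            # autoregressive: no attention to future frames
    in_sink = j < sink_frames                  # always keep the earliest frames (attention sink)
    in_window = (i - j) < window_frames        # plus a short window of recent frames
    return causal & (in_sink | in_window)

if __name__ == "__main__":
    mask = build_frame_attention_mask(num_frames=8, sink_frames=1, window_frames=3)
    print(mask.int())
    # The last frame attends to frame 0 (the sink) and frames 5-7 (the local window),
    # so per-frame attention cost stays bounded while early global context is preserved.
```

With a small sink and a short window, each new frame attends to only a handful of cached frames, which is consistent with the abstract's claim that the design enables faster generation while preserving long-range consistency.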