mack-williams committed on
Commit 9fcd27b · verified · 1 Parent(s): c87d042

Update README.md

Files changed (1)
  1. README.md +0 -28
README.md CHANGED
@@ -24,14 +24,6 @@ tags:
 
  (📧 denotes corresponding author.)
 
-
- https://github.com/user-attachments/assets/2daa9f17-329e-4019-8f14-68ac2c467592
-
-
- <em>
- (Results on Self Forcing 1.3B. Left: Dense Attention. Right: 1.3x acceleration using Light Forcing)
- </em>
-
  </div>
 
  ### 💡 Why Light Forcing
@@ -44,8 +36,6 @@ https://github.com/user-attachments/assets/2daa9f17-329e-4019-8f14-68ac2c467592
  ### 🧾 Introduction
  Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2-1.3x end-to-end speedup). Combined with other efficient solutions, Light Forcing further achieves a 2.0-3.0x end-to-end speedup across diverse GPUs (e.g., 27.4 FPS on RTX 5090 and 33.9 FPS on H100).
 
- <img src="assets/framework.png" width="90%" ></img>
-
  ## ✨ Quick Start
 
  ### Environment
@@ -135,15 +125,6 @@ python inference.py \
  <th>+Efficient kernel<br>(RoPE, RMSNorm, etc.)</th>
  <th>+Light VAE</th>
  </tr>
- <tr>
- <td>Video</td>
- <td>5 seconds</td>
- <td><video src="https://github.com/user-attachments/assets/59988b4e-e31e-4924-a4de-5d5cee2c4266" width="100%" controls loop></video></td>
- <td><video src="https://github.com/user-attachments/assets/9219bb6f-d1d8-4837-816b-ac29a77e13f4" width="100%" controls loop></video></td>
- <td><video src="https://github.com/user-attachments/assets/7c374e46-de08-4a45-a006-d7adf8fc61cd" width="100%" controls loop></video></td>
- <td><video src="https://github.com/user-attachments/assets/a4e064ad-7d61-4120-a6f3-f8a68be8ccab" width="100%" controls loop></video></td>
- <td><video src="https://github.com/user-attachments/assets/7682c7ae-6ffc-4305-a51f-5569d7e8338a" width="100%" controls loop></video></td>
- </tr>
  <tr>
  <td>Latency</td>
  <td>5 seconds</td>
@@ -171,15 +152,6 @@ python inference.py \
  <td>15.8G</td>
  <td>12.7G</td>
  </tr>
- <tr>
- <td>Video</td>
- <td>15 seconds</td>
- <td><video src="https://github.com/user-attachments/assets/746bb4ce-fa46-46eb-9de5-b7925976d1de" width="100%" controls loop></video></td>
- <td><video src="https://github.com/user-attachments/assets/9d8fcdd7-cb81-4539-ab45-acff5a754e7f" width="100%" controls loop></video></td>
- <td><video src="https://github.com/user-attachments/assets/b5d79bc5-969a-4125-b15c-32b7e3db327a" width="100%" controls loop></video></td>
- <td><video src="https://github.com/user-attachments/assets/daa46c3f-20c2-4920-8536-bf580f2d15e1" width="100%" controls loop></video></td>
- <td><video src="https://github.com/user-attachments/assets/2e5fc265-4ba4-481e-acfe-f716cb8f4e5f" width="100%" controls loop></video></td>
- </tr>
  <tr>
  <td>Latency</td>
  <td>15 seconds</td>
 