# CascadeV | An Implementation of the Würstchen Architecture for High-Resolution Video Generation

## News

**[2024.07.17]** We release the [code](https://github.com/bytedance/CascadeV) and pretrained [weights](https://huggingface.co/ByteDance/CascadeV) of a DiT-based video VAE, which supports video reconstruction with a high compression factor (1x32x32=1024). The T2V model is still on the way.

## Introduction

CascadeV is a video generation pipeline built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. By operating on a highly compressed latent representation, we can generate longer videos at higher resolutions.
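
For concreteness, the compression factor notation TxHxW multiplies per-axis downsampling: 1x32x32 means no temporal downsampling and 32x downsampling along each spatial axis. A minimal shape check (illustrative only; the latent channel count is omitted since it does not affect the factor):

```
# Illustrative shape arithmetic for the 1x32x32 = 1024 compression factor:
# no temporal downsampling, 32x downsampling along each spatial axis.
T, H, W = 8, 1024, 1024                 # e.g. an 8-frame 1024x1024 clip
t, h, w = T // 1, H // 32, W // 32      # latent grid: 8 x 32 x 32
factor = (T * H * W) // (t * h * w)
assert factor == 1 * 32 * 32 == 1024    # 1024x fewer latent positions
```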

## Video VAE

Comparison of Our Cascade Approach with Other VAEs (on a Latent Space of Shape 8x32x32)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/compare.png" />

Video Reconstruction: Original (left) vs. Reconstructed (right) | *Click to view the videos*

<table class="center">
<tr>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/1.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/1.jpg" /></a></td>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/2.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/2.jpg" /></a></td>
</tr>
<tr>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/3.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/3.jpg" /></a></td>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/4.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/4.jpg" /></a></td>
</tr>
</table>

### 1. Model Architecture

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/arch.jpg" />

#### 1.1 DiT

We use [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma) as our base model, with the following modifications:

* Replace the original VAE (from [SDXL](https://arxiv.org/abs/2307.01952)) with the one from [Stable Video Diffusion](https://github.com/Stability-AI/generative-models).
* Use the semantic compressor from [StableCascade](https://github.com/Stability-AI/StableCascade) to provide the low-resolution latent input.
* Remove the text encoder and all multi-head cross-attention layers, since we do not use text conditioning.
* Replace all 2D attention layers with 3D ones. We find that 3D attention outperforms 2+1D attention (i.e., alternating spatial and temporal attention), especially in temporal consistency; a sketch of the two layouts follows this list.
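
To make the last point concrete, here is a minimal sketch contrasting full 3D attention with factorized 2+1D attention. These are not the repo's actual modules: `attn`, `spatial_attn`, and `temporal_attn` stand in for any sequence-attention layer, and `einops` is assumed.

```
from einops import rearrange

def attn_3d(x, attn):
    # x: (B, T, H, W, C) -> one joint sequence of T*H*W space-time tokens.
    B, T, H, W, C = x.shape
    x = rearrange(x, "b t h w c -> b (t h w) c")
    x = attn(x)  # every token attends to every other token in space-time
    return rearrange(x, "b (t h w) c -> b t h w c", t=T, h=H, w=W)

def attn_2p1d(x, spatial_attn, temporal_attn):
    # Factorized 2+1D: spatial attention within each frame, then temporal
    # attention along each spatial position; cheaper, but weaker coupling.
    B, T, H, W, C = x.shape
    x = rearrange(x, "b t h w c -> (b t) (h w) c")
    x = spatial_attn(x)
    x = rearrange(x, "(b t) (h w) c -> (b h w) t c", b=B, t=T, h=H, w=W)
    x = temporal_attn(x)
    return rearrange(x, "(b h w) t c -> b t h w c", b=B, h=H, w=W)
```

The quadratic cost of the 3D variant in T\*H\*W is what motivates the grid attention described below.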

Comparison of 2+1D Attention (left) vs. 3D Attention (right)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/2d1d_vs_3d.gif" />

#### 1.2 Grid Attention

3D attention requires far more computation than 2D or 2+1D attention, especially at higher resolutions. As a compromise, we replace some 3D attention layers with alternating spatial and temporal grid attention.

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/grid.jpg" />
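
As an illustration of the general idea (the exact partitioning in CascadeV may differ; see the figure above), below is a hypothetical strided spatial grid-attention layout, where tokens occupying the same offset in each cell attend to one another. A temporal counterpart would group tokens along the time axis instead.

```
from einops import rearrange

def spatial_grid_attention(x, attn, g=4):
    """Sketch of strided spatial grid attention over video tokens.

    x: (B, T, H, W, C) tokens; attn: any module on (B', N, C) sequences;
    g: grid size. Each sequence holds the g*g tokens that share the same
    offset within their cell, so sequence length is g*g instead of H*W.
    """
    B, T, H, W, C = x.shape
    x = rearrange(x, "b t (g1 h) (g2 w) c -> (b t h w) (g1 g2) c", g1=g, g2=g)
    x = attn(x)  # attention over g*g strided grid positions
    return rearrange(x, "(b t h w) (g1 g2) c -> b t (g1 h) (g2 w) c",
                     b=B, t=T, h=H // g, w=W // g, g1=g, g2=g)
```

Because each sequence is only g\*g tokens long, the cost grows linearly with resolution rather than quadratically.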

### 2. Evaluation

Dataset: We perform a quantitative comparison with other baselines on [Inter4K](https://alexandrosstergiou.github.io/datasets/Inter4K/index.html), sampling the first 200 videos from Inter4K to create an evaluation set at 1024x1024 resolution and 30 FPS.

Metrics: We use PSNR, SSIM, and LPIPS to evaluate per-frame quality (i.e., the similarity between the original and reconstructed videos), and [VBench](https://github.com/Vchitect/VBench) to evaluate video quality independently of the reference.
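
The repo's exact evaluation script is not shown here; the sketch below, built on the common `scikit-image` and `lpips` packages, shows how such per-frame metrics are typically computed and averaged over a clip.

```
import lpips
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # LPIPS expects tensors in [-1, 1]

def frame_metrics(orig, recon):
    """orig, recon: uint8 arrays of shape (T, H, W, 3)."""
    psnr, ssim, lp = [], [], []
    for a, b in zip(orig, recon):
        psnr.append(peak_signal_noise_ratio(a, b, data_range=255))
        ssim.append(structural_similarity(a, b, channel_axis=-1, data_range=255))
        ta = torch.from_numpy(a).permute(2, 0, 1).float()[None] / 127.5 - 1.0
        tb = torch.from_numpy(b).permute(2, 0, 1).float()[None] / 127.5 - 1.0
        lp.append(lpips_fn(ta, tb).item())
    return np.mean(psnr), np.mean(ssim), np.mean(lp)
```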

#### 2.1 PSNR/SSIM/LPIPS

Diffusion-based VAEs (such as StableCascade and our model) score poorly on reconstruction metrics: they produce videos with more fine-grained detail, but less similar to the originals.

| Model/Compression Factor | PSNR↑ | SSIM↑ | LPIPS↓ |
| -- | -- | -- | -- |
| Open-Sora-Plan v1.1/4x8x8=256 | 25.7282 | 0.8000 | 0.1030 |
| EasyAnimate v3/4x8x8=256 | **28.8666** | **0.8505** | **0.0818** |
| StableCascade/1x32x32=1024 | 24.3336 | 0.6896 | 0.1395 |
| Ours/1x32x32=1024 | 23.7320 | 0.6742 | 0.1786 |

#### 2.2 VBench

Our approach performs comparably to previous VAEs in both frame-wise and temporal quality, even with a much larger compression factor.

| Model/Compression Factor | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Imaging Quality | Aesthetic Quality |
| -- | -- | -- | -- | -- | -- | -- |
| Open-Sora-Plan v1.1/4x8x8=256 | 0.9519 | 0.9618 | 0.9573 | 0.9789 | 0.6791 | 0.5450 |
| EasyAnimate v3/4x8x8=256 | 0.9578 | **0.9695** | 0.9615 | **0.9845** | 0.6735 | 0.5535 |
| StableCascade/1x32x32=1024 | 0.9490 | 0.9517 | 0.9430 | 0.9639 | **0.6811** | **0.5675** |
| Ours/1x32x32=1024 | **0.9601** | 0.9679 | **0.9626** | 0.9837 | 0.6747 | 0.5579 |

### 3. Usage

#### 3.1 Installation

We recommend using Conda:

```
conda create -n cascadev python==3.9.0
conda activate cascadev
```

Install [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma):

```
bash install.sh
```

#### 3.2 Download Pretrained Weights

```
bash pretrained/download.sh
```

#### 3.3 Video Reconstruction

A sample script for video reconstruction with a compression factor of 32:

```
bash recon.sh
```

Results of Video Reconstruction: w/o LDM (left) vs. w/ LDM (right)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/w_vs_wo_ldm.png" />

*It takes almost 1 minute to reconstruct a video of shape 8x1024x1024 on a single NVIDIA A800.*

#### 3.4 Train VAE

* Replace `video_list` in `configs/s1024.effn-f32.py` with your own video datasets.
* Then run:

```
bash train_vae.sh
```

## Acknowledgement

* [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma): the **main codebase** we built upon.
* [StableCascade](https://github.com/Stability-AI/StableCascade): the Würstchen architecture we used.
* Thanks to [Stable Video Diffusion](https://github.com/Stability-AI/generative-models) for its amazing video VAE.