# AsymmetricMagVitV2

Lightweight open-source reproduction of MagVitV2, fully aligned with the paper's functionality. Supports joint image and video encoding and decoding, as well as videos of arbitrary length and resolution.

* All spatio-temporal operators are implemented as causal 3D operators, avoiding the video instability caused by 2D+1D factorization and ensuring that the FVD does not suddenly spike (see the sketch below).
* The Encoder and Decoder support arbitrary resolutions and auto-regressive inference over arbitrary durations.
* Training mixes multiple resolutions and dynamic durations, so videos with an arbitrary odd number of frames can be decoded as long as GPU memory permits, demonstrating temporal extrapolation capability.
* The model closely follows MagVitV2 but with a reduced parameter count, particularly in the lightweight Encoder, which lowers the cost of caching VAE features.
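
"Causal 3D" here means every temporal kernel only looks at current and past frames, so the first frame can be encoded on its own and inference can be extended auto-regressively. Below is a minimal, illustrative PyTorch sketch of a causal 3D convolution; it is not the exact layer used in this repo, and the class name and kernel size are only placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that pads only on the past side of the time axis."""

    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                                 # pad in front of T only
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)  # symmetric H/W padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):  # x: (B, C, T, H, W)
        # F.pad order for 5D input: (W_left, W_right, H_left, H_right, T_front, T_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

x = torch.randn(1, 3, 17, 64, 64)   # 17 video frames
y = CausalConv3d(3, 8)(x)           # -> (1, 8, 17, 64, 64); frame t never sees frames after t
```
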
## Demo

### 4-channel VAE video reconstruction

##### Video reconstruction

<table>
<tr>
<td width="50%">
<a href="https://github-production-user-asset-6210df.s3.amazonaws.com/174133722/346252347-2ec0bc1b-7a32-4949-a68f-d512bd4c5411.gif?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240706%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240706T045606Z&X-Amz-Expires=300&X-Amz-Signature=1a67649f0e70c9928c7c11201d48a89d9205e39919e265bb95c3aaf24918d520&X-Amz-SignedHeaders=host&actor_id=174133722&key_id=0&repo_id=824684001">
<img src="https://github-production-user-asset-6210df.s3.amazonaws.com/174133722/346252347-2ec0bc1b-7a32-4949-a68f-d512bd4c5411.gif?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240706%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240706T045606Z&X-Amz-Expires=300&X-Amz-Signature=1a67649f0e70c9928c7c11201d48a89d9205e39919e265bb95c3aaf24918d520&X-Amz-SignedHeaders=host&actor_id=174133722&key_id=0&repo_id=824684001" alt="60s 3840x2160" style="width: 100%;">
</a>
</td>
<td width="50%">
<a href="https://github-production-user-asset-6210df.s3.amazonaws.com/174133722/346252007-c8df652d-e7b9-42ff-a554-1271d5a0bce1.gif?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240706%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240706T045622Z&X-Amz-Expires=300&X-Amz-Signature=f7cc4b1d70ecd17bbf546e517593316cf098209bcf1950122e33a453252d8611&X-Amz-SignedHeaders=host&actor_id=174133722&key_id=0&repo_id=824684001">
<img src="data/show/gif/vae_4z_bf16_sw_17_tokyo_walk_h264_16s.gif" alt="60s 1920x1080" style="width: 100%;">
</a>
</td>
</tr>
</table>

* Converting MP4 to GIF may cause detail loss, pixelation, and truncated duration; watching the original videos is recommended for the best experience.

###### 60s 3840x2160

[bilibili_Black Myth: Wu Kong ULR 4zVAE](https://www.bilibili.com/video/BV1mjaPe8EWn/?spm_id_from=333.999.0.0&vd_source=681432e843390b0f7192d64fa4ed9613)

###### 60s 1920x1080

[bilibili_tokyo_walk ULR 4zVAE](https://www.bilibili.com/video/BV1cCaceLEiq/?t=8.7&vd_source=681432e843390b0f7192d64fa4ed9613)

##### Image reconstruction

<table>
<tr>
<td><img src="data/show/images/4z/mj_1.png" alt="1" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_2.png" alt="2" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_3.png" alt="3" style="width:100%;"></td>
</tr>
<tr>
<td><img src="data/show/images/4z/mj_4.png" alt="4" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_5.png" alt="5" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_6.png" alt="6" style="width:100%;"></td>
</tr>
<tr>
<td><img src="data/show/images/4z/mj_7.png" alt="7" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_8.png" alt="8" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_9.png" alt="9" style="width:100%;"></td>
</tr>
</table>

* The original images are located in data/images.

## Contents

- [Installation](#installation)
- [Model Weights](#model-weights)
- [Metric](#metric)
- [Inference](#inference)
- [TODO List](#1)
- [Contact Us](#2)
- [Reference](#3)

### Installation

<a name="installation"></a>

#### 1. Clone the repo

```shell
git clone https://github.com/bornfly-detachment/AsymmetricMagVitV2.git
cd AsymmetricMagVitV2
```

#### 2. Setting up the virtualenv

This assumes you have navigated to the `AsymmetricMagVitV2` root after cloning it.

```shell
# create a virtualenv and install the required packages from pypi
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install -r requirements/pt2.txt
```

### Model Weights

<details>
<summary>View more</summary>

| model | downsample (THW) | Encoder Size | Decoder Size |
|--------|--------|------|------|
| svd 2Dvae | 1x8x8 | 34M | 64M |
| AsymmetricMagVitV2 | 4x8x8 | 100M | 159M |

| model | Data | #iterations | URL |
|------------------------|--------------|-------------|-----------------------------------------------------------------------|
| AsymmetricMagVitV2_4z | 20M Intervid | 2 nodes, 1200k | [AsymmetricMagVitV2_4z](https://huggingface.co/BornFlyReborn/AsymmetricMagVitV2_4z) |
| AsymmetricMagVitV2_16z | 20M Intervid | 4 nodes, 860k | [AsymmetricMagVitV2_16z](https://huggingface.co/BornFlyReborn/AsymmetricMagVitV2_16z) |

</details>

### Metric

<a name="metric"></a>

| model | temporal frames | FVD | FID | PSNR | SSIM |
|-----|----|-----|-----|-----|-----|
| SVD VAE | 1 | 190.6 | 1.8 | 28.2 | 1.0 |
| openSoraPlan | 1 | 249.8 | 1.04 | 29.6 | 0.99 |
| openSoraPlan | 17 | 725.4 | 3.17 | 23.4 | 0.89 |
| openSoraPlan | 33 | 756.8 | 3.5 | 23 | 0.89 |
| AsymmetricMagVitV2_4z | 1 | 113.5 | 1.4 | 29.8 | 1.0 |
| AsymmetricMagVitV2_4z | 17 | 278.5 | 2.3 | 26.4 | 1.0 |
| AsymmetricMagVitV2_4z | 33 | 293.3 | 2.5 | 26.3 | 1.0 |

Note:

1. The test video is data/videos/tokyo_walk.mp4 at its original scale. Resize + CenterCrop 256 preprocessing was previously tested on a larger test set and showed consistent trends; since high-resolution, original-size video is the most challenging case for a 3D VAE, only this one video is reported here, sampled at 8 fps and evaluated on the first 10 seconds.
2. The evaluation code is in models/evaluation.py. It has not been run for a while and the inference code has since been modified; FID and FVD scores depend on the model, the original image preprocessing, the inference hyperparameters, and the randomness introduced by sampling from the encoder's KL posterior, so the numbers above cannot be reproduced exactly. They can still serve as a reference for designing your own benchmark.
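
If you do build your own benchmark, a per-frame PSNR computation is straightforward; the sketch below is illustrative only (it assumes matching uint8 frames and is not the code from models/evaluation.py):

```python
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR between an original frame and its reconstruction (same shape, uint8)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

# average over a clip:
# clip_psnr = np.mean([psnr(a, b) for a, b in zip(original_frames, reconstructed_frames)])
```
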
### Inference

#### About Encoder hyperparameter configuration

* slice frames spatially using: --max_size --min_size
* slice the video temporally using: --encoder_init_window --encoder_window

If GPU VRAM is insufficient, the maximum spatial slice size can be reduced to somewhere between 256 and 512 (see the example below).
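
As an illustration, these encoder flags could be combined with the inference script shown in the examples further down; note that the flag values here are arbitrary, and treating them as options of infer_vae.py is an assumption based on the sections above:

```shell
# hypothetical low-VRAM example: smaller spatial slices, 17-frame initial window
python infer_vae.py \
  --input_path data/videos/tokyo_walk.mp4 \
  --model_path vae_16z_bf16_hf \
  --output_folder vae_eval_out/low_vram \
  --min_size 256 --max_size 512 \
  --encoder_init_window 17 --encoder_window 16
```
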
#### About Decoder hyperparameter configuration

* slice the latent spatially using: --min_latent_size --max_latent_size

  (The default configuration needs more than 28GB of GPU VRAM. If VRAM is insufficient, the maximum latent slice size can be reduced to somewhere between 32 (256p/8) and 64 (512p/8).)

* slice the latent temporally using: --decoder_init_window

5 latent frames correspond to 17 frames of the original video. The relation is latent_T_dim = (frame_T_dim - 1) / temporal_downsample_num + 1; in this model temporal_downsample_num = 4, so (17 - 1) / 4 + 1 = 5. A small worked example follows.

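
The sketch below just restates that arithmetic in code (the constant and helper names are only for illustration):

```python
TEMPORAL_DOWNSAMPLE = 4  # temporal_downsample_num for this model

def frames_to_latent(frame_t: int) -> int:
    """Video frames -> latent frames; exact for frame counts of the form 1 + 4k (1, 5, 9, 13, 17, ...)."""
    return (frame_t - 1) // TEMPORAL_DOWNSAMPLE + 1

def latent_to_frames(latent_t: int) -> int:
    """Latent frames -> video frames (inverse of the relation above)."""
    return (latent_t - 1) * TEMPORAL_DOWNSAMPLE + 1

print(frames_to_latent(17))  # 5
print(latent_to_frames(5))   # 17
```
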
#### Use AsymmetricMagVitV2 in your own code

```python
import torch
import torchvision.transforms as transforms

from models.vae import AsymmetricMagVitV2Pipline
from models.utils.image_op import imdenormalize, imnormalize, read_video, read_image

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16
encoder_init_window = 17  # number of video frames in the initial encoding window
input_path = "data/videos/tokyo_walk.mp4"

# normalize frames to [-1, 1]
img_transform = transforms.Compose([transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
input, last_frame_id = read_video(input_path, encoder_init_window, sample_fps=8, img_transform=img_transform, start=0)

model = AsymmetricMagVitV2Pipline.from_pretrained("BornFly/AsymmetricMagVitV2_16z").to(device, dtype).eval()

# encode the initial window into latents, then decode back to pixels
init_z, reg_log = model.encode(input, encoder_init_window, is_init_image=True, return_reg_log=True, unregularized=False)
init_samples = model.decode(init_z.to(device, dtype), decode_batch_size=1, is_init_image=True)
```
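
The decoded samples are in the normalized [-1, 1] range implied by `img_transform`. The repo ships an `imdenormalize` helper for this, but its exact signature is not shown here; a generic way to get displayable frames looks like the sketch below (it assumes a (T, C, H, W) tensor; adjust the permute if the pipeline returns a different layout):

```python
import torch

def to_uint8_frames(samples: torch.Tensor) -> torch.Tensor:
    """Map a decoded (T, C, H, W) tensor in [-1, 1] to uint8 (T, H, W, C) frames."""
    samples = (samples.detach().float().clamp(-1, 1) + 1.0) / 2.0  # -> [0, 1]
    samples = (samples * 255.0).round().to(torch.uint8)
    return samples.permute(0, 2, 3, 1).contiguous()                # (T, H, W, C) for video writers
```
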

#### High-resolution video encoding and decoding, greater than 720p (spatial-temporal slicing)

##### 1. Encode & decode a video

```shell
python infer_vae.py --input_path data/videos/tokyo_walk.mp4 --model_path vae_16z_bf16_hf --output_folder vae_eval_out/vae_4z_bf16_hf_videos > infer_vae_video.log 2>&1
```

##### 2. Encode & decode images

```shell
python infer_vae.py --input_path data/images --model_path vae_16z_bf16_hf --output_folder vae_eval_out/vae_4z_bf16_hf_images > infer_vae_image.log 2>&1
```

### TODO List

<p id="1"></p>

* Reproduce Sora: a 16-channel VAE integrated with SD3. Due to limited computational resources, the focus is on generating 1K high-definition dynamic wallpapers.
* Reproduce VideoPoet: multimodal joint representation. Due to limited computational resources, the focus is on generating music videos.

### Contact Us

<p id="2"></p>

1. For code-related questions, feel free to contact me via email: bornflyborntochange@outlook.com.
2. Scan the QR code below to join the WeChat group; if it has expired, add this account as a friend first and you will be invited.

<img src="data/assets/mmqrcode1720196270375.png" alt="ding group" width="30%"/>

### Reference

<p id="3"></p>

- Open-Sora-Plan: https://github.com/PKU-YuanGroup/Open-Sora-Plan
- Open-Sora: https://github.com/hpcaitech/Open-Sora
- SVD: https://github.com/Stability-AI/generative-models