BornFly committed · Commit f9e029c · verified · 1 parent: cd1fa90

Update README.md

Files changed (1): README.md (+250, −3)
README.md CHANGED

@@ -1,3 +1,250 @@
- ---
- license: mit
- ---

# AsymmetricMagVitV2

A lightweight open-source reproduction of MagVitV2, fully aligned with the paper's functionality. It supports joint image and video encoding and decoding, as well as videos of arbitrary length and resolution.

* All spatio-temporal operators are implemented with causal 3D convolutions, avoiding the video instability introduced by 2D+1D factorization and preventing sudden FVD spikes (see the sketch after this list).
* The Encoder and Decoder support arbitrary resolutions and auto-regressive inference over arbitrary durations.
* Training mixes multiple resolutions and dynamic durations, so videos with an arbitrary odd number of frames can be decoded as long as GPU memory permits, demonstrating temporal extrapolation capability.
* The model closely follows MagVitV2 but with a reduced parameter count, particularly in the lightweight Encoder, which lowers the cost of caching VAE features.
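
The causal-3D point is worth making concrete: temporal padding is applied only toward the past, so each output frame never depends on future frames. A minimal PyTorch sketch of the idea (an illustration of the principle, not the repository's actual module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution whose temporal padding is entirely on the past side,
    so the output at frame t only sees frames <= t."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        k = kernel_size
        self.time_pad = k - 1        # all temporal padding goes to the left (past)
        self.space_pad = k // 2      # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, (k, k, k))

    def forward(self, x):            # x: (B, C, T, H, W)
        x = F.pad(x, (self.space_pad, self.space_pad,   # W
                      self.space_pad, self.space_pad,   # H
                      self.time_pad, 0))                # T: pad past only
        return self.conv(x)

video = torch.randn(1, 3, 17, 64, 64)                  # 17 frames
print(CausalConv3d(3, 8)(video).shape)                 # torch.Size([1, 8, 17, 64, 64])
```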


## Demo

### 4-channel VAE video reconstruction

##### Video reconstruction

<table>
<tr>
<td width="50%">
<a href="https://github-production-user-asset-6210df.s3.amazonaws.com/174133722/346252347-2ec0bc1b-7a32-4949-a68f-d512bd4c5411.gif?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240706%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240706T045606Z&X-Amz-Expires=300&X-Amz-Signature=1a67649f0e70c9928c7c11201d48a89d9205e39919e265bb95c3aaf24918d520&X-Amz-SignedHeaders=host&actor_id=174133722&key_id=0&repo_id=824684001">
<img src="https://github-production-user-asset-6210df.s3.amazonaws.com/174133722/346252347-2ec0bc1b-7a32-4949-a68f-d512bd4c5411.gif?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240706%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240706T045606Z&X-Amz-Expires=300&X-Amz-Signature=1a67649f0e70c9928c7c11201d48a89d9205e39919e265bb95c3aaf24918d520&X-Amz-SignedHeaders=host&actor_id=174133722&key_id=0&repo_id=824684001" alt="60s 3840x2160" style="width: 100%;">
</a>
</td>
<td width="50%">
<a href="https://github-production-user-asset-6210df.s3.amazonaws.com/174133722/346252007-c8df652d-e7b9-42ff-a554-1271d5a0bce1.gif?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240706%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240706T045622Z&X-Amz-Expires=300&X-Amz-Signature=f7cc4b1d70ecd17bbf546e517593316cf098209bcf1950122e33a453252d8611&X-Amz-SignedHeaders=host&actor_id=174133722&key_id=0&repo_id=824684001">
<img src="data/show/gif/vae_4z_bf16_sw_17_tokyo_walk_h264_16s.gif" alt="60s 1920x1080" style="width: 100%;">
</a>
</td>
</tr>
</table>

* Converting MP4 to GIF can lose detail, introduce pixelation, and truncate duration. Watching the original videos is recommended for the best experience.

###### 60s 3840x2160

[bilibili: Black Myth: Wukong ULR 4zVAE](https://www.bilibili.com/video/BV1mjaPe8EWn/?spm_id_from=333.999.0.0&vd_source=681432e843390b0f7192d64fa4ed9613)

###### 60s 1920x1080

[bilibili: tokyo_walk ULR 4zVAE](https://www.bilibili.com/video/BV1cCaceLEiq/?t=8.7&vd_source=681432e843390b0f7192d64fa4ed9613)

##### Image reconstruction

<table>
<tr>
<td><img src="data/show/images/4z/mj_1.png" alt="1" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_2.png" alt="2" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_3.png" alt="3" style="width:100%;"></td>
</tr>
<tr>
<td><img src="data/show/images/4z/mj_4.png" alt="4" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_5.png" alt="5" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_6.png" alt="6" style="width:100%;"></td>
</tr>
<tr>
<td><img src="data/show/images/4z/mj_7.png" alt="7" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_8.png" alt="8" style="width:100%;"></td>
<td><img src="data/show/images/4z/mj_9.png" alt="9" style="width:100%;"></td>
</tr>
</table>

* The original images are located in `data/images`.

## Contents

- [Installation](#installation)
- [Model Weights](#model-weights)
- [Metric](#metric)
- [Inference](#inference)
- [TODO List](#1)
- [Contact Us](#2)
- [Reference](#3)

### Installation

<a name="installation"></a>

#### 1. Clone the repo

```shell
git clone https://github.com/bornfly-detachment/AsymmetricMagVitV2.git
cd AsymmetricMagVitV2
```

#### 2. Set up the virtualenv

This assumes you have navigated to the `AsymmetricMagVitV2` root after cloning it.

```shell
# create a virtualenv and install the required packages from PyPI
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install -r requirements/pt2.txt
```


### Model Weights

<details>
<summary>View more</summary>

| model | downsample (T×H×W) | Encoder size | Decoder size |
|--------|--------|------|------|
| SVD 2D VAE | 1×8×8 | 34M | 64M |
| AsymmetricMagVitV2 | 4×8×8 | 100M | 159M |

| model | data | #iterations | URL |
|------------------------|--------------|-------------|-----------------------------------------------------------------------|
| AsymmetricMagVitV2_4z | 20M InternVid | 1200k (2 nodes) | [AsymmetricMagVitV2_4z](https://huggingface.co/BornFlyReborn/AsymmetricMagVitV2_4z) |
| AsymmetricMagVitV2_16z | 20M InternVid | 860k (4 nodes) | [AsymmetricMagVitV2_16z](https://huggingface.co/BornFlyReborn/AsymmetricMagVitV2_16z) |

</details>

### Metric

<a name="metric"></a>

| model | temporal frames | FVD | FID | PSNR | SSIM |
|-----|----|-----------|--|----|----|
| SVD VAE | 1 | 190.6 | 1.8 | 28.2 | 1.0 |
| openSoraPlan | 1 | 249.8 | 1.04 | 29.6 | 0.99 |
| openSoraPlan | 17 | 725.4 | 3.17 | 23.4 | 0.89 |
| openSoraPlan | 33 | 756.8 | 3.5 | 23 | 0.89 |
| AsymmetricMagVitV2_4z | 1 | 113.5 | 1.4 | 29.8 | 1.0 |
| AsymmetricMagVitV2_4z | 17 | 278.5 | 2.3 | 26.4 | 1.0 |
| AsymmetricMagVitV2_4z | 33 | 293.3 | 2.5 | 26.3 | 1.0 |

Note:
1. The test video is data/videos/tokyo_walk.mp4 at its original scale. Preprocessing with resize + CenterCrop to 256 resolution was previously tested on a larger test set, and the results showed consistent trends. Since high-resolution, original-size video has proven to be the most challenging case for a 3D VAE, only this one video was tested, sampled at 8 fps and evaluated over the first 10 seconds.
2. The evaluation code is in models/evaluation.py for reference. It has not been run since the inference code was modified, and FID/FVD scores depend on the model, the original image preprocessing, the inference hyperparameters, and the randomness introduced by sampling from the encoder's KL posterior, so these scores cannot be reproduced exactly. They can nonetheless serve as a reference for designing one's own benchmark.
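For anyone building such a benchmark, PSNR is the easiest of the four metrics to reproduce from scratch (FVD and FID additionally require pretrained feature extractors). A minimal sketch, assuming reconstructions and sources are tensors scaled to [0, 1]:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for tensors scaled to [0, max_val]."""
    mse = torch.mean((pred.float() - target.float()) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

# Per-frame scores for a (T, C, H, W) reconstruction against its source:
# scores = [psnr(rec[t], src[t]) for t in range(src.shape[0])]
```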


### Inference

#### About Encoder hyperparameter configuration

* Slice frames spatially using: `--max_size --min_size`
* Slice the video temporally using: `--encoder_init_window --encoder_window`

If GPU VRAM is insufficient, these sizes can be lowered to at most 256–512; the sketch below illustrates the tiling arithmetic.
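To see how the spatial bound translates into memory savings: a frame whose edges exceed the cap is processed as a grid of smaller tiles. A rough illustration of the arithmetic (the flag name `--max_size` is taken from above; the repository's exact slicing logic may differ):

```python
import math

def tile_grid(height: int, width: int, max_size: int = 512):
    """Number of spatial tiles when each tile edge is capped at max_size."""
    return math.ceil(height / max_size), math.ceil(width / max_size)

# A 4K frame with max_size=512 becomes a 5 x 8 grid, i.e. 40 tiles,
# each of which needs far less VRAM than the full frame.
print(tile_grid(2160, 3840))  # (5, 8)
```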

#### About Decoder hyperparameter configuration

* Slice the latent spatially using: `--min_latent_size --max_latent_size`

  (By default, GPU VRAM needs to exceed 28 GB. If VRAM is insufficient, these sizes can be lowered to at most 32 = 256p/8 to 64 = 512p/8.)

* Slice the latent temporally using: `--decoder_init_window`

  5 frames of latent correspond to 17 frames of the original video. The formula is: latent_T_dim = (frame_T_dim - 1) / temporal_downsample_num + 1; in this model, temporal_downsample_num = 4.
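The mapping is easy to wrap in a helper; a small sketch mirroring the formula above (the function names are illustrative, not part of the repository's API):

```python
def frames_to_latents(t_frames: int, down: int = 4) -> int:
    """The first frame is encoded alone; each later group of `down` frames
    becomes one latent frame, hence the odd-frame-count requirement."""
    assert (t_frames - 1) % down == 0, "frame count must be 1 + 4k"
    return (t_frames - 1) // down + 1

def latents_to_frames(t_latents: int, down: int = 4) -> int:
    return (t_latents - 1) * down + 1

print(frames_to_latents(17))  # 5
print(latents_to_frames(5))   # 17
```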

#### Use AsymmetricMagVitV2 in your own code

```python
import torch
import torchvision.transforms as transforms

from models.vae import AsymmetricMagVitV2Pipline
from models.utils.image_op import read_video

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16
encoder_init_window = 17  # number of video frames encoded in the first window
input_path = "data/videos/tokyo_walk.mp4"

# map RGB frames from [0, 1] to [-1, 1]
img_transform = transforms.Compose([transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
video, last_frame_id = read_video(input_path, encoder_init_window, sample_fps=8,
                                  img_transform=img_transform, start=0)

model = AsymmetricMagVitV2Pipline.from_pretrained("BornFly/AsymmetricMagVitV2_16z").to(device, dtype).eval()

# encode the first window, then decode it back to pixels
init_z, reg_log = model.encode(video.to(device, dtype), encoder_init_window,
                               is_init_image=True, return_reg_log=True, unregularized=False)
init_samples = model.decode(init_z.to(device, dtype), decode_batch_size=1, is_init_image=True)
```

#### High-resolution video encoding and decoding, greater than 720p (spatial-temporal slicing)

##### 1. Encode & decode video

```shell
python infer_vae.py --input_path data/videos/tokyo_walk.mp4 --model_path vae_16z_bf16_hf --output_folder vae_eval_out/vae_4z_bf16_hf_videos > infer_vae_video.log 2>&1
```

##### 2. Encode & decode image

```shell
python infer_vae.py --input_path data/images --model_path vae_16z_bf16_hf --output_folder vae_eval_out/vae_4z_bf16_hf_images > infer_vae_image.log 2>&1
```

### TODO List

<p id="1"></p>

* Reproduce Sora: a 16-channel VAE integrated with SD3. Due to limited computational resources, the focus is on generating 1K high-definition dynamic wallpapers.

* Reproduce VideoPoet: support multimodal joint representation. Due to limited computational resources, the focus is on generating music videos.

### Contact Us

<p id="2"></p>

1. For any code-related questions, feel free to contact me via email: bornflyborntochange@outlook.com.
2. To join the WeChat group, scan the QR code below; if it has expired, add this account as a friend first to get an invitation.
<img src="data/assets/mmqrcode1720196270375.png" alt="group QR code" width="30%"/>


### Reference

<p id="3"></p>

- Open-Sora-Plan: https://github.com/PKU-YuanGroup/Open-Sora-Plan
- Open-Sora: https://github.com/hpcaitech/Open-Sora
- SVD: https://github.com/Stability-AI/generative-models