---
license: mit
pipeline_tag: image-to-video
tags:
- VAE
- Video-Generation
---

# Reducio-VAE Model Card

<!-- Provide a quick summary of what the model is/does. -->
This model is a 3D VAE that encodes a video into a compact latent space conditioned on a content frame. It maps a $T\times H\times W$ video to a $\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}$ latent, i.e. $4096\times$ spatio-temporal downsampling.
It is part of [Reducio-DiT](https://arxiv.org/abs/xxxx), a video generation method. The codebase is available [here](https://github.com/microsoft/Reducio-VAE).


## Model Details

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [GitHub Repository](https://github.com/microsoft/Reducio-VAE)
- **Paper:** [arXiv](https://arxiv.org/abs/xxxx)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

A common usage scenario is described [here](https://github.com/microsoft/Reducio-VAE/Readme.md).

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model is typically used to support training a video diffusion model: after encoding your data into the latent space with this model, you can train your own diffusion model on the extremely compressed latents.
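That workflow can be sketched generically. Only the latent shape (16 channels, $\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}$) is taken from this card; the toy denoiser and noise schedule below are illustrative assumptions, not Reducio-DiT's actual training code.

```python
import torch

# Generic sketch: train a denoiser on precomputed VAE latents.
# Latent shape (16, T/4, H/32, W/32) follows this model card, e.g.
# 16-frame 256x256 clips -> (16, 4, 8, 8). The Conv3d denoiser and
# linear noise schedule are toy placeholders, not Reducio-DiT.
latents = torch.randn(8, 16, 4, 8, 8)            # batch of precomputed latents
denoiser = torch.nn.Conv3d(16, 16, kernel_size=3, padding=1)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

for step in range(2):                            # a couple of toy steps
    t = torch.rand(latents.size(0), 1, 1, 1, 1)  # per-clip noise level
    noise = torch.randn_like(latents)
    noisy = (1 - t) * latents + t * noise        # simple linear interpolation
    loss = torch.nn.functional.mse_loss(denoiser(noisy), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the latents are $4096\times$ smaller than the pixels, each training step touches far less data than pixel-space diffusion would.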

## Usage Example

Use the code below to get started with the model; see the [GitHub repository](https://github.com/microsoft/Reducio-VAE) for the complete loading and inference code.
```python
import torch
```
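As a self-contained sanity check on the compression factor stated above (no Reducio-VAE code involved; the `latent_shape` helper is hypothetical):

```python
# Sketch of the compression arithmetic only; the real encoder is in the
# Reducio-VAE repository. A T x H x W video maps to a T/4 x H/32 x W/32
# latent with 16 channels, i.e. 4 * 32 * 32 = 4096x downsampling.
def latent_shape(T, H, W, z_channels=16):
    assert T % 4 == 0 and H % 32 == 0 and W % 32 == 0
    return (z_channels, T // 4, H // 32, W // 32)

z = latent_shape(16, 1024, 1024)
downsampling = (16 * 1024 * 1024) // (z[1] * z[2] * z[3])
print(z, downsampling)  # (16, 4, 32, 32) 4096
```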


## Results

<!-- This section describes the evaluation protocols and provides the results. -->

Reconstruction metrics on a 1K-video Pexels validation set and UCF-101 (PSNR/SSIM: higher is better; LPIPS/rFVD: lower is better):

| Method | Downsample Factor | $\|z\|$ | PSNR | SSIM | LPIPS | rFVD (Pexels) | rFVD (UCF-101) |
|---|---|---|---|---|---|---|---|
| SD2.1-VAE | $1\times8\times8$ | 4 | 29.23 | 0.82 | 0.09 | 25.96 | 21.00 |
| SDXL-VAE | $1\times8\times8$ | 16 | 30.54 | 0.85 | 0.08 | 19.87 | 23.68 |
| OmniTokenizer | $4\times8\times8$ | 8 | 27.11 | 0.89 | 0.07 | 23.88 | 30.52 |
| OpenSora-1.2 | $4\times8\times8$ | 16 | 30.72 | 0.85 | 0.11 | 60.88 | 67.52 |
| Cosmos Tokenizer | $8\times8\times8$ | 16 | 30.84 | 0.74 | 0.12 | 29.44 | 22.06 |
| Cosmos Tokenizer | $8\times16\times16$ | 16 | 28.14 | 0.65 | 0.18 | 77.87 | 119.37 |
| Reducio-VAE | $4\times32\times32$ | 16 | 35.88 | 0.94 | 0.05 | 17.88 | 65.17 |


## Citation

**BibTeX:**

```bibtex
@article{tian2024reducio,
  title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents},
  author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:xxxx},
  year={2024}
}
```