| | --- |
| | license: mit |
| | tags: |
| | - VAE |
| | - Video-Generation |
| | --- |
| | |
| | # Reducio-VAE Model Card |
| |
|
| | <!-- Provide a quick summary of what the model is/does. --> |
| | This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It maps a \\(T\times H\times W\\) video to a latent grid of size \\(\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}\\), i.e. 4x temporal and 32x spatial downsampling, for an overall factor of \\(4\times32\times32=4096\\). |
| | It is part of [Reducio-DiT](https://arxiv.org/abs/2411.13552), a video generation method. The codebase is available [here](https://github.com/microsoft/Reducio-VAE). |
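
The downsampling arithmetic can be checked with a few lines of Python. This is only a sketch: the helper names `latent_shape` and `downsample_factor` are hypothetical and not part of the Reducio codebase, and the requirement that clip dimensions divide evenly by the factors is an assumption made to keep the example simple.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=32):
    """Return the (T', H', W') latent grid for a clip.

    Assumes (for illustration only) that every dimension is exactly
    divisible by its downsampling factor.
    """
    if frames % t_factor or height % s_factor or width % s_factor:
        raise ValueError("clip dimensions must be divisible by the downsampling factors")
    return frames // t_factor, height // s_factor, width // s_factor


def downsample_factor(t_factor=4, s_factor=32):
    """Overall spatiotemporal downsampling factor (time x height x width)."""
    return t_factor * s_factor * s_factor


# Illustrative clip size: 16 frames at 256x256.
print(latent_shape(16, 256, 256))  # (4, 8, 8)
print(downsample_factor())         # 4096
```

The defaults mirror the 4x temporal and 32x spatial factors stated in the summary; the actual input constraints of the released model should be taken from the repository.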
| |
|
| |
|
| | ## Model Details |
| |
|
| | ### Model Sources |
| |
|
| | <!-- Provide the basic links for the model. --> |
| |
|
| | - **Repository:** [GitHub Repository](https://github.com/microsoft/Reducio-VAE) |
| | - **Paper:** [arXiv](https://arxiv.org/abs/2411.13552) |
| |
|
| | ## Uses |
| |
|
| | <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
| |
|
| | Common usage scenarios are described [here](https://github.com/microsoft/Reducio-VAE). |
| |
|
| | ### Direct Use |
| |
|
| | <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
| |
|
| | The model is typically used to support the training of video diffusion models. After encoding your data into the latent space with this model, you can train your own diffusion model on the extremely compressed latents. |
| |
|
| |
|
| | ## Results |
| |
|
| | <!-- This section describes the evaluation protocols and provides the results. --> |
| |
|
| |
|
| | Reconstruction metrics on the 1K Pexels validation set and on UCF-101: |
| |
|
| | |Method|Downsample Factor|\|z\||PSNR |SSIM |LPIPS |rFVD (Pexels)|rFVD (UCF-101)| |
| | |---------|---------------------|------------------|------------|--------------------|--------------|----------------|------------| |
| | |SD2.1-VAE|1\*8\*8|4|29.23|0.82|0.09|25.96|21.00| |
| | |SDXL-VAE|1\*8\*8|16|30.54|0.85|0.08|19.87|23.68| |
| | |OmniTokenizer|4\*8\*8|8|27.11|0.89|0.07|23.88|30.52| |
| | |OpenSora-1.2|4\*8\*8|16|30.72|0.85|0.11|60.88|67.52| |
| | |Cosmos Tokenizer|8\*8\*8|16|30.84|0.74|0.12|29.44|22.06| |
| | |Cosmos Tokenizer|8\*16\*16|16|28.14|0.65|0.18|77.87|119.37| |
| | |Reducio-VAE|4\*32\*32|16|35.88|0.94|0.05|17.88|65.17| |
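
To put the "Downsample Factor" column in perspective, the sketch below multiplies out each factor and counts how many latent positions a clip would occupy. The factors are copied from the table; the clip size (16 frames at 1024x1024) and the helper `latent_positions` are illustrative assumptions. The count ignores the latent channel dimension and Reducio's content-frame conditioning, so it is a rough comparison rather than an exact cost model.

```python
from math import prod

# Downsample factors (T, H, W) copied from the table above.
FACTORS = {
    "SD2.1-VAE": (1, 8, 8),
    "SDXL-VAE": (1, 8, 8),
    "OmniTokenizer": (4, 8, 8),
    "OpenSora-1.2": (4, 8, 8),
    "Cosmos Tokenizer (8*8*8)": (8, 8, 8),
    "Cosmos Tokenizer (8*16*16)": (8, 16, 16),
    "Reducio-VAE": (4, 32, 32),
}


def latent_positions(frames, height, width, factors):
    """Number of latent grid positions for a clip under the given factors."""
    t, h, w = factors
    return (frames // t) * (height // h) * (width // w)


# Hypothetical clip: 16 frames at 1024x1024.
for name, factors in FACTORS.items():
    overall = prod(factors)
    positions = latent_positions(16, 1024, 1024, factors)
    print(f"{name}: overall {overall}x, {positions} latent positions")
```

Under these assumptions, the 1\*8\*8 VAEs yield 262144 positions for the clip, while Reducio-VAE yields 4096.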
| |
|
| |
|
| | ## Citation |
| |
|
| | **BibTeX:** |
| |
|
| | ``` |
| | @article{tian2024reducio, |
| | title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents}, |
| | author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang}, |
| | journal={arXiv preprint arXiv:2411.13552}, |
| | year={2024} |
| | } |
| | ``` |