---
license: mit
pipeline_tag: image-to-video
tags:
- VAE
- Video-Generation
---

# Reducio-VAE Model Card

<!-- Provide a quick summary of what the model is/does. -->
This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It compresses a $T\times H\times W$ video down to a $\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}$ latent, i.e. a $4096\times$ spatio-temporal downsampling.
It is part of [Reducio-DiT](https://arxiv.org/abs/xxxx), a video generation method. The codebase is available [here](https://github.com/microsoft/Reducio-VAE).
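
For intuition, the short sketch below works out what that factor means for a hypothetical 16-frame 1024x1024 clip. It is plain shape arithmetic, not a call into the model; the 16 latent channels are taken from the $\|z\|$ column in the results table below.

```python
import torch

# Shape bookkeeping only: how much Reducio-VAE shrinks an example clip.
# The 16 latent channels match the |z| column in the results table below.
video = torch.randn(1, 3, 16, 1024, 1024)   # (batch, rgb, frames, height, width)
_, _, T, H, W = video.shape

latent_shape = (1, 16, T // 4, H // 32, W // 32)
downsample_factor = 4 * 32 * 32              # per-channel spatio-temporal reduction

print(latent_shape)        # (1, 16, 4, 32, 32)
print(downsample_factor)   # 4096
```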

## Model Details

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [GitHub Repository](https://github.com/microsoft/Reducio-VAE)
- **Paper:** [arXiv](https://arxiv.org/abs/xxxx)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

The common use scenario is described [here](https://github.com/microsoft/Reducio-VAE/Readme.md).

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model is typically used to support training a video diffusion model: once videos are encoded into the latent space with this VAE, a diffusion model can be trained directly on the extremely compressed latents, as sketched below.
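
As a rough, non-authoritative sketch of that workflow, the snippet below trains a toy denoiser on a batch of pretend pre-computed latents. The latent tensor, the stand-in `Conv3d` "denoiser", and the noise schedule are placeholders for illustration only; this is not the Reducio-DiT training code.

```python
import torch
import torch.nn as nn

# Toy sketch: train a diffusion-style denoiser on pre-computed Reducio-VAE latents.
# Everything here (latent shape, denoiser, noise schedule) is a placeholder.
latents = torch.randn(256, 16, 4, 32, 32)               # pretend pre-encoded clips
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(latents), batch_size=8, shuffle=True
)

denoiser = nn.Conv3d(16, 16, kernel_size=3, padding=1)  # stand-in for a DiT
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

for (z,) in loader:
    noise = torch.randn_like(z)
    t = torch.rand(z.shape[0], 1, 1, 1, 1)              # random noise level in [0, 1)
    noisy = (1 - t) * z + t * noise                      # simple linear interpolation
    loss = torch.nn.functional.mse_loss(denoiser(noisy), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
```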

## Usage Example

Use the code below to get started with the model.
```python
import torch
```
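
The snippet above is only a stub. Below is a rough, non-authoritative sketch of what an encode/decode round trip could look like; the `load_reducio_vae` helper and the `encode`/`decode` method names are placeholders, and the actual loading and inference API is documented in the [GitHub repository](https://github.com/microsoft/Reducio-VAE).

```python
import torch

# Placeholder sketch: none of the names below are the real Reducio-VAE API.
# Follow https://github.com/microsoft/Reducio-VAE for actual loading/inference.

video = torch.randn(1, 3, 16, 256, 256)   # (batch, rgb, frames, height, width)
content_frame = video[:, :, 0]             # a single frame used as the content condition

# Hypothetical interface, shown only to illustrate the expected tensor shapes:
# vae = load_reducio_vae().eval().cuda()                   # build per the repo docs
# with torch.no_grad():
#     z = vae.encode(video.cuda(), content_frame.cuda())   # -> (1, 16, 4, 8, 8)
#     recon = vae.decode(z, content_frame.cuda())          # -> (1, 3, 16, 256, 256)
```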

## Results

<!-- This section describes the evaluation protocols and provides the results. -->

Metrics on the 1K Pexels validation set and UCF-101 (PSNR/SSIM: higher is better; LPIPS/rFVD: lower is better):

| Method | Downsample Factor | $\|z\|$ | PSNR | SSIM | LPIPS | rFVD (Pexels) | rFVD (UCF-101) |
|---|---|---|---|---|---|---|---|
| SD2.1-VAE | $1\times8\times8$ | 4 | 29.23 | 0.82 | 0.09 | 25.96 | 21.00 |
| SDXL-VAE | $1\times8\times8$ | 16 | 30.54 | 0.85 | 0.08 | 19.87 | 23.68 |
| OmniTokenizer | $4\times8\times8$ | 8 | 27.11 | 0.89 | 0.07 | 23.88 | 30.52 |
| OpenSora-1.2 | $4\times8\times8$ | 16 | 30.72 | 0.85 | 0.11 | 60.88 | 67.52 |
| Cosmos Tokenizer | $8\times8\times8$ | 16 | 30.84 | 0.74 | 0.12 | 29.44 | 22.06 |
| Cosmos Tokenizer | $8\times16\times16$ | 16 | 28.14 | 0.65 | 0.18 | 77.87 | 119.37 |
| Reducio-VAE | $4\times32\times32$ | 16 | 35.88 | 0.94 | 0.05 | 17.88 | 65.17 |

## Citation

**BibTeX:**

```bibtex
@article{tian2024reducio,
  title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents},
  author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:xxxx},
  year={2024}
}
```