Upload README.md with huggingface_hub

README.md
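
For reference, a commit like this one is typically produced with the `huggingface_hub` Python client. A minimal sketch, assuming a local `README.md` and an already-authenticated session:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes a token from `huggingface-cli login` or the HF_TOKEN env var
api.upload_file(
    path_or_fileobj="README.md",   # local file to upload (assumed path)
    path_in_repo="README.md",      # destination inside the repo
    repo_id="Tencent/HunyuanVideo-PromptRewrite",
    repo_type="model",
    commit_message="Upload README.md with huggingface_hub",
)
```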

<!-- ## **HunyuanVideo** -->

<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/logo.png" height=100>
</p>

# HunyuanVideo: A Systematic Framework For Large Video Generation Model Training

-----

This repo contains the weights of the HunyuanVideo-PromptRewrite model. You can find more visualizations on our [project page](https://aivideo.hunyuan.tencent.com).

> [**HunyuanVideo: A Systematic Framework For Large Video Generation Model Training**](https://github.com/Tencent/HunyuanVideo/blob/main/assets/hunyuanvideo.pdf) <br>

Due to the limitations of the GitHub page, the video is compressed. The original video can be downloaded from [here](https://aivideo.hunyuan.tencent.com/download/HunyuanVideo/material/demo.mov).

## 🔥🔥🔥 News!!

* Dec 3, 2024: 🤗 We release the inference code and model weights of HunyuanVideo.

## 📑 Open-source Plan

- HunyuanVideo (Text-to-Video Model)
  - [x] Inference
  - [x] Checkpoints
  - [ ] Penguin Video Benchmark
  - [ ] Web Demo (Gradio)
  - [ ] ComfyUI
  - [ ] Diffusers
- HunyuanVideo (Image-to-Video Model)
  - [ ] Inference
  - [ ] Checkpoints

## Contents
- [HunyuanVideo: A Systematic Framework For Large Video Generation Model Training](#hunyuanvideo--a-systematic-framework-for-large-video-generation-model-training)
  - [🔥🔥🔥 News!!](#-news)
  - [📑 Open-source Plan](#-open-source-plan)
  - [Contents](#contents)
  - [**Abstract**](#abstract)
  - [**HunyuanVideo Overall Architecture**](#hunyuanvideo-overall-architecture)
  - [🎉 **HunyuanVideo Key Features**](#-hunyuanvideo-key-features)
    - [**Unified Image and Video Generative Architecture**](#unified-image-and-video-generative-architecture)
    - [**MLLM Text Encoder**](#mllm-text-encoder)
    - [**3D VAE**](#3d-vae)
    - [**Prompt Rewrite**](#prompt-rewrite)
  - [📈 Comparisons](#-comparisons)
  - [🔗 BibTeX](#-bibtex)
  - [Acknowledgements](#-acknowledgements)
---

## **Abstract**

[...]

## **HunyuanVideo Overall Architecture**

[...] using a large language model, and used as the condition. Gaussian noise and condition are introduced as input; our generative model generates an output latent, which is decoded to images or videos through the 3D VAE decoder.

<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/overall.png" height=300>
</p>
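
To make the data flow concrete, here is a minimal sketch of the sampling loop this paragraph describes. `text_encoder`, `dit`, and `vae` are placeholder names rather than HunyuanVideo's actual API, and the latent shape is illustrative only:

```python
import torch

def generate(prompt: str, text_encoder, dit, vae, steps: int = 50):
    """Hypothetical pipeline: condition + noise -> latent -> decoded video."""
    cond = text_encoder(prompt)               # text prompt encoded as the condition
    latent = torch.randn(1, 16, 33, 90, 160)  # Gaussian noise in the 3D VAE latent space
    for t in reversed(range(steps)):          # iterative denoising (sampler details omitted)
        latent = dit(latent, t, cond)
    return vae.decode(latent)                 # 3D VAE decoder maps latents back to pixels
```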

## 🎉 **HunyuanVideo Key Features**

### **Unified Image and Video Generative Architecture**

[...] tokens and feed them into subsequent Transformer blocks for effective multimodal information fusion. This design captures complex interactions between visual and semantic information, enhancing overall model performance.

<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/backbone.png" height=350>
</p>
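
A toy illustration of the single-stream idea referenced above: once text and video tokens are concatenated into one sequence, ordinary full self-attention mixes the two modalities. Sizes and module choices here are assumptions, not the model's real configuration:

```python
import torch

text_tokens = torch.randn(1, 77, 3072)     # (batch, text length, hidden); illustrative
video_tokens = torch.randn(1, 1024, 3072)  # (batch, video tokens, hidden); illustrative

block = torch.nn.TransformerEncoderLayer(d_model=3072, nhead=24, batch_first=True)
fused = block(torch.cat([video_tokens, text_tokens], dim=1))  # joint attention over both modalities
```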

### **MLLM Text Encoder**

Some previous text-to-video models typically use pretrained CLIP and T5-XXL as text encoders [...] (ii) Compared with CLIP, MLLM has demonstrated superior ability in image detail description and complex reasoning; (iii) MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping the text features pay more attention to key information. In addition, MLLM is based on causal attention, while T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner to enhance the text features.

<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/text_encoder.png" height=275>
</p>
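
As a rough sketch of what such a bidirectional token refiner can look like: take the per-token hidden states of the causal MLLM and pass them through an attention block with no causal mask. The dimensions below are placeholders, not the actual HunyuanVideo configuration:

```python
import torch

mllm_hidden = torch.randn(1, 77, 4096)  # per-token features from the causal-attention MLLM

# An encoder layer applies self-attention without a causal mask, so every token
# can attend to tokens on both sides: the "bidirectional" refinement step.
refiner = torch.nn.TransformerEncoderLayer(d_model=4096, nhead=32, batch_first=True)
refined_text_features = refiner(mllm_hidden)
```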

### **3D VAE**

HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space, and channel to 4, 8, and 16, respectively. This significantly reduces the number of tokens for the subsequent diffusion transformer model, allowing us to train videos at their original resolution and frame rate.

<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/3dvae.png" height=150>
</p>
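
For intuition, the arithmetic implied by these ratios for an example 129-frame 720p clip (the input size is an assumption, and we assume the causal VAE keeps the first frame separate in time):

```python
frames, height, width = 129, 720, 1280  # example input, not a required size

t_lat = (frames - 1) // 4 + 1           # temporal ratio 4 -> 33 latent frames (causal first frame)
h_lat, w_lat = height // 8, width // 8  # spatial ratio 8 -> 90 x 160 latent grid
channels = 16                           # latent channel count

print(t_lat, h_lat, w_lat, t_lat * h_lat * w_lat)  # 33 90 160 475200 latent positions
```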

### **Prompt Rewrite**

To address the variability in the linguistic style and length of user-provided prompts, we fine-tune the [Hunyuan-Large model](https://github.com/Tencent/Tencent-Hunyuan-Large) as our prompt rewrite model to adapt the original user prompt to a model-preferred prompt.

We provide two rewrite modes: Normal mode and Master mode, which can be called using different prompts. The Normal mode is designed to enhance the video generation model's comprehension of user intent, facilitating a more accurate interpretation of the instructions provided. The Master mode enhances the description of aspects such as composition, lighting, and camera movement, which leans towards generating videos with higher visual quality. However, this emphasis may occasionally result in the loss of some semantic details.

The Prompt Rewrite Model can be directly deployed and inferred using the [Hunyuan-Large original code](https://github.com/Tencent/Tencent-Hunyuan-Large). We release the weights of the Prompt Rewrite Model [here](https://huggingface.co/Tencent/HunyuanVideo-PromptRewrite).
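
For orientation, a hedged sketch of loading the released weights with `transformers` and requesting a rewrite. The system prompt below is a placeholder, not the official Normal/Master mode template; consult the Hunyuan-Large repo for the real usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tencent/HunyuanVideo-PromptRewrite"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

messages = [
    # Placeholder instruction; NOT the official rewrite-mode prompt.
    {"role": "system", "content": "Rewrite the user's prompt for a text-to-video model."},
    {"role": "user", "content": "a cat runs in the garden"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```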