Update README.md
<!-- ## **HunyuanVideo** -->

<p align="center">
<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/logo.png" height=100>
</p>

# HunyuanVideo: A Systematic Framework For Large Video Generation Model Training

…using a large language model, and used as the condition. Gaussian noise and condition are taken as input; our generation model generates an output latent, which is decoded to images or videos through the 3D VAE decoder.
<p align="center">
<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/overall.png" height=300>
</p>

## 🎉 **HunyuanVideo Key Features**

…tokens and feed them into subsequent Transformer blocks for effective multimodal information fusion. This design captures complex interactions between visual and semantic information, enhancing overall model performance.
<p align="center">
<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/backbone.png" height=350>
</p>

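The fusion step described above can be sketched as follows. This is a toy illustration, not the repository's code: the hidden size, token counts, and the simplified single-head attention are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                          # hypothetical hidden size
video_tokens = rng.standard_normal((256, d))    # e.g. latent video patches
text_tokens = rng.standard_normal((32, d))      # e.g. refined text embeddings

# Single-stream phase: concatenate both modalities into one sequence
# so full self-attention can mix visual and semantic information.
tokens = np.concatenate([video_tokens, text_tokens], axis=0)

# Toy full self-attention over the joint sequence (single head, no projections).
scores = tokens @ tokens.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
fused = weights @ tokens

print(tokens.shape, fused.shape)  # (288, 64) (288, 64)
```

Every output position attends over all 288 joint tokens, so each fused video token can draw on the text tokens and vice versa.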
### **MLLM Text Encoder**

Some previous text-to-video models typically use pretrained CLIP and T5-XXL as text encoders… Compared with CLIP, MLLM has demonstrated superior ability in image detail description and complex reasoning; (iii) MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, MLLM is based on causal attention, while T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner to enhance the text features.
<p align="center">
<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/text_encoder.png" height=275>
</p>

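The causal-versus-bidirectional distinction above can be made concrete with attention masks. A minimal sketch, assuming a tiny sequence; this only illustrates which positions may attend to which, not any actual HunyuanVideo layer.

```python
import numpy as np

n = 5  # toy prompt length

# Causal mask (MLLM-style): token i may only attend to positions <= i,
# so early tokens never see later context.
causal = np.tril(np.ones((n, n), dtype=bool))

# Bidirectional mask (token-refiner-style): every token attends to every
# other token, letting later prompt words inform earlier features.
bidirectional = np.ones((n, n), dtype=bool)

print(int(causal.sum()), int(bidirectional.sum()))  # 15 25
```

The causal mask permits only the lower-triangular 15 of the 25 position pairs; the refiner's bidirectional mask permits all of them, which is the extra context the text features gain.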
### **3D VAE**

HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space, and channel to 4, 8, and 16, respectively. This can significantly reduce the number of tokens for the subsequent diffusion transformer model, allowing us to train videos at the original resolution and frame rate.
<p align="center">
<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/3dvae.png" height=150>
</p>

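Back-of-the-envelope math for the ratios above: the clip size is a made-up example, and CausalConv3D's special handling of the first frame is ignored for simplicity.

```python
# Hypothetical input clip: 64 frames at 720x1280 (illustrative numbers only).
T, H, W = 64, 720, 1280

# Compression from the text: 4x in time, 8x in each spatial dimension
# (the latent itself has 16 channels).
t, h, w = T // 4, H // 8, W // 8

pixels = T * H * W
latent_positions = t * h * w
print(t, h, w, pixels // latent_positions)  # 16 90 160 256
```

Each spatial-temporal latent position stands in for 4 × 8 × 8 = 256 pixels, which is what keeps the diffusion transformer's token count tractable at native resolution and frame rate.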
### **Prompt Rewrite**