# ContentV: Efficient Training of Video Generation Models with Limited Compute

<div align="center">
<p align="center">
  <a href="https://contentv.github.io">
    <img src="https://img.shields.io/badge/Demo-Project Page-0A66C2?logo=googlechrome&logoColor=blue" alt="Project Page" />
  </a>
  <!-- <a>
    <img src="https://img.shields.io/badge/Tech Report-ArXiv-red?logo=arxiv&logoColor=red" alt="Tech Report" />
  </a> -->
  <a href="https://huggingface.co/ByteDance/ContentV-8B">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface&logoColor=yellow" alt="Model" />
  </a>
  <a href="https://github.com/bytedance/ContentV">
    <img src="https://img.shields.io/badge/Code-GitHub-orange?logo=github&logoColor=white" alt="Code" />
  </a>
  <a href="https://www.apache.org/licenses/LICENSE-2.0">
    <img src="https://img.shields.io/badge/License-Apache 2.0-5865F2?logo=apache&logoColor=purple" alt="License" />
  </a>
</p>
</div>

This project presents *ContentV*, an efficient framework for accelerating the training of DiT-based video generation models through three key innovations:

- A minimalist architecture that maximizes reuse of pre-trained image generation models for video synthesis
- A systematic multi-stage training strategy that leverages flow matching for improved efficiency (a minimal sketch of the objective appears after the figures below)
- A cost-effective reinforcement learning with human feedback (RLHF) framework that improves generation quality without requiring additional human annotations

Our 8B model achieves a state-of-the-art result (85.14 on VBench) in only four weeks of training on 256 × 64 GB NPUs.

<div align="center">
<img src="https://raw.githubusercontent.com/bytedance/ContentV/refs/heads/main/assets/demo.jpg" width="100%">
<img src="https://raw.githubusercontent.com/bytedance/ContentV/refs/heads/main/assets/arch.png" width="100%">
</div>
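
For readers unfamiliar with the flow-matching objective mentioned in the training strategy above, the snippet below sketches the core training loss in PyTorch. It is illustrative only: the `model(xt, t, cond)` signature, the latent shapes, and the rectified-flow (noise − data) velocity target are assumptions for exposition, not ContentV's released training code.

```python
import torch

def flow_matching_loss(model, x0, cond):
    """One illustrative flow-matching training step.

    x0:   clean video latents, e.g. shape (B, C, T, H, W)  (assumed layout)
    cond: text-conditioning embeddings passed through to the DiT
    """
    b = x0.shape[0]
    noise = torch.randn_like(x0)                 # endpoint x1 ~ N(0, I)
    t = torch.rand(b, device=x0.device)          # timestep t ~ U(0, 1)
    t_ = t.view(b, *([1] * (x0.dim() - 1)))      # broadcast over latent dims

    # Linear interpolation between data and noise defines the probability path.
    xt = (1.0 - t_) * x0 + t_ * noise

    # The model regresses the constant velocity of that path.
    v_target = noise - x0
    v_pred = model(xt, t, cond)

    return torch.nn.functional.mse_loss(v_pred, v_target)
```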
## ⚡ Quickstart
#### Recommended PyTorch Version
#### Installation

```bash
git clone https://github.com/bytedance/ContentV.git
cd ContentV
pip3 install -r requirements.txt
```
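
Before running the demo, it can help to confirm which accelerator backend your PyTorch build can see. The snippet below is an optional sanity check; the `torch_npu` import is an assumption for Ascend NPU setups and is not needed on GPU machines.

```python
# Optional sanity check of the installed backend before running demo.py.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import torch_npu  # noqa: F401  (Ascend NPU plugin; only present on NPU machines)
    print("NPU available:", torch.npu.is_available())
except ImportError:
    print("torch_npu not installed; assuming a CUDA GPU setup")
```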
#### T2V Generation

```bash
## For GPU
python3 demo.py
## For NPU
USE_ASCEND_NPU=1 python3 demo.py
```
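
`demo.py` is the supported entry point. If you would rather drive generation from Python, a Diffusers-style loading sketch is shown below; the pipeline class, argument names, and sampling parameters are assumptions based on the checkpoint being hosted on Hugging Face, so check `demo.py` for the actual usage.

```python
# Hypothetical programmatic usage -- demo.py remains the supported entry point.
# Assumes ByteDance/ContentV-8B loads as a standard Diffusers video pipeline;
# the exact pipeline class and call signature may differ.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "ByteDance/ContentV-8B", torch_dtype=torch.bfloat16
).to("cuda")

frames = pipe(
    prompt="A lighthouse on a cliff at sunset, slow aerial dolly shot",
    num_inference_steps=50,
).frames[0]

export_to_video(frames, "contentv_sample.mp4", fps=24)
```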
## 📊 VBench

| Model | Total Score | Quality Score | Semantic Score | Human Action | Scene | Dynamic Degree | Multiple Objects | Appear. Style |
|----------------------|-------------|---------------|----------------|--------------|-------|----------------|------------------|---------------|
| Wan2.1-14B | 86.22 | 86.67 | 84.44 | 99.20 | 61.24 | 94.26 | 86.59 | 21.59 |
| **ContentV (Long)** | 85.14 | 86.64 | 79.12 | 96.80 | 57.38 | 83.05 | 71.41 | 23.02 |
| Goku† | 84.85 | 85.60 | 81.87 | 97.60 | 57.08 | 76.11 | 79.48 | 23.08 |
| Open-Sora 2.0 | 84.34 | 85.40 | 80.12 | 95.40 | 52.71 | 71.39 | 77.72 | 22.98 |
| Sora† | 84.28 | 85.51 | 79.35 | 98.20 | 56.95 | 79.91 | 70.85 | 24.76 |
| **ContentV (Short)** | 84.11 | 86.23 | 75.61 | 89.60 | 44.02 | 79.26 | 74.58 | 21.21 |
| EasyAnimate 5.1 | 83.42 | 85.03 | 77.01 | 95.60 | 54.31 | 57.15 | 66.85 | 23.06 |
| Kling 1.6† | 83.40 | 85.00 | 76.99 | 96.20 | 55.57 | 62.22 | 63.99 | 20.75 |
| HunyuanVideo | 83.24 | 85.09 | 75.82 | 94.40 | 53.88 | 70.83 | 68.55 | 19.80 |
| CogVideoX-5B | 81.61 | 82.75 | 77.04 | 99.40 | 53.20 | 70.97 | 62.11 | 24.91 |
| Pika-1.0† | 80.69 | 82.92 | 71.77 | 86.20 | 49.83 | 47.50 | 43.08 | 22.26 |
| VideoCrafter-2.0 | 80.44 | 82.20 | 73.42 | 95.00 | 55.29 | 42.50 | 40.66 | 25.13 |
| AnimateDiff-V2 | 80.27 | 82.90 | 69.75 | 92.60 | 50.19 | 40.83 | 36.88 | 22.42 |
| OpenSora 1.2 | 79.23 | 80.71 | 73.30 | 85.80 | 42.47 | 47.22 | 58.41 | 23.89 |

## ✅ Todo List
- [x] Inference code and checkpoints
- [ ] Training code of RLHF

## 🧾 License
This code repository and part of the model weights are licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Please note that:
- MM DiT is derived from [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large) and trained on video samples. This Stability AI model is licensed under the [Stability AI Community License](https://stability.ai/community-license-agreement), Copyright © Stability AI Ltd. All Rights Reserved.
- The video VAE from [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) is licensed under the [Apache 2.0 License](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/blob/main/LICENSE.txt).

## ❤️ Acknowledgement
* [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large)
* [Wan2.1](https://github.com/Wan-Video/Wan2.1)
* [Diffusers](https://github.com/huggingface/diffusers)
* [HuggingFace](https://huggingface.co)

## 🔗 Citation

```bibtex
@article{contentv2025,
  title  = {ContentV: Efficient Training of Video Generation Models with Limited Compute},
  author = {Bytedance Douyin Content Team},
  year   = {2025},
}
```