linwf committed
Commit 3262367 · verified · 1 Parent(s): 2fadbb0

Update README.md

Files changed (1): README.md +84 -12
README.md CHANGED
@@ -4,12 +4,55 @@ license: apache-2.0
 
  # ContentV: Efficient Training of Video Generation Models with Limited Compute
 
- This project presents ContentV, a novel framework that accelerates DiT-based video generation through three key innovations:
- - A minimalist model design that enables effective reuse of pre-trained image generation models for video synthesis
- - A comprehensive exploration of a multi-stage, efficient training strategy based on Flow Matching
- - A low-cost Reinforcement Learning with Human Feedback (RLHF) approach that further enhances generation quality without the need for additional human annotations.
 
- ## Quickstart
 
  #### Recommended PyTorch Version
 
@@ -18,32 +61,61 @@ This project presents ContentV, a novel framework that accelerates DiT-based vid
 
  #### Installation
 
- ```sh
  git clone https://github.com/bytedance/ContentV.git
- pip3 install -r ContentV/requirements.txt
  ```
 
  #### T2V Generation
 
- ```sh
- cd ContentV
  ## For GPU
  python3 demo.py
  ## For NPU
  USE_ASCEND_NPU=1 python3 demo.py
  ```
 
- ## Todo List
  - [x] Inference code and checkpoints
  - [ ] Training code of RLHF
 
- ## License
  This code repository and part of the model weights are licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Please note that:
  - The MM-DiT is derived from [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large) and trained with video samples. This Stability AI model is licensed under the [Stability AI Community License](https://stability.ai/community-license-agreement), Copyright © Stability AI Ltd. All Rights Reserved.
  - The video VAE from [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) is licensed under the [Apache 2.0 License](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/blob/main/LICENSE.txt).
 
- ## Acknowledgement
  * [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large)
  * [Wan2.1](https://github.com/Wan-Video/Wan2.1)
  * [Diffusers](https://github.com/huggingface/diffusers)
  * [HuggingFace](https://huggingface.co)
 
  # ContentV: Efficient Training of Video Generation Models with Limited Compute
 
+ <div align="center">
+ <p align="center">
+ <a href="https://contentv.github.io">
+ <img
+ src="https://img.shields.io/badge/Demo-Project%20Page-0A66C2?logo=googlechrome&logoColor=blue"
+ alt="Project Page"
+ />
+ </a>
+ <!-- <a>
+ <img
+ src="https://img.shields.io/badge/Tech%20Report-ArXiv-red?logo=arxiv&logoColor=red"
+ alt="Tech Report"
+ />
+ </a> -->
+ <a href="https://huggingface.co/ByteDance/ContentV-8B">
+ <img
+ src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface&logoColor=yellow"
+ alt="Model"
+ />
+ </a>
+ <a href="https://github.com/bytedance/ContentV">
+ <img
+ src="https://img.shields.io/badge/Code-GitHub-orange?logo=github&logoColor=white"
+ alt="Code"
+ />
+ </a>
+ <a href="https://www.apache.org/licenses/LICENSE-2.0">
+ <img
+ src="https://img.shields.io/badge/License-Apache%202.0-5865F2?logo=apache&logoColor=purple"
+ alt="License"
+ />
+ </a>
+ </p>
+ </div>
 
+ This project presents *ContentV*, an efficient framework for accelerating the training of DiT-based video generation models through three key innovations:
+
+ - A minimalist architecture that maximizes reuse of pre-trained image generation models for video synthesis
+ - A systematic multi-stage training strategy that leverages flow matching for training efficiency (a minimal sketch of the objective appears just below)
+ - A cost-effective reinforcement learning with human feedback (RLHF) framework that improves generation quality without requiring additional human annotations
+
+ Our 8B model achieves state-of-the-art results (85.14 on VBench) after only four weeks of training on 256 NPUs with 64 GB of memory each.
+
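+ To make the flow-matching bullet concrete, here is a minimal PyTorch sketch of the rectified-flow-style objective that flow matching refers to. It is an illustrative outline only, not ContentV's released training code; `model`, `x1`, and `cond` are hypothetical placeholders.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def flow_matching_loss(model, x1, cond):
+     """Flow-matching loss: regress the velocity field along the
+     straight path from Gaussian noise x0 to a clean latent x1."""
+     x0 = torch.randn_like(x1)                      # noise endpoint
+     t = torch.rand(x1.shape[0], device=x1.device)  # uniform timesteps in [0, 1)
+     t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dims
+     xt = (1.0 - t_) * x0 + t_ * x1                 # linear interpolation
+     v_target = x1 - x0                             # constant target velocity
+     v_pred = model(xt, t, cond)                    # DiT predicts the velocity
+     return F.mse_loss(v_pred, v_target)
+ ```
+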
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/bytedance/ContentV/refs/heads/main/assets/demo.jpg" width="100%">
+ <img src="https://raw.githubusercontent.com/bytedance/ContentV/refs/heads/main/assets/arch.png" width="100%">
+ </div>
+
+ ## ⚡ Quickstart
 
  #### Recommended PyTorch Version
 
  #### Installation
 
+ ```bash
  git clone https://github.com/bytedance/ContentV.git
+ cd ContentV
+ pip3 install -r requirements.txt
  ```
 
  #### T2V Generation
 
+ ```bash
  ## For GPU
  python3 demo.py
  ## For NPU
  USE_ASCEND_NPU=1 python3 demo.py
  ```
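 
+ If you prefer to script generation rather than run `demo.py`, the sketch below shows one plausible shape for it. It is an unverified assumption that the `ByteDance/ContentV-8B` checkpoint loads through diffusers' generic `DiffusionPipeline`; the call signature and the NPU handling simply mirror common diffusers usage and the `USE_ASCEND_NPU` switch above, and `demo.py` remains the authoritative entry point.
+
+ ```python
+ # Hypothetical scripted usage -- demo.py in the repo is the supported path.
+ import os
+ import torch
+ from diffusers import DiffusionPipeline
+ from diffusers.utils import export_to_video
+
+ # Mirror the USE_ASCEND_NPU switch: torch_npu registers the "npu" device.
+ if os.environ.get("USE_ASCEND_NPU") == "1":
+     import torch_npu  # noqa: F401  (Ascend PyTorch plugin)
+     device = "npu"
+ else:
+     device = "cuda"
+
+ # Assumption: the checkpoint resolves to a diffusers-compatible pipeline.
+ pipe = DiffusionPipeline.from_pretrained(
+     "ByteDance/ContentV-8B", torch_dtype=torch.bfloat16
+ ).to(device)
+
+ frames = pipe(prompt="A corgi surfing a wave at sunset").frames[0]
+ export_to_video(frames, "contentv_sample.mp4", fps=24)
+ ```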
 
+ ## 📊 VBench
+
+ | Model | Total Score | Quality Score | Semantic Score | Human Action | Scene | Dynamic Degree | Multiple Objects | Appearance Style |
+ |----------------------|--------|-------|-------|-------|-------|-------|-------|-------|
+ | Wan2.1-14B | 86.22 | 86.67 | 84.44 | 99.20 | 61.24 | 94.26 | 86.59 | 21.59 |
+ | **ContentV (Long)** | 85.14 | 86.64 | 79.12 | 96.80 | 57.38 | 83.05 | 71.41 | 23.02 |
+ | Goku† | 84.85 | 85.60 | 81.87 | 97.60 | 57.08 | 76.11 | 79.48 | 23.08 |
+ | Open-Sora 2.0 | 84.34 | 85.40 | 80.12 | 95.40 | 52.71 | 71.39 | 77.72 | 22.98 |
+ | Sora† | 84.28 | 85.51 | 79.35 | 98.20 | 56.95 | 79.91 | 70.85 | 24.76 |
+ | **ContentV (Short)** | 84.11 | 86.23 | 75.61 | 89.60 | 44.02 | 79.26 | 74.58 | 21.21 |
+ | EasyAnimate 5.1 | 83.42 | 85.03 | 77.01 | 95.60 | 54.31 | 57.15 | 66.85 | 23.06 |
+ | Kling 1.6† | 83.40 | 85.00 | 76.99 | 96.20 | 55.57 | 62.22 | 63.99 | 20.75 |
+ | HunyuanVideo | 83.24 | 85.09 | 75.82 | 94.40 | 53.88 | 70.83 | 68.55 | 19.80 |
+ | CogVideoX-5B | 81.61 | 82.75 | 77.04 | 99.40 | 53.20 | 70.97 | 62.11 | 24.91 |
+ | Pika-1.0† | 80.69 | 82.92 | 71.77 | 86.20 | 49.83 | 47.50 | 43.08 | 22.26 |
+ | VideoCrafter-2.0 | 80.44 | 82.20 | 73.42 | 95.00 | 55.29 | 42.50 | 40.66 | 25.13 |
+ | AnimateDiff-V2 | 80.27 | 82.90 | 69.75 | 92.60 | 50.19 | 40.83 | 36.88 | 22.42 |
+ | OpenSora 1.2 | 79.23 | 80.71 | 73.30 | 85.80 | 42.47 | 47.22 | 58.41 | 23.89 |
+
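+ As a sanity check on the table, the total scores are consistent with a roughly 4:1 weighting of the quality and semantic sub-scores; for example, for ContentV (Long), 0.8 × 86.64 + 0.2 × 79.12 ≈ 85.14. (This weighting is inferred from the numbers above, not quoted from the VBench paper.)
+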
+ ## ✅ Todo List
  - [x] Inference code and checkpoints
  - [ ] Training code of RLHF
 
+ ## 🧾 License
  This code repository and part of the model weights are licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Please note that:
  - The MM-DiT is derived from [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large) and trained with video samples. This Stability AI model is licensed under the [Stability AI Community License](https://stability.ai/community-license-agreement), Copyright © Stability AI Ltd. All Rights Reserved.
  - The video VAE from [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) is licensed under the [Apache 2.0 License](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/blob/main/LICENSE.txt).
 
+ ## ❤️ Acknowledgement
  * [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large)
  * [Wan2.1](https://github.com/Wan-Video/Wan2.1)
  * [Diffusers](https://github.com/huggingface/diffusers)
  * [HuggingFace](https://huggingface.co)
+
+ ## 🔗 Citation
+
+ ```bibtex
+ @article{contentv2025,
+   title  = {ContentV: Efficient Training of Video Generation Models with Limited Compute},
+   author = {Bytedance Douyin Content Team},
+   year   = {2025},
+ }
+ ```