<h1 align="center">Video-GPT via Next Clip Diffusion</h1>

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2505.12489-b31b1b.svg)](https://arxiv.org/abs/2505.12489)
[![Project Page](https://img.shields.io/badge/Project_Page-Video--GPT-green)](https://zhuangshaobin.github.io/Video-GPT.github.io/)
[![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/GrayShine/Video-GPT)

</div>

<h4 align="center">
  <p>
    <a href="#1-news">News</a> |
    <a href="#2-overview">Overview</a> |
    <a href="#3-methodology">Methodology</a> |
    <a href="#4-what-can-video-gpt-do">Capabilities</a> |
    <a href="#5-quick-start">Quick Start</a> |
    <a href="#6-pre-training">Pre-Training</a> |
    <a href="#acknowledgement">Acknowledgement</a> |
    <a href="#license">License</a> |
    <a href="#citation">Citation</a>
  </p>
</h4>


## 1. News
- 2025-5-16: ✨✨ We release our four-stage progressive training code (supporting Huawei NPUs and NVIDIA GPUs). See [LVM/script/train](LVM/script/train) and [LVM/train](LVM/train) for detailed training information.
- 2025-5-16: ✨✨ We release the inference code in [LVM/script/inference](LVM/script/inference) and [LVM/inference](LVM/inference).
- 2025-5-16: 🔥🔥 We release the first version of Video-GPT. Model weights: [Video-GPT](https://huggingface.co/GrayShine/Video-GPT)

## 2. Overview

Video-GPT is a self-supervised generative pre-trained video model that treats video as a new language for visual world modeling via next clip diffusion. It is designed to be simple, flexible, and easy to follow. We provide [inference code](#5-quick-start) so that everyone can explore more functionalities of Video-GPT.

Previous works on visual generation rely heavily on supervisory signals from the textual modality (e.g., Sora, WanX, HunyuanVideo, MovieGen). However, vision, as a natural ability of human beings, emerged even earlier than language. We therefore believe that the information in the visual modality itself is sufficient for a model to learn to model the world.

![demo](https://github.com/zhuangshaobin/Video-GPT/raw/main/imgs/teaser.png)

In addition, compared with previous architectures that carry many special designs for diffusion models (e.g., UNet, DiT, MM-DiT), we adopt the simplest vanilla transformer architecture. On the one hand, this is more conducive to exploring scaling laws in the future; on the other hand, it is more convenient for the community to follow up on.

Due to limited resources, Video-GPT still has room for improvement. We will continue to optimize it, and we hope it inspires more universal video generative foundation models.

If you have any questions, ideas, or interesting tasks you want Video-GPT to accomplish, feel free to discuss with us: hahahahaha@sjtu.edu.cn. We welcome any feedback to help us improve the model.

## 3. Methodology

You can see the details in our [paper](https://arxiv.org/abs/2505.12489). For intuition, a toy sketch of the training objective follows.

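The sketch below shows what a next clip diffusion objective can look like: earlier clips stay clean as context, the next clip is noised, and a vanilla transformer learns to denoise it conditioned on that context. Everything here (`VanillaTransformer`, `next_clip_diffusion_loss`, the shapes, and the toy noise schedule) is an illustrative assumption, not the repository's actual code.

```python
import torch
import torch.nn as nn

class VanillaTransformer(nn.Module):
    """Illustrative stand-in for the model: a plain encoder over clip-token sequences."""
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, tokens):
        return self.head(self.encoder(tokens))

def next_clip_diffusion_loss(model, clip_tokens):
    """clip_tokens: (batch, n_clips, tokens_per_clip, dim) latent tokens, one group per clip.
    Past clips stay clean; the last clip is noised, and the model predicts the added noise."""
    b, n, t, d = clip_tokens.shape
    context = clip_tokens[:, :-1].reshape(b, (n - 1) * t, d)  # clean history
    target = clip_tokens[:, -1]                               # next clip to predict
    noise = torch.randn_like(target)
    alpha = torch.rand(b, 1, 1)                               # toy per-sample noise level
    noisy = alpha.sqrt() * target + (1 - alpha).sqrt() * noise
    seq = torch.cat([context, noisy], dim=1)                  # [history | noisy next clip]
    pred = model(seq)[:, -t:]                                 # read out next-clip positions
    return nn.functional.mse_loss(pred, noise)                # noise-prediction loss

# Toy usage: 2 videos, 4 clips of 16 tokens each, 256-dim latents.
model = VanillaTransformer()
loss = next_clip_diffusion_loss(model, torch.randn(2, 4, 16, 256))
loss.backward()
```

Timestep conditioning, the real noise scheduler, and the video tokenizer are all omitted here; see [LVM/train](LVM/train) for the actual implementation.
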
## 4. What Can Video-GPT Do?

Video-GPT is a self-supervised generative pre-trained video model that you can use to perform various tasks. It can be applied directly to video prediction, or fine-tuned with very little data for tasks such as video object segmentation and image animation. Its intermediate-layer features are also well suited to representation learning (see the probe sketch after this list).

Here are illustrations of Video-GPT's capabilities:
- Powerful world modeling capabilities.
![demo](https://github.com/zhuangshaobin/Video-GPT/raw/main/imgs/phys_visual.png)
- Continuing training of the pre-trained Video-GPT on class-to-video and text-to-video tasks achieves better results than training from scratch.
![demo](https://github.com/zhuangshaobin/Video-GPT/raw/main/imgs/c2v_gen.png)
![demo](https://github.com/zhuangshaobin/Video-GPT/raw/main/imgs/t2v_gen.png)
- By fine-tuning with a small amount of data, Video-GPT also achieves good generalization on downstream tasks.
![demo](https://github.com/zhuangshaobin/Video-GPT/raw/main/imgs/anim.png)
![demo](https://github.com/zhuangshaobin/Video-GPT/raw/main/imgs/seg.png)

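To make the representation learning use concrete, here is a hedged sketch of a linear probe over frozen Video-GPT features. `extract_features` is a hypothetical placeholder; the repository's actual feature-extraction hook may differ.

```python
import torch
import torch.nn as nn

def extract_features(video_clips):
    """Hypothetical stand-in: in practice this would run Video-GPT and return
    hidden states from an intermediate transformer block, pooled per video."""
    return torch.randn(video_clips.shape[0], 256)  # (batch, feature_dim)

# Linear probe: freeze the backbone, train only a classifier on its features.
probe = nn.Linear(256, 10)                       # e.g., 10 action classes
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

videos = torch.randn(8, 4, 16, 256)              # toy batch of clip latents
labels = torch.randint(0, 10, (8,))

with torch.no_grad():                            # backbone stays frozen
    feats = extract_features(videos)
loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()
opt.step()
```
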
## 5. Quick Start

### Using Video-GPT
Install via GitHub:
```bash
git clone https://github.com/zhuangshaobin/Video-GPT.git
cd Video-GPT
```
If you are using NVIDIA GPUs, then
```bash
bash env_nv.sh
```
If you are using Huawei NPUs, then
```bash
bash env_hw.sh
```

Then you can use the following command to take the first few frames of a video and perform video prediction.
If you are using NVIDIA GPUs, then
```bash
bash LVM/script/inference/inference_nv.sh
```
If you are using Huawei NPUs, then
```bash
bash LVM/script/inference/inference_hw.sh
```
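
Conceptually, the prediction these scripts perform can be pictured as an autoregressive rollout: sample the next clip from noise conditioned on the clean clips so far, append it, and repeat. The sketch below is a toy illustration under that assumption; `denoise_next_clip`, its update rule, and all shapes are hypothetical, not the API of [LVM/inference](LVM/inference).

```python
import torch

def denoise_next_clip(model, context_tokens, steps=50):
    """Toy sampler: start the next clip from noise and iteratively denoise it
    conditioned on the clean context clips (the update rule is a placeholder,
    not a real diffusion scheduler)."""
    b, _, t, d = context_tokens.shape
    clip = torch.randn(b, t, d)
    for _ in range(steps):
        seq = torch.cat([context_tokens.flatten(1, 2), clip], dim=1)
        eps = model(seq)[:, -t:]          # predicted noise at next-clip positions
        clip = clip - eps / steps         # placeholder update step
    return clip

def predict_video(model, first_clips, n_future=4):
    """Autoregressive rollout: repeatedly append the newly generated clip to the context."""
    context = first_clips                 # (batch, n_clips, tokens_per_clip, dim)
    for _ in range(n_future):
        new_clip = denoise_next_clip(model, context)
        context = torch.cat([context, new_clip.unsqueeze(1)], dim=1)
    return context

# Toy usage with a trivial stand-in model over token sequences.
model = lambda seq: torch.zeros_like(seq)
rollout = predict_video(model, torch.randn(1, 2, 16, 256))
print(rollout.shape)  # torch.Size([1, 6, 16, 256]): 2 context clips + 4 predicted
```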


## 6. Pre-Training
We provide our four-stage training scripts to train or fine-tune Video-GPT.
If you are using NVIDIA GPUs, then
```bash
# stage-1 pre-training
bash LVM/script/train/pretrain_stage1_nv.sh
# stage-2 pre-training
bash LVM/script/train/pretrain_stage2_nv.sh
# stage-3 pre-training
bash LVM/script/train/pretrain_stage3_nv.sh
# stage-4 pre-training
bash LVM/script/train/pretrain_stage4_nv.sh
```

If you are using Huawei NPUs, then
```bash
# stage-1 pre-training
bash LVM/script/train/pretrain_stage1_hw.sh
# stage-2 pre-training
bash LVM/script/train/pretrain_stage2_hw.sh
# stage-3 pre-training
bash LVM/script/train/pretrain_stage3_hw.sh
# stage-4 pre-training
bash LVM/script/train/pretrain_stage4_hw.sh
```

## Acknowledgement
We built our repository on top of [OmniGen](https://github.com/VectorSpaceLab/OmniGen), which also did a great job!

## License
This repo is licensed under the [MIT License](LICENSE).

## Citation
If you find this repository useful, please consider giving it a star ⭐ and citing it:
```
@article{zhuang2025videogptclipdiffusion,
  title={Video-GPT via Next Clip Diffusion},
  author={Shaobin Zhuang and Zhipeng Huang and Ying Zhang and Fangyikang Wang and Canmiao Fu and Binxin Yang and Chong Sun and Chen Li and Yali Wang},
  journal={arXiv preprint arXiv:2505.12489},
  year={2025}
}
```