---
base_model:
- VideoCrafter/VideoCrafter2
datasets:
- nkp37/OpenVid-1M
- TempoFunk/webvid-10M
license: gpl-3.0
pipeline_tag: text-to-video
library_name: diffusers
---

# Advanced Text-to-Video Diffusion Models

This repository contains the model from the paper [AMD-Hummingbird: Towards an Efficient Text-to-Video Model](https://huggingface.co/papers/2503.18559). Hummingbird is a lightweight text-to-video (T2V) framework that prunes existing models (such as VideoCrafter2) and enhances visual quality through visual feedback learning. It aims to improve the efficiency of T2V generation, making it more suitable for deployment on resource-limited devices while preserving high-quality video generation.

## Table of Contents
- [Advanced text-to-video Diffusion Models](#advanced-text-to-video-diffusion-models)
- [Key Features](#key-features)
- [8-Steps Results](#8-steps-results)
- [Checkpoint](#checkpoint)
- [Installation](#installation)
  - [conda](#conda)
  - [docker](#docker)
- [Data Processing](#data-processing)
  - [VQA](#vqa)
  - [Remove Dolly Zoom Videos](#remove-dolly-zoom-videos)
- [Training](#training)
  - [Model Distillation](#model-distillation)
  - [Acceleration Training](#acceleration-training)
- [Inference](#inference)
- [License](#license)


## Key Features

⚡️ This repository provides training recipes for the AMD efficient text-to-video models, which are designed for high performance and efficiency. The training process includes two key steps:

* Distillation and Pruning: We distill and prune the popular text-to-video model [VideoCrafter2](https://github.com/AILab-CVC/VideoCrafter), reducing the parameters to a compact 945M while maintaining competitive performance.

* Optimization with T2V-Turbo: We apply the [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) method on the distilled model to reduce inference steps and further enhance model quality.

This implementation is released to promote further research and innovation in the field of efficient text-to-video generation, optimized for AMD Instinct accelerators.
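The pruning idea behind the first step can be illustrated with a toy magnitude-pruning sketch. This is an illustration only, using a plain Python list of weights; the actual Hummingbird pipeline prunes structural blocks of VideoCrafter2 and then distills, which this sketch does not reproduce:

```python
def magnitude_prune(weights, keep_ratio):
    """Zero out the smallest-magnitude weights, keeping roughly `keep_ratio` of them.

    Toy illustration of magnitude-based pruning, not the paper's exact criterion.
    """
    k = max(1, int(len(weights) * keep_ratio))
    # Magnitude threshold below which weights are dropped.
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

# Keep the 60% largest-magnitude weights, zeroing the rest.
pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01, -0.7], keep_ratio=0.6)
```

In a real model the zeroed (or removed) weights shrink the network, after which distillation from the original teacher recovers quality.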


![Vbench performance](GIFs/vbench.png)




## 8-Steps Results

| Prompt                                       | Generated Video          | Prompt                                         | Generated Video        |
|--------------------------------------------|--------------------------|----------------------------------------------|------------------------|
| A cute happy Corgi playing in park, sunset, pixel. | ![GIF](GIFs/A_cute_happy_Corgi_playing_in_park,_sunset,_pixel_.gif) | A cute happy Corgi playing in park, sunset, animated style. | ![GIF](GIFs/A_cute_happy_Corgi_playing_in_park,_sunset,_animated_style.gif) |
| A quiet beach at dawn and the waves gently lapping. | ![GIF](GIFs/A_quiet_beach_at_dawn_and_the_waves_gently_lapping.gif) | A cute teddy bear, dressed in a red silk outfit, stands in a vibrant street, chinese new year. | ![GIF](GIFs/A_cute_teddy_bear,_dressed_in_a_red_silk_outfit,_stands_in_a_vibrant_street,_chinese_new_year..gif) |
| A cat DJ at a party.                       | ![GIF](GIFs/A_cat_DJ_at_a_party.gif)     | A 3D model of a 1800s victorian house.        | ![GIF](GIFs/A_3D_model_of_a_1800s_victorian_house..gif)    |
| A cute raccoon playing guitar in the beach. | ![GIF](GIFs/A_cute_raccoon_playing_guitar_in_the_beach.gif)     | A cute raccoon playing guitar in the forest. | ![GIF](GIFs/A_cute_raccoon_playing_guitar_in_the_forest.gif)|
| A sandcastle being eroded by the incoming tide. | ![GIF](GIFs/A_sandcastle_being_eroded_by_the_incoming_tide.gif) | An astronaut flying in space, in cyberpunk style. | ![GIF](GIFs/An_astronaut_flying_in_space,_in_cyberpunk_style.gif) |
| A drone flying over a snowy forest.       | ![GIF](GIFs/a_drone_flying_over_a_snowy_forest.gif)       | A ghost ship navigating through a sea under a moon. | ![GIF](GIFs/A_ghost_ship_navigating_through_a_sea_under_a_moon.gif) |


# Checkpoint
Our pretrained checkpoint can be downloaded from [Hugging Face](https://huggingface.co/amd/AMD-Hummingbird-T2V/tree/main).
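For scripted downloads, one option is the `huggingface_hub` client (a sketch, not part of this repository's tooling; the `local_dir` path is an arbitrary choice, and `huggingface-cli download` works equally well):

```python
REPO_ID = "amd/AMD-Hummingbird-T2V"  # repository linked above

def download_checkpoint(local_dir="checkpoints"):
    # Import deferred so the sketch can be read without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    # Fetches a full snapshot of the repo (all checkpoint files) into local_dir.
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

if __name__ == "__main__":
    print(download_checkpoint())
```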

# Installation
We train both the 0.9B and 0.7B T2V models on AMD Instinct™ MI250 accelerators and evaluate them on MI250, MI300, Radeon™ RX 7900 XT, and the Radeon™ 880M (Ryzen™ AI 9 365) under Ubuntu (kernel 6.8.0-51-generic).

## conda
```
conda create -n AMD_Hummingbird python=3.10
conda activate AMD_Hummingbird
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/rocm6.1
pip install -r requirements.txt
```
For the ROCm build of flash-attention, install it from this [repository](https://github.com/ROCm/flash-attention):
```
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
python setup.py install
```
Building flash-attention from source takes about 1.5 hours.

## docker
First, pull the image with `docker pull`:
```
docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
```
Second, start a container from the image with `docker run`, for example:
```
docker run \
        -v "$(pwd):/workspace" \
        --device=/dev/kfd \
        --device=/dev/dri \
        -it \
        --network=host \
        --name hummingbird \
        rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
```
Once inside the container, install the remaining dependencies with `pip`:
```
pip install -r requirements.txt
```

# Data Processing

## VQA
```
cd data_pre_process/DOVER
sh run.sh
```
This produces a quality score table for all videos; sort by score and remove the low-scoring videos.
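The filtering step might look like the following sketch. The `path` and `score` column names are assumptions for illustration; the actual columns produced by DOVER's `run.sh` may differ:

```python
import csv

def filter_low_quality(score_csv, threshold):
    """Return paths of videos whose quality score meets the threshold.

    Assumes the score table is a CSV with `path` and `score` columns
    (hypothetical names; adjust to the real output of run.sh).
    """
    keep = []
    with open(score_csv, newline="") as f:
        for row in csv.DictReader(f):
            if float(row["score"]) >= threshold:
                keep.append(row["path"])
    return keep
```

The surviving paths can then be used to build the filtered training file list.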
## Remove Dolly Zoom Videos
```
cd data_pre_process/VBench
sh run.sh 
```
Using the motion smoothness scores in the resulting CSV file, remove the low-scoring videos.
# Training

## Model Distillation

```
sh configs/training_512_t2v_v1.0/run_distill.sh
```


## Acceleration Training

```
cd acceleration/t2v-turbo

# for the 0.7B model
sh train_07B.sh

# for the 0.9B model
sh train_09B.sh
```


# Inference

```
# for 0.7B model
python inference_command_config_07B.py

# for 0.9B model
python inference_command_config_09B.py
```

# License
Copyright (c) 2024 Advanced Micro Devices, Inc. All Rights Reserved.