sbapan41 committed on
Commit
c24a30f
·
verified ·
1 Parent(s): ebead3a

Upload README.md with huggingface_hub

  1. README.md +137 -3
README.md CHANGED
@@ -1,3 +1,137 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ pipeline_tag: text-to-video
6
+ tags:
7
+ - video generation
8
+ library_name: diffusers
9
+ inference:
10
+ parameters:
11
+ num_inference_steps: 10
12
+ ---
13
+ # Text2Motion
14
+
15
+
16
+ -----
17
+
18
+ [**Text2Motion: Open and Advanced Large-Scale Video Generative Models**]("") <br>
19
+
20
+ In this repository, we present **Text2Motion**, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. **Text2Motion** offers these key features:
21
+ - 👍 **SOTA Performance**: **Text2Motion** consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
22
+ - 👍 **Supports Consumer-grade GPUs**: The T2V-1.3B model requires only 8.19 GB VRAM, making it compatible with almost all consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes (without optimization techniques like quantization). Its performance is even comparable to some closed-source models.
23
+ - 👍 **Multiple Tasks**: **Text2Motion** excels in Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, advancing the field of video generation.
24
+ - 👍 **Visual Text Generation**: **Text2Motion** is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
25
+ - 👍 **Powerful Video VAE**: **Text2Motion-VAE** delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.
26
+
27
+ This repository features our T2V-14B model, which establishes a new SOTA performance benchmark among both open-source and closed-source models. It demonstrates exceptional capabilities in generating high-quality visuals with significant motion dynamics. It is also the only video model capable of producing both Chinese and English text and supports video generation at both 480P and 720P resolutions.
28
+
29
+
30
+ ## 🔥 Latest News!!
31
+
32
+ * Feb 22, 2025: 👋 We've released the inference code and weights of Text2Motion.
33
+
34
+
35
+ ## 📑 Todo List
36
+ - Text2Motion Text-to-Video
37
+     - [x] Multi-GPU inference code of the 14B model
38
+     - [x] Checkpoints of the 14B model
39
+     - [x] Gradio demo
40
+     - [ ] Diffusers integration
41
+     - [ ] ComfyUI integration
42
+ - Text2Motion Image-to-Video
43
+     - [x] Multi-GPU inference code of the 14B model
44
+     - [x] Checkpoints of the 14B model
45
+     - [x] Gradio demo
46
+     - [ ] Diffusers integration
47
+     - [ ] ComfyUI integration
48
+
49
+
50
+ ## Quickstart
51
+
52
+ #### Installation
53
+ Clone the repo:
54
+ ```
55
+ git clone https://huggingface.co/sbapan41/Text2Motion
56
+ cd Text2Motion
57
+ ```
58
+
59
+ Install dependencies:
60
+ ```
61
+ # Ensure torch >= 2.4.0
62
+ pip install -r requirements.txt
63
+ ```
64
+
65
+
66
+ #### Model Download
67
+
68
+ | Models | Download Link | Notes |
69
+ | --------------|-------------------------------------------------------------------------------|-------------------------------|
70
+ | T2V-14B | 🤗 [Huggingface](https://huggingface.co/sbapan41/Text2Motion) | Supports both 480P and 720P |
71
+
72
+
73
+
74
+
75
+ Download models using 🤗 huggingface-cli:
76
+ ```
77
+ pip install "huggingface_hub[cli]"
78
+ huggingface-cli download sbapan41/Text2Motion --local-dir ./Text2Motion
79
+ ```
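The same download can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package installed above; `download_weights` is a hypothetical helper name and is not part of this repository.

```python
from pathlib import Path

REPO_ID = "sbapan41/Text2Motion"  # repository named in the download table


def download_weights(local_dir: str = "./Text2Motion") -> Path:
    """Download the full model repository into local_dir and return its path."""
    # Imported lazily so this module can be imported without the package installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)


if __name__ == "__main__":
    print(download_weights())
```

Keeping `local_dir` equal to `./Text2Motion` matches the `--ckpt_dir` value used in the generation commands.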
80
+ #### Run Text-to-Video Generation
81
+
82
+ This repository supports two Text-to-Video models (1.3B and 14B) and two resolutions (480P and 720P). The parameters and configurations for these models are as follows:
83
+
84
+ <table>
85
+ <thead>
86
+ <tr>
87
+ <th rowspan="2">Task</th>
88
+ <th colspan="2">Resolution</th>
89
+ <th rowspan="2">Model</th>
90
+ </tr>
91
+ <tr>
92
+ <th>480P</th>
93
+ <th>720P</th>
94
+ </tr>
95
+ </thead>
96
+ <tbody>
97
+ <tr>
98
+ <td>t2v-14B</td>
99
+ <td style="color: green;">✔️</td>
100
+ <td style="color: green;">✔️</td>
101
+ <td>Text2Motion-T2V-14B</td>
102
+ </tr>
103
+ <tr>
104
+ <td>t2v-1.3B</td>
105
+ <td style="color: green;">✔️</td>
106
+ <td style="color: red;">❌</td>
107
+ <td>Text2Motion-T2V-1.3B</td>
108
+ </tr>
109
+ </tbody>
110
+ </table>
111
+
112
+
113
+ ##### (1) Without Prompt Extension
114
+
115
+ To facilitate implementation, we will start with a basic version of the inference process that skips the [prompt extension](#2-using-prompt-extention) step.
116
+
117
+ - Single-GPU inference
118
+
119
+ ```
120
+ python generate.py --task 14B --size 1280*720 --ckpt_dir ./Text2Motion --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
121
+ ```
122
+
123
+ If you encounter OOM (out-of-memory) issues, add the `--offload_model True` and `--t5_cpu` options to the command above to reduce GPU memory usage.
124
+
125
+
126
+
127
+ - Multi-GPU inference using FSDP + xDiT USP
128
+
129
+ ```
130
+ pip install "xfuser>=0.4.1"
131
+ torchrun --nproc_per_node=8 generate.py --task 14B --size 1280*720 --ckpt_dir ./Text2Motion --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
132
+ ```
133
+
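The single-GPU and multi-GPU invocations above differ only in the launcher and a handful of flags. The sketch below assembles either command line from the flags shown in this README; `build_command` is a hypothetical helper, and setting `--ulysses_size` equal to the GPU count is an assumption taken from the 8-GPU example rather than a documented rule.

```python
import shlex


def build_command(prompt, num_gpus=1, low_memory=False,
                  task="14B", size="1280*720", ckpt_dir="./Text2Motion"):
    """Return the argv list for one generate.py run."""
    common = ["generate.py", "--task", task, "--size", size,
              "--ckpt_dir", ckpt_dir]
    if num_gpus > 1:
        # Multi-GPU: FSDP for the DiT and T5 models, as in the example above.
        # Assumption: the Ulysses size equals the number of GPUs.
        argv = ["torchrun", f"--nproc_per_node={num_gpus}", *common,
                "--dit_fsdp", "--t5_fsdp", "--ulysses_size", str(num_gpus)]
    else:
        argv = ["python", *common]
        if low_memory:
            # Memory-saving options named in the OOM note above.
            argv += ["--offload_model", "True", "--t5_cpu"]
    return argv + ["--prompt", prompt]


if __name__ == "__main__":
    # Print a shell-quoted command that can be pasted into a terminal.
    print(shlex.join(build_command("A cat boxing on a stage.", num_gpus=8)))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting problems around `1280*720` or the prompt.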
134
+
135
+ | Model | Dimension | Input Dimension | Output Dimension | Feedforward Dimension | Frequency Dimension | Number of Heads | Number of Layers |
136
+ |--------|-----------|-----------------|------------------|-----------------------|---------------------|-----------------|------------------|
137
+ | 14B | 5120 | 16 | 16 | 13824 | 256 | 40 | 40 |
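As a rough sanity check on the table, the listed dimensions can be turned into a parameter estimate. This is a back-of-the-envelope sketch only: the README does not describe the block layout, so the code assumes each layer has one self-attention, one cross-attention, and a two-matrix feedforward, and it ignores biases, normalization, and embedding weights.

```python
# Values copied from the 14B row of the table above.
DIM = 5120
FFN_DIM = 13824
NUM_HEADS = 40
NUM_LAYERS = 40


def estimate_block_params(dim=DIM, ffn_dim=FFN_DIM):
    """Weight-matrix parameters in one transformer block (assumed layout)."""
    self_attn = 4 * dim * dim   # Q, K, V and output projections
    cross_attn = 4 * dim * dim  # assumed: a second attention over text features
    ffn = 2 * dim * ffn_dim     # assumed: up- and down-projection matrices
    return self_attn + cross_attn + ffn


head_dim = DIM // NUM_HEADS
total = NUM_LAYERS * estimate_block_params()

if __name__ == "__main__":
    print(f"head dim: {head_dim}")
    print(f"approx. block parameters: {total / 1e9:.2f}B")
```

Under these assumptions the per-head dimension is 128 and the blocks hold about 14.05 billion weights, which is consistent with the "14B" model name.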