Image-to-Video

yongshun-zhang committed · verified · commit d2297d5 · parent: 56d238f

Update README.md
# MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

<div align="center">

**[Yongshun Zhang](mailto:yongshun.zhang@shopee.com)\* · [Zhongyi Fan](mailto:zhongyi.fan@shopee.com)\* · [Yonghang Zhang](mailto:yonghang.zhang@shopee.com) · [Zhangzikang Li](mailto:zhangzikang.li@shopee.com) · [Weifeng Chen](mailto:weifeng.chen@shopee.com)**

**[Zhongwei Feng](mailto:zhongwei.feng@shopee.com) · [Chaoyue Wang](mailto:daniel.wang@shopee.com)† · [Peng Hou](mailto:peng.hou@shopee.com)† · [Anxiang Zeng](mailto:zeng0118@e.ntu.edu.sg)†**

LLM Team, Shopee Pte. Ltd.

\* Equal contribution · † Corresponding authors

[![arXiv](https://img.shields.io/badge/arXiv-Paper-red)](#)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/MUG-V/MUG-V-inference)
[![Inference Code](https://img.shields.io/badge/Code-Inference-yellow)](https://github.com/Shopee-MUG/MUG-V)
[![Training Code](https://img.shields.io/badge/Code-Training-green)](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training)
[![License](https://img.shields.io/badge/License-Apache%202.0-orange.svg)](https://github.com/Shopee-MUG/MUG-V/blob/main/LICENSE)

</div>

## Overview

**MUG-V 10B** is a large-scale video generation system built by the **Shopee Multimodal Understanding and Generation (MUG) team**. The core generator is a Diffusion Transformer (DiT) with ~10B parameters trained with a flow-matching objective. We release the complete stack:
- [**Model weights**](https://huggingface.co/MUG-V/MUG-V-inference)
- [**Megatron-Core-based training code**](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training)
- [**Inference pipelines**](https://github.com/Shopee-MUG/MUG-V) for **video generation** and **video enhancement**

To our knowledge, this is the first publicly available large-scale video-generation training framework that leverages **Megatron-Core** for high training efficiency (e.g., high GPU utilization and strong MFU) and near-linear multi-node scaling. By open-sourcing the end-to-end framework, we aim to accelerate progress and lower the barrier to scalable modeling of the visual world.

## 🔥 Latest News

* Oct 21, 2025: 👋 We are excited to announce the release of the **MUG-V 10B** [technical report](#). We welcome feedback and discussion.
* Oct 21, 2025: 👋 We released our Megatron-LM–based [training framework](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training), addressing the key challenges of training billion-parameter video generators.
* Oct 21, 2025: 👋 We released the **MUG-V video enhancement** [inference code](https://github.com/Shopee-MUG/MUG-V/tree/main/mug_enhancer) and [weights](https://huggingface.co/MUG-V/MUG-V-inference) (based on WAN-2.1 1.3B).
* Oct 21, 2025: 👋 We released the **MUG-V 10B** ([e-commerce edition](https://github.com/Shopee-MUG/MUG-V)) inference code and weights.
* Apr 25, 2025: 👋 We submitted our model to the [VBench-I2V leaderboard](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard); at submission time, MUG-V ranked **#3**.

## ✅ Roadmap

- **MUG-V Model & Inference**
  - [x] Inference code for MUG-V 10B
  - [x] Checkpoints: e-commerce edition (Image-to-Video generation, I2V)
  - [ ] Checkpoints: general-domain edition
  - [ ] Diffusers integration
  - [ ] Text prompt rewriter
- **MUG-V Training**
  - [x] Data preprocessing tools (video encoding, text encoding)
  - [x] Pre-training framework on Megatron-LM
- **MUG-V Video Enhancer**
  - [x] Inference code
  - [x] Lightweight I2V model checkpoints (trained from the WAN-2.1 1.3B T2V model)
  - [x] MUG-V Video Enhancer LoRA checkpoints (based on the I2V model above)
  - [ ] Training code

---

## ✨ Features

- **High-quality video generation:** up to **720p**, **3–5 s** clips
- **Image-to-Video (I2V):** conditioning on a reference image
- **Flexible aspect ratios:** 16:9, 4:3, 1:1, 3:4, 9:16
- **Advanced architecture:** **MUG-DiT (≈10B parameters)** with flow-matching training
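
To make the aspect-ratio support concrete, here is a small sketch of how a fixed ~720p pixel budget could map to each supported ratio. The multiple-of-16 rounding and the resulting bucket sizes are illustrative assumptions, not the repository's actual resolution table.

```python
# Hypothetical resolution buckets: distribute a ~720p pixel budget
# (1280*720 pixels) across each supported aspect ratio, rounding
# each side down to a multiple of 16. Illustrative only.
import math

PIXEL_BUDGET = 1280 * 720  # ~720p

def bucket(ar_w: int, ar_h: int, multiple: int = 16) -> tuple[int, int]:
    scale = math.sqrt(PIXEL_BUDGET / (ar_w * ar_h))
    w = int(ar_w * scale) // multiple * multiple
    h = int(ar_h * scale) // multiple * multiple
    return w, h

for ar in ["16:9", "4:3", "1:1", "3:4", "9:16"]:
    w, h = bucket(*map(int, ar.split(":")))
    print(f"{ar}: {w}x{h}")
```

For 16:9 this recovers the familiar 1280×720; the other ratios trade width for height at a roughly constant pixel count.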

## 📋 Table of Contents

- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [API Reference](#-api-reference)
- [Video Enhancement](#-video-enhancement)
- [Model Architecture](#-model-architecture)
- [License](#-license)

## 🛠️ Installation

### Prerequisites

- **Python** ≥ 3.8 (tested with 3.10)
- **CUDA** 12.1
- **NVIDIA GPU** with ≥ **24 GB** VRAM (for 10B-parameter inference)

### Install Dependencies

```bash
# Clone the repository
git clone https://github.com/Shopee-MUG/MUG-V
cd MUG-V

# Create an isolated environment and install required packages
conda create -n mug_infer python=3.10 -y
conda activate mug_infer
pip3 install -e .
pip3 install flash_attn --no-build-isolation
```

### Download Models

Download the pre-trained models with `huggingface-cli`:

```bash
# Install huggingface-cli
pip3 install -U "huggingface_hub[cli]"

# Log in to your account with an access token
huggingface-cli login

# Download the MUG-DiT-10B pretrained models for inference
huggingface-cli download MUG-V/MUG-V-inference --local-dir ./pretrained_ckpt/MUG-V-inference

# Download the Wan2.1-T2V VAE & text encoder models for the enhancer
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./pretrained_ckpt/Wan2.1-T2V-1.3B
```

Update the VAE and DiT model paths in your configuration at `infer_pipeline.MUGDiTConfig`.
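
For example (a hedged sketch, assuming `MUGDiTConfig` accepts these fields as keyword arguments; the exact checkpoint filenames inside the download directory are intentionally omitted):

```python
from infer_pipeline import MUGDiTConfig

# Assumed usage: point the config at the directories created by
# `huggingface-cli download` above; adjust the subpaths to the actual
# checkpoint files in your local copy.
config = MUGDiTConfig(
    vae_pretrained_path="./pretrained_ckpt/MUG-V-inference",
    dit_pretrained_path="./pretrained_ckpt/MUG-V-inference",
)
```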

## 🚀 Quick Start

### Basic Usage

```python
from infer_pipeline import MUGDiTPipeline, MUGDiTConfig

# Initialize the pipeline
config = MUGDiTConfig()
pipeline = MUGDiTPipeline(config)

# Generate a video
output_path = pipeline.generate(
    prompt="This video describes a young woman standing in a minimal studio with a warm beige backdrop, wearing a white cropped top with thin straps and a matching long tiered skirt. She faces the camera directly with a relaxed posture, and the lighting is bright and even, giving the scene a soft, neutral appearance. The background features a seamless beige wall and a smooth floor with no additional props, creating a simple setting that keeps attention on the outfit. The main subject is a woman with long curly hair, dressed in a white spaghetti-strap crop top and a flowing ankle-length skirt with gathered tiers. She wears black strappy sandals and is positioned centrally in the frame, standing upright with her arms resting naturally at her sides. The camera is stationary and straight-on, capturing a full-body shot that keeps her entire figure visible from head to toe. She appears to hold a calm expression while breathing steadily, occasionally shifting her weight slightly from one foot to the other. There may be a subtle tilt of the head or a gentle adjustment of her hands, but movements remain small and unhurried throughout the video. The background remains static with no visible changes, and the framing stays consistent for a clear view of the outfit details.",
    reference_image_path="./assets/sample.png",
    output_path="outputs/sample.mp4"
)

print(f"Video saved to: {output_path}")
```

### Command Line Usage

```bash
python3 infer_pipeline.py
```

The script uses the default configuration and generates a video from the built-in prompt and reference image.

### Video Enhancement

Use the MUG-V Video Enhancer to improve videos generated by MUG-DiT-10B (e.g., detail restoration, temporal consistency). Details can be found in the [./mug_enhancer folder](mug_enhancer/).

```bash
cd ./mug_enhancer
python3 predict.py \
    --task predict \
    --output_path ./video_outputs \
    --num_frames 105 \
    --height 1280 \
    --width 720 \
    --fps=20 \
    --video_path "../outputs/" \
    --val_dataset_path "../assets/sample.csv" \
    --lora_rank 256
```

The output video will be saved to `./mug_enhancer/video_outputs/<year-month-day_hour:minute:second>/0000_generated_video_enhance.mp4`.

## 🔧 API Reference

### MUGDiTConfig

Configuration class for the MUG-DiT-10B pipeline.

**Parameters:**
- `device` (str): Device to run inference on. Default: "cuda"
- `dtype` (torch.dtype): Data type for computations. Default: torch.bfloat16
- `vae_pretrained_path` (str): Path to the VAE model checkpoint
- `dit_pretrained_path` (str): Path to the DiT model checkpoint
- `resolution` (str): Video resolution. Currently only "720p" is supported
- `video_length` (str): Video duration. Options: "3s", "5s"
- `video_ar_ratio` (str): Aspect ratio. Options: "16:9", "4:3", "1:1", "3:4", "9:16"
- `cfg_scale` (float): Classifier-free guidance scale. Default: 4.0
- `num_sampling_steps` (int): Number of denoising steps. Default: 25
- `fps` (int): Frames per second. Default: 30
- `aes_score` (float): Aesthetic score for prompt enhancement. Default: 6.0
- `seed` (int): Random seed for reproducibility. Default: 42
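
The `cfg_scale` parameter applies standard classifier-free guidance, combining the conditional and unconditional model outputs at each denoising step. A minimal NumPy sketch of that combination (illustrative, not the repository's implementation):

```python
import numpy as np

def cfg_combine(uncond: np.ndarray, cond: np.ndarray, scale: float) -> np.ndarray:
    # Classifier-free guidance: move the prediction away from the
    # unconditional branch and toward the conditional one.
    return uncond + scale * (cond - uncond)

# scale = 1.0 recovers the purely conditional prediction; larger values
# (MUG-DiT defaults to 4.0) strengthen prompt adherence.
```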

### MUGDiTPipeline

Main inference pipeline class.

#### Methods

##### `__init__(config: MUGDiTConfig)`
Initialize the pipeline with the given configuration.

##### `generate(prompt=None, reference_image_path=None, output_path=None, seed=None, **kwargs) -> str`
Generate a video from a text prompt and a reference image.

**Parameters:**
- `prompt` (str, optional): Text description of the desired video
- `reference_image_path` (str|Path, optional): Path to the reference image
- `output_path` (str|Path, optional): Output video file path
- `seed` (int, optional): Random seed for this generation

**Returns:**
- `str`: Path to the generated video file

## 🏗️ Model Architecture

MUGDiT adopts the latent diffusion transformer paradigm with a rectified flow-matching objective:

```mermaid
flowchart TB
    A[Input Video] --> B[VideoVAE Encoder]
    B --> C["Latent 8×8×8 compression"]

    C --> D["3D Patch 2×2×2 Embedding"]
    D --> E["MUGDiT Blocks ×56"]

    F[Text] --> G[Caption Encoder]
    G --> E

    H[Timestep] --> E
    I[Size Info] --> E

    E --> J[Output Projection]
    J --> K[VideoVAE Decoder]
    K --> L[Generated Video]

    style E fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style C fill:#fff4e6,stroke:#ff9800,stroke-width:2px
    style L fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
```
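
To make the compression numbers concrete: the 8×8×8 VAE plus 2×2×2 patching shrinks each axis by 16×, so one token covers 16×16×16 = 4096 voxels. A small sketch, with an example clip shape assumed for illustration:

```python
# Token count after 8x8x8 VAE compression plus 2x2x2 patching:
# each axis shrinks by 16x in total.
def num_tokens(frames: int, height: int, width: int) -> int:
    # Assume dimensions divide evenly by the combined 16x reduction.
    assert frames % 16 == height % 16 == width % 16 == 0
    return (frames // 16) * (height // 16) * (width // 16)

# Assumed example shape: a 32-frame 720x1280 clip.
print(num_tokens(32, 720, 1280))  # 2 * 45 * 80 = 7200 tokens
```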

#### Core Components

1. **VideoVAE**: 8×8×8 spatiotemporal compression
   - Encoder: 3D convolutions + temporal attention
   - Decoder: 3D transposed convolutions + temporal upsampling
   - KL regularization for a stable latent space

2. **3D Patch Embedding**: Converts video latents to tokens
   - Patch size: 2×2×2 (non-overlapping)
   - Final compression: ~2048× vs. pixel space

3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
   - Extends 2D RoPE to handle the temporal dimension
   - Frequency-based encoding for spatiotemporal modeling

4. **Conditioning Modules**:
   - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
   - **Timestep Embedder**: Embeds the diffusion timestep via sinusoidal encoding
   - **Size Embedder**: Handles variable-resolution inputs

5. **MUGDiT Transformer Block**:

```mermaid
graph LR
    A[Input] --> B[AdaLN]
    B --> C[Self-Attn<br/>QK-Norm]
    C --> D[Gate]
    D --> E1[+]
    A --> E1

    E1 --> F[LayerNorm]
    F --> G[Cross-Attn<br/>QK-Norm]
    G --> E2[+]
    E1 --> E2

    E2 --> I[AdaLN]
    I --> J[MLP]
    J --> K[Gate]
    K --> E3[+]
    E2 --> E3

    E3 --> L[Output]

    M[Timestep<br/>Size Info] -.-> B
    M -.-> I

    N[Text] -.-> G

    style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
    style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
    style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
    style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
```

6. **Rectified Flow Scheduler**:
   - More stable training than DDPM
   - Logit-normal timestep sampling
   - Linear interpolation between noise and data
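
The scheduler bullets above can be sketched in a few lines: a logit-normal timestep draw, linear interpolation between data and noise, and the constant-velocity target used by the flow-matching loss. This is a generic rectified-flow sketch, not the repository's scheduler code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_timestep(batch: int) -> np.ndarray:
    """Logit-normal sampling: sigmoid of a standard normal draw."""
    return 1.0 / (1.0 + np.exp(-rng.standard_normal(batch)))

def interpolate(x0: np.ndarray, noise: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Linear interpolation: x0 at t=0, pure noise at t=1."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast over data dims
    return (1.0 - t) * x0 + t * noise

def velocity_target(x0: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Flow-matching regression target: the constant velocity along the path."""
    return noise - x0
```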

## Citation

If you find our work helpful, please cite us:

```bibtex
@article{mug-v2025,
  title   = {MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
  author  = {Yongshun Zhang and Zhongyi Fan and Yonghang Zhang and Zhangzikang Li and Weifeng Chen and Zhongwei Feng and Chaoyue Wang and Peng Hou and Anxiang Zeng},
  journal = {arXiv preprint},
  year    = {2025}
}
```

## 📄 License

This project is licensed under the Apache License 2.0; see the [LICENSE](https://github.com/Shopee-MUG/MUG-V/blob/main/LICENSE) file for details.

**Note**: This is a research project. Generated content may not always be perfect. Please use responsibly and in accordance with applicable laws and regulations.

## Acknowledgements

We thank the contributors to [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), and other repositories for their open research.