|
|
--- |
|
|
license: apache-2.0 |
|
|
pipeline_tag: any-to-any |
|
|
library_name: diffusers |
|
|
tags: |
|
|
- many-for-many |
|
|
- diffusion-model |
|
|
- video-generation |
|
|
- image-generation |
|
|
- text-to-video |
|
|
- image-to-video |
|
|
- video-to-video |
|
|
- image-manipulation |
|
|
- video-manipulation |
|
|
--- |
|
|
|
|
|
# Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/MfM_logo.jpeg" alt="MfM-logo" width="50%"> |
|
|
</div> |
|
|
|
|
|
[📚 Paper](https://huggingface.co/papers/2506.01758) | [🌐 Project Page](https://leeruibin.github.io/MfMPage/) | [💻 Code](https://github.com/SandAI-org/MAGI-1) | [🤗 Model](https://huggingface.co/LetsThink/MfM-Pipeline-8B)
|
|
|
|
|
**Many-for-Many (MfM)** is a novel unified framework designed to train a single model capable of performing over 10 different visual generation and manipulation tasks, encompassing both images and videos. This approach addresses the high cost of training strong text-to-video foundation models by leveraging diverse existing datasets across various tasks. |
|
|
|
|
|
Specifically, MfM designs a lightweight adapter to unify different conditions across tasks and employs a joint image-video learning strategy to progressively train the model from scratch. This leads to a unified visual generation and manipulation model with improved video generation performance. Additionally, depth maps are introduced as a condition to help the model better perceive 3D space in visual generation. |
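
As a rough, illustrative sketch (not the authors' actual implementation), such a lightweight adapter can be pictured as a small projection module that maps a task-specific condition latent (a reference image, video frames, or a depth map) into the backbone's hidden dimension:

```python
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    """Illustrative lightweight condition adapter (a sketch, not the
    paper's implementation): projects a task-specific condition latent
    into the diffusion backbone's hidden dimension."""

    def __init__(self, cond_channels: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv3d(cond_channels, hidden_dim, kernel_size=1),
            nn.SiLU(),
            nn.Conv3d(hidden_dim, hidden_dim, kernel_size=1),
        )

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (batch, cond_channels, frames, height, width)
        return self.proj(cond)

# Example: project a 16-channel condition latent into a 1024-dim feature space
adapter = ConditionAdapter(cond_channels=16, hidden_dim=1024)
features = adapter(torch.randn(1, 16, 8, 32, 32))  # -> (1, 1024, 8, 32, 32)
```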
|
|
|
|
|
Two versions of the model are available (8B and 2B), each capable of performing a wide array of tasks. The 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. |
|
|
|
|
|
## ✨ Key Features
|
|
* **Unified Framework**: Trains a single model for over 10 different image and video generation and manipulation tasks. |
|
|
* **Efficient Design**: Utilizes a lightweight adapter to unify diverse conditions and a joint image-video learning strategy for progressive training. |
|
|
* **Depth-Aware Generation**: Incorporates depth maps as a condition to enhance the model's perception of 3D space. |
|
|
* **Versatile Capabilities**: Supports tasks such as text-to-video (T2V), image-to-video (I2V), video-to-video (V2V), and a variety of image and video manipulation tasks.
|
|
* **Competitive Performance**: The 8B model delivers highly competitive results in video generation. |
|
|
|
|
|
## 🔥 Latest News
|
|
|
|
|
- Inference code and model weights have been released; have fun with MfM ⭐⭐.
|
|
|
|
|
## 🚀 Inference
|
|
|
|
|
### 1. Install the requirements |
|
|
```bash |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
*Note: The `requirements.txt` file and `infer_mfm_pipeline.py` script can be found in the original [GitHub repository](https://github.com/SandAI-org/MAGI-1).* |
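
For reference, a typical setup flow (assuming the default repository layout) would be to clone the repository first and install from inside it:

```bash
# Clone the repository that hosts requirements.txt and infer_mfm_pipeline.py
git clone https://github.com/SandAI-org/MAGI-1.git
cd MAGI-1
pip install -r requirements.txt
```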
|
|
|
|
|
### 2. Download the pipeline from Hugging Face |
|
|
|
|
|
```python |
|
|
from huggingface_hub import snapshot_download |
|
|
|
|
|
# For the 8B model: |
|
|
snapshot_download(repo_id="LetsThink/MfM-Pipeline-8B", local_dir="your_local_path/MfM-Pipeline-8B") |
|
|
|
|
|
# For the 2B model: |
|
|
# snapshot_download(repo_id="LetsThink/MfM-Pipeline-2B", local_dir="your_local_path/MfM-Pipeline-2B") |
|
|
``` |
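
Since this card lists `diffusers` as the library, you may also be able to load the downloaded pipeline directly. The following is a minimal sketch assuming the repository ships custom pipeline code importable via `trust_remote_code`; the officially documented path is the `infer_mfm_pipeline.py` script described below:

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical direct load (assumption: the repo provides custom pipeline code);
# the documented inference path is infer_mfm_pipeline.py from the GitHub repo.
pipe = DiffusionPipeline.from_pretrained(
    "your_local_path/MfM-Pipeline-8B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")
```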
|
|
|
|
|
### 3. Run Inference |
|
|
|
|
|
You can refer to the inference script in `scripts/inference.sh` from the cloned GitHub repository. Replace `PIPELINE_PATH` with the local directory where you downloaded the model. |
|
|
|
|
|
Example for text-to-video (T2V) generation: |
|
|
```bash |
|
|
PIPELINE_PATH=your_local_path/MfM-Pipeline-8B # or your_local_path/MfM-Pipeline-2B |
|
|
OUTPUT_DIR=outputs |
|
|
TASK=t2v # Change task for different applications (e.g., i2v, v2v, inpaint) |
|
|
|
|
|
python infer_mfm_pipeline.py \ |
|
|
--pipeline_path $PIPELINE_PATH \ |
|
|
--output_dir $OUTPUT_DIR \ |
|
|
--task $TASK \ |
|
|
--crop_type keep_res \ |
|
|
--num_inference_steps 30 \ |
|
|
--guidance_scale 9 \ |
|
|
--motion_score 5 \ |
|
|
--num_samples 1 \ |
|
|
--upscale 4 \ |
|
|
--noise_aug_strength 0.0 \ |
|
|
--t2v_inputs your_prompt.txt # Path to a text file with your prompts |
|
|
``` |
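
The file passed to `--t2v_inputs` holds your text prompts. Its exact format is defined by `infer_mfm_pipeline.py`; assuming one prompt per line, you could create it like this:

```bash
# Create a prompt file (assumed format: one prompt per line)
cat > your_prompt.txt << 'EOF'
A golden retriever running through a sunlit meadow, slow motion.
A time-lapse of clouds drifting over a mountain lake at sunset.
EOF
```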
|
|
|
|
|
## 🖼️ Visual Results
|
|
|
|
|
<div align="center"> |
|
|
<img src='https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/visual_result.png' alt="Visual Results"> |
|
|
</div> |
|
|
|
|
|
## 📺 Demo Video
|
|
|
|
|
<div align="center"> |
|
|
<video src="https://github.com/user-attachments/assets/f1ddd1fd-1c2b-44e7-94dc-9f62963ab147" width="70%" controls> </video> |
|
|
</div> |
|
|
|
|
|
## 📮 Architecture
|
|
|
|
|
<div align="center"> |
|
|
<img src='https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/arch.png' alt="Architecture Diagram"> |
|
|
</div> |
|
|
|
|
|
## ✍️ Citation
|
|
|
|
|
If you find our code or model useful in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{yang2025MfM,
  title={Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks},
  author={Tao Yang and Ruibin Li and Yangming Shi and Yuqi Zhang and Qide Dong and Haoran Cheng and Weiguo Feng and Shilei Wen and Bingyue Peng and Lei Zhang},
  journal={arXiv preprint arXiv:2506.01758},
  year={2025}
}
|
|
``` |