---
license: apache-2.0
pipeline_tag: any-to-any
library_name: diffusers
tags:
  - many-for-many
  - diffusion-model
  - video-generation
  - image-generation
  - text-to-video
  - image-to-video
  - video-to-video
  - image-manipulation
  - video-manipulation
---

# Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

<div align="center">
  <img src="https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/MfM_logo.jpeg" alt="MfM-logo" width="50%">
</div>

[📚 Paper](https://huggingface.co/papers/2506.01758) | [🌐 Project Page](https://leeruibin.github.io/MfMPage/) | [💻 Code](https://github.com/SandAI-org/MAGI-1) | [🤗 Model](https://huggingface.co/LetsThink/MfM-Pipeline-8B)

**Many-for-Many (MfM)** is a novel unified framework designed to train a single model capable of performing over 10 different visual generation and manipulation tasks, encompassing both images and videos. This approach addresses the high cost of training strong text-to-video foundation models by leveraging diverse existing datasets across various tasks.

Specifically, MfM designs a lightweight adapter to unify different conditions across tasks and employs a joint image-video learning strategy to progressively train the model from scratch. This leads to a unified visual generation and manipulation model with improved video generation performance. Additionally, depth maps are introduced as a condition to help the model better perceive 3D space in visual generation.

Two versions of the model are available (8B and 2B), each capable of performing a wide array of tasks. The 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines.

## ✨ Key Features
*   **Unified Framework**: Trains a single model for over 10 different image and video generation and manipulation tasks.
*   **Efficient Design**: Utilizes a lightweight adapter to unify diverse conditions and a joint image-video learning strategy for progressive training.
*   **Depth-Aware Generation**: Incorporates depth maps as a condition to enhance the model's perception of 3D space.
*   **Versatile Capabilities**: Supports tasks like text-to-video (T2V), image-to-video (I2V), video-to-video (V2V), and various image/video manipulation.
*   **Competitive Performance**: The 8B model delivers highly competitive results in video generation.

## 🔥 Latest News

- Inference code and model weights have been released. Have fun with MfM ⭐⭐.

## 🚀 Inference

### 1. Install the requirements
```bash
pip install -r requirements.txt
```
*Note: The `requirements.txt` file and `infer_mfm_pipeline.py` script can be found in the original [GitHub repository](https://github.com/SandAI-org/MAGI-1).*

### 2. Download the pipeline from Hugging Face

```python
from huggingface_hub import snapshot_download

# For the 8B model:
snapshot_download(repo_id="LetsThink/MfM-Pipeline-8B", local_dir="your_local_path/MfM-Pipeline-8B")

# For the 2B model:
# snapshot_download(repo_id="LetsThink/MfM-Pipeline-2B", local_dir="your_local_path/MfM-Pipeline-2B")
```
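
If you switch between the two model sizes often, the download step can be wrapped in a small helper. The following is a minimal sketch: the names `PIPELINE_REPOS`, `pipeline_repo_id`, and `download_pipeline` are hypothetical and not part of the MfM code base; only `snapshot_download` and the two repo ids come from the snippet above.

```python
from pathlib import Path

# Repository ids as listed in this model card.
PIPELINE_REPOS = {
    "8B": "LetsThink/MfM-Pipeline-8B",
    "2B": "LetsThink/MfM-Pipeline-2B",
}


def pipeline_repo_id(size):
    """Map a model size ("8B" or "2B") to its Hugging Face repo id."""
    key = size.upper()
    if key not in PIPELINE_REPOS:
        raise ValueError(
            f"unknown model size {size!r}; expected one of {sorted(PIPELINE_REPOS)}"
        )
    return PIPELINE_REPOS[key]


def download_pipeline(size="8B", root="your_local_path"):
    """Download the chosen pipeline into <root>/<repo name> and return that path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id = pipeline_repo_id(size)
    local_dir = Path(root) / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir
```

Calling `download_pipeline("2B")` would then place the weights in `your_local_path/MfM-Pipeline-2B`, which is the value to use for `PIPELINE_PATH` later on.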

### 3. Run Inference

You can refer to the inference script in `scripts/inference.sh` from the cloned GitHub repository. Replace `PIPELINE_PATH` with the local directory where you downloaded the model.

Example for text-to-video (T2V) generation:
```bash
PIPELINE_PATH=your_local_path/MfM-Pipeline-8B # or your_local_path/MfM-Pipeline-2B
OUTPUT_DIR=outputs
TASK=t2v # Change task for different applications (e.g., i2v, v2v, inpaint)

python infer_mfm_pipeline.py \
        --pipeline_path $PIPELINE_PATH \
        --output_dir $OUTPUT_DIR \
        --task $TASK \
        --crop_type keep_res \
        --num_inference_steps 30 \
        --guidance_scale 9 \
        --motion_score 5 \
        --num_samples 1 \
        --upscale 4 \
        --noise_aug_strength 0.0 \
        --t2v_inputs your_prompt.txt # Path to a text file with your prompts
```
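
The same T2V call can also be assembled from Python, which may be convenient when prompts are generated programmatically. This is a sketch under stated assumptions: `write_prompts` and `build_t2v_command` are hypothetical helpers, the prompt file is assumed to hold one prompt per line (the model card does not specify the format), and every flag is copied from the shell example above.

```python
import shlex
import sys
from pathlib import Path


def write_prompts(prompts, path):
    """Write prompts to a text file, one per line (assumed format)."""
    path = Path(path)
    path.write_text("\n".join(p.strip() for p in prompts) + "\n", encoding="utf-8")
    return path


def build_t2v_command(pipeline_path, prompt_file, output_dir="outputs",
                      steps=30, guidance_scale=9, motion_score=5):
    """Assemble the argument list for infer_mfm_pipeline.py (T2V only)."""
    return [
        sys.executable, "infer_mfm_pipeline.py",
        "--pipeline_path", str(pipeline_path),
        "--output_dir", str(output_dir),
        "--task", "t2v",
        "--crop_type", "keep_res",
        "--num_inference_steps", str(steps),
        "--guidance_scale", str(guidance_scale),
        "--motion_score", str(motion_score),
        "--num_samples", "1",
        "--upscale", "4",
        "--noise_aug_strength", "0.0",
        "--t2v_inputs", str(prompt_file),
    ]


if __name__ == "__main__":
    prompt_file = write_prompts(["A red fox running through snow"], "your_prompt.txt")
    cmd = build_t2v_command("your_local_path/MfM-Pipeline-8B", prompt_file)
    # Print the command for inspection; to launch it from the cloned
    # repository, pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.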

## 🖼️ Visual Results

<div align="center">
  <img src='https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/visual_result.png' alt="Visual Results">
</div>

## 📺 Demo Video

<div align="center">
  <video src="https://github.com/user-attachments/assets/f1ddd1fd-1c2b-44e7-94dc-9f62963ab147" width="70%" controls> </video>
</div>

## 📮 Architecture

<div align="center">
  <img src='https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/arch.png' alt="Architecture Diagram">
</div>

## ✍️ Citation

If you find our code or model useful in your research, please cite:

```bibtex
@article{yang2025MfM,
  title={Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks},
  author={Tao Yang and Ruibin Li and Yangming Shi and Yuqi Zhang and Qide Dong and Haoran Cheng and Weiguo Feng and Shilei Wen and Bingyue Peng and Lei Zhang},
  year={2025},
  journal={arXiv preprint arXiv:2506.01758}
}
```