---
pipeline_tag: any-to-any
library_name: diffusers
license: apache-2.0
---

# Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
**Many-for-Many (MfM)** is a unified framework introduced in the paper [Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks](https://huggingface.co/papers/2506.01758). The framework leverages training data from many different visual generation and manipulation tasks to train a single model for all of them. MfM uses a lightweight adapter to unify the diverse conditions across tasks and employs a joint image-video learning strategy for progressive training from scratch, yielding a unified visual generation and manipulation model with improved video generation performance. The model also integrates depth maps as a condition to enhance its perception of 3D space in visual generation. Two versions of the model (8B and 2B parameters) are available, each capable of performing more than 10 different tasks, including text-to-video (T2V), image-to-video (I2V), video-to-video (V2V), and various image and video manipulation tasks. The 8B model demonstrates highly competitive performance in video generation.

* **Paper:** [Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks](https://huggingface.co/papers/2506.01758)
* **Project Page:** [https://leeruibin.github.io/MfMPage/](https://leeruibin.github.io/MfMPage/)
* **Code:** [https://github.com/SandAI-org/MAGI-1](https://github.com/SandAI-org/MAGI-1)

## Visual Results

## Demo Video
## Architecture

## Usage

You can load the model with the `diffusers` library and perform various generation tasks. First, ensure you have the necessary requirements installed:

```bash
pip install -r requirements.txt
```

Then, download the pipeline from the Hugging Face Hub and use it for inference:

```python
import os

import torch
from diffusers import DiffusionPipeline
from huggingface_hub import snapshot_download

# Define a local directory to download the model to.
local_dir = "./MfM-Pipeline-8B"

# Download the pipeline from the Hugging Face Hub.
# Use "LetsThink/MfM-Pipeline-2B" for the 2B version.
snapshot_download(repo_id="LetsThink/MfM-Pipeline-8B", local_dir=local_dir)

# MfMPipeline is a custom pipeline class, so trust_remote_code=True is required.
pipe = DiffusionPipeline.from_pretrained(
    local_dir, torch_dtype=torch.float16, trust_remote_code=True
)
pipe.to("cuda")  # or your preferred device, e.g. "cpu"

# Example: text-to-video generation (task="t2v").
prompt = "A majestic eagle flying over snow-capped mountains."
output_dir = "outputs"
task = "t2v"  # The model supports multiple tasks such as "t2v", "i2v", "i2i", etc.

# Create the output directory if it doesn't exist.
os.makedirs(output_dir, exist_ok=True)

# Run inference.
# Parameters such as num_frames, num_inference_steps, guidance_scale, and
# motion_score are crucial and may vary per task. Refer to the official GitHub
# repository for recommended values and detailed usage for the different tasks.
video_frames = pipe(
    prompt=prompt,
    task=task,
    crop_type="keep_res",
    num_inference_steps=30,
    guidance_scale=9,
    motion_score=5,
    num_samples=1,
    upscale=4,
    noise_aug_strength=0.0,
    # The reference script infer_mfm_pipeline.py reads prompts from a file
    # (t2v_inputs); here we pass the prompt directly. For full functionality,
    # you may need to adapt the call accordingly.
).images[0]  # The pipeline returns a list of generated results; take the first one.

# You can save the video frames as a GIF or MP4 using libraries such as
# imageio or moviepy. Example using imageio
# (install with: pip install imageio imageio-ffmpeg):
# import imageio
# output_video_path = os.path.join(output_dir, "generated_video.mp4")
# imageio.mimsave(output_video_path, video_frames, fps=8)
# print(f"Generated video saved to {output_video_path}")
```

## Citation

If you find our code or model useful in your research, please cite:

```bibtex
@article{yang2025MfM,
  title={Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks},
  author={Tao Yang and Ruibin Li and Yangming Shi and Yuqi Zhang and Qide Dong and Haoran Cheng and Weiguo Feng and Shilei Wen and Bingyue Peng and Lei Zhang},
  journal={arXiv preprint arXiv:2506.01758},
  year={2025},
}
```
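The usage example above suggests imageio/ffmpeg for exporting the generated frames. If you prefer to avoid an ffmpeg dependency, an animated GIF can be written with Pillow alone. Below is a minimal sketch that assumes the pipeline returns a list of `PIL.Image.Image` frames — the actual return type of `MfMPipeline` may differ, so check the repository; the dummy frames here stand in for real model output:

```python
from PIL import Image

def save_frames_as_gif(frames, path, fps=8):
    """Write a list of PIL frames to an animated GIF (duration is per-frame, in ms)."""
    if not frames:
        raise ValueError("no frames to save")
    first, rest = frames[0], frames[1:]
    first.save(
        path,
        save_all=True,          # keep all frames, not just the first
        append_images=rest,
        duration=int(1000 / fps),
        loop=0,                 # loop forever
    )

# Dummy frames standing in for pipeline output (each a different shade of red).
frames = [Image.new("RGB", (64, 64), (i * 16, 0, 0)) for i in range(8)]
save_frames_as_gif(frames, "demo.gif", fps=8)
```

For MP4 output the commented imageio snippet in the usage example remains the simpler route, since GIF palettes are limited to 256 colors.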