|
|
--- |
|
|
pipeline_tag: any-to-any |
|
|
library_name: diffusers |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
# Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/MfM_logo.jpeg" alt="MfM-logo" width="50%"> |
|
|
</div> |
|
|
|
|
|
**Many-for-Many (MfM)** is a unified framework introduced in the paper [Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks](https://huggingface.co/papers/2506.01758). It leverages the training data available across many different visual generation and manipulation tasks to train a single model that handles all of them.
|
|
|
|
|
MfM utilizes a lightweight adapter to unify diverse conditions across different tasks and employs a joint image-video learning strategy for progressive training from scratch. This approach leads to a unified visual generation and manipulation model with improved video generation performance. The model also integrates depth maps as a condition to enhance its perception of 3D space in visual generation. |
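To make the adapter idea concrete, here is a minimal sketch of what a lightweight condition adapter can look like: a small projection that maps task-specific condition latents (e.g., depth maps or reference frames) into the backbone's latent space. The module name, shapes, and additive fusion below are illustrative assumptions, not the released MfM architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a lightweight condition adapter: it projects
# task-specific condition latents (depth maps, reference frames, ...)
# into the denoiser's latent space and fuses them additively. Shapes
# and the fusion scheme are assumptions, not the released MfM design.
class ConditionAdapter(nn.Module):
    def __init__(self, cond_channels: int, latent_channels: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv3d(cond_channels, latent_channels, kernel_size=1),
            nn.SiLU(),
            nn.Conv3d(latent_channels, latent_channels, kernel_size=1),
        )

    def forward(self, latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # latents, cond: (batch, channels, frames, height, width)
        return latents + self.proj(cond)

# Fuse 4-channel condition latents into 16-channel video latents.
adapter = ConditionAdapter(cond_channels=4, latent_channels=16)
latents = torch.randn(1, 16, 8, 32, 32)
cond = torch.randn(1, 4, 8, 32, 32)
fused = adapter(latents, cond)  # same shape as `latents`
```

Because such an adapter only projects conditions into an existing latent space, one backbone can be shared across tasks while each task contributes its own cheap condition inputs.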
|
|
|
|
|
Two versions of the model (8B and 2B parameters) are available, each capable of performing more than 10 different tasks, including text-to-video (T2V), image-to-video (I2V), video-to-video (V2V), and various image and video manipulation tasks. The 8B model demonstrates highly competitive performance in video generation. |
|
|
|
|
|
* **Paper:** [Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks](https://huggingface.co/papers/2506.01758) |
|
|
* **Project Page:** [https://leeruibin.github.io/MfMPage/](https://leeruibin.github.io/MfMPage/) |
|
|
* **Code:** [https://github.com/SandAI-org/MAGI-1](https://github.com/SandAI-org/MAGI-1) |
|
|
|
|
|
## Visual Results |
|
|
|
|
|
<img src='https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/visual_result.png'> |
|
|
|
|
|
## Demo Video |
|
|
|
|
|
<div align="center"> |
|
|
<video src="https://github.com/user-attachments/assets/f1ddd1fd-1c2b-44e7-94dc-9f62963ab147" width="70%" controls> </video> |
|
|
</div> |
|
|
|
|
|
## Architecture |
|
|
|
|
|
<img src='https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/arch.png'> |
|
|
|
|
|
## Usage |
|
|
|
|
|
You can load the model using the `diffusers` library and perform various generation tasks. |
|
|
|
|
|
First, install the dependencies listed in the code repository's `requirements.txt`:
|
|
|
|
|
```bash |
|
|
pip install -r requirements.txt |
|
|
``` |
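If you are not working from a checkout of the code repository, you can instead install just the packages that the example below imports (plus `imageio` for saving the result); the repository's `requirements.txt` remains the authoritative list, so any pinned versions there take precedence.

```bash
pip install torch diffusers huggingface_hub imageio imageio-ffmpeg
```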
|
|
|
|
|
Then, you can download the pipeline from Hugging Face Hub and use it for inference: |
|
|
|
|
|
```python |
|
|
from huggingface_hub import snapshot_download |
|
|
from diffusers import DiffusionPipeline |
|
|
import torch |
|
|
import os |
|
|
|
|
|
# Define a local directory to download the model |
|
|
local_dir = "./MfM-Pipeline-8B" |
|
|
|
|
|
# Download the pipeline from Hugging Face Hub |
|
|
# You can use "LetsThink/MfM-Pipeline-2B" for the 2B version |
|
|
snapshot_download(repo_id="LetsThink/MfM-Pipeline-8B", local_dir=local_dir) |
|
|
|
|
|
# Load the pipeline. Since MfMPipeline is a custom class, we need trust_remote_code=True. |
|
|
pipe = DiffusionPipeline.from_pretrained(local_dir, torch_dtype=torch.float16, trust_remote_code=True) |
|
|
pipe.to("cuda") # or your preferred device like "cpu" |
|
|
|
|
|
# Example: Text-to-Video generation (task="t2v") |
|
|
prompt = "A majestic eagle flying over snow-capped mountains." |
|
|
output_dir = "outputs" |
|
|
task = "t2v" # The model supports multiple tasks like "t2v", "i2v", "i2i", etc. |
|
|
|
|
|
# Create output directory if it doesn't exist |
|
|
os.makedirs(output_dir, exist_ok=True) |
|
|
|
|
|
# Run inference |
|
|
# Parameters like num_frames, num_inference_steps, guidance_scale, motion_score |
|
|
# are crucial and may vary per task. Refer to the official GitHub repository |
|
|
# for recommended values and detailed usage for different tasks. |
|
|
video_frames = pipe( |
|
|
prompt=prompt, |
|
|
task=task, |
|
|
crop_type="keep_res", |
|
|
num_inference_steps=30, |
|
|
guidance_scale=9, |
|
|
motion_score=5, |
|
|
num_samples=1, |
|
|
upscale=4, |
|
|
noise_aug_strength=0.0, |
|
|
    # Note: the reference script infer_mfm_pipeline.py reads prompts from a file
    # (t2v_inputs); here the prompt is passed directly. Adapt as needed for other tasks.
|
|
).images[0]  # the pipeline returns a list of generated results; take the first one
|
|
|
|
|
# Save the frames as a GIF or MP4 with a library such as imageio or moviepy;
# see the standalone example after this code block.
|
|
``` |
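To persist the generated frames, the saving step referenced in the snippet above can be run as a standalone script. This assumes `video_frames` is a sequence of PIL images or `HxWx3` uint8 arrays, which is typical for `diffusers`-style pipelines but should be verified against the MfM pipeline's actual output type.

```python
import os

import imageio
import numpy as np

# Convert frames (PIL images or arrays) to numpy arrays and write an MP4.
# Requires imageio and imageio-ffmpeg (pip install imageio imageio-ffmpeg).
frames = [np.asarray(frame) for frame in video_frames]
output_video_path = os.path.join(output_dir, "generated_video.mp4")
imageio.mimsave(output_video_path, frames, fps=8)
print(f"Generated video saved to {output_video_path}")
```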
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our code or model useful in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{yang2025MfM,
  title={Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks},
  author={Tao Yang and Ruibin Li and Yangming Shi and Yuqi Zhang and Qide Dong and Haoran Cheng and Weiguo Feng and Shilei Wen and Bingyue Peng and Lei Zhang},
  journal={arXiv preprint arXiv:2506.01758},
  year={2025}
}
|
|
``` |