license: apache-2.0
pipeline_tag: any-to-any
library_name: diffusers
tags:
- many-for-many
- diffusion-model
- video-generation
- image-generation
- text-to-video
- image-to-video
- video-to-video
- image-manipulation
- video-manipulation
Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
📚 Paper | 🌐 Project Page | 💻 Code | 🤗 Model
Many-for-Many (MfM) is a novel unified framework designed to train a single model capable of performing over 10 different visual generation and manipulation tasks, encompassing both images and videos. This approach addresses the high cost of training strong text-to-video foundation models by leveraging diverse existing datasets across various tasks.
Specifically, MfM designs a lightweight adapter to unify different conditions across tasks and employs a joint image-video learning strategy to progressively train the model from scratch. This leads to a unified visual generation and manipulation model with improved video generation performance. Additionally, depth maps are introduced as a condition to help the model better perceive 3D space in visual generation.
Two versions of the model are available, 8B and 2B, each capable of performing a wide array of tasks. The 8B model demonstrates highly competitive video generation performance compared with open-source models and even commercial engines.
✨ Key Features
- Unified Framework: Trains a single model for over 10 different image and video generation and manipulation tasks.
- Efficient Design: Utilizes a lightweight adapter to unify diverse conditions and a joint image-video learning strategy for progressive training.
- Depth-Aware Generation: Incorporates depth maps as a condition to enhance the model's perception of 3D space.
- Versatile Capabilities: Supports tasks like text-to-video (T2V), image-to-video (I2V), video-to-video (V2V), and various image/video manipulation.
- Competitive Performance: The 8B model delivers highly competitive results in video generation.
🔥 Latest News
- Inference code and model weights have been released. Have fun with MfM ⭐⭐.
🚀 Inference
1. Install the requirements
pip install -r requirements.txt
Note: The requirements.txt file and infer_mfm_pipeline.py script can be found in the original GitHub repository.
2. Download the pipeline from Hugging Face
from huggingface_hub import snapshot_download
# For the 8B model:
snapshot_download(repo_id="LetsThink/MfM-Pipeline-8B", local_dir="your_local_path/MfM-Pipeline-8B")
# For the 2B model:
# snapshot_download(repo_id="LetsThink/MfM-Pipeline-2B", local_dir="your_local_path/MfM-Pipeline-2B")
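If you switch between the two model sizes often, the download step can be wrapped in a small helper. The following is a minimal sketch, not part of the MfM repository: the repo IDs are the ones listed above, while `MFM_REPOS`, `resolve_pipeline`, and `download_pipeline` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Repo IDs as listed in the download step above.
MFM_REPOS = {
    "8B": "LetsThink/MfM-Pipeline-8B",
    "2B": "LetsThink/MfM-Pipeline-2B",
}


def resolve_pipeline(size: str, root: str = "your_local_path") -> tuple[str, Path]:
    """Return (repo_id, local_dir) for a model size such as "8B" or "2b"."""
    key = size.upper()
    if key not in MFM_REPOS:
        raise ValueError(f"Unknown model size {size!r}; expected one of {sorted(MFM_REPOS)}")
    repo_id = MFM_REPOS[key]
    # Mirror the layout used above: <root>/MfM-Pipeline-<size>
    return repo_id, Path(root) / repo_id.split("/")[-1]


def download_pipeline(size: str, root: str = "your_local_path") -> Path:
    """Download the chosen pipeline and return the local directory."""
    # Imported lazily so resolve_pipeline works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id, local_dir = resolve_pipeline(size, root)
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir
```

Calling `download_pipeline("2B")` would then fetch the 2B pipeline into `your_local_path/MfM-Pipeline-2B`, which is the value to use for PIPELINE_PATH in the next step.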
3. Run Inference
You can refer to the inference script in scripts/inference.sh from the cloned GitHub repository. Replace PIPELINE_PATH with the local directory where you downloaded the model.
Example for text-to-video (T2V) generation:
PIPELINE_PATH=your_local_path/MfM-Pipeline-8B # or your_local_path/MfM-Pipeline-2B
OUTPUT_DIR=outputs
TASK=t2v # Change task for different applications (e.g., i2v, v2v, inpaint)
python infer_mfm_pipeline.py \
--pipeline_path "$PIPELINE_PATH" \
--output_dir "$OUTPUT_DIR" \
--task "$TASK" \
--crop_type keep_res \
--num_inference_steps 30 \
--guidance_scale 9 \
--motion_score 5 \
--num_samples 1 \
--upscale 4 \
--noise_aug_strength 0.0 \
--t2v_inputs your_prompt.txt # Path to a text file with your prompts
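The same T2V invocation can also be launched from Python, for example when scripting several runs. This is a minimal sketch that only reuses the flags and values shown in the command above. `build_t2v_command` and `run_command` are hypothetical helpers, and the sketch assumes it is run from the directory that contains infer_mfm_pipeline.py. Input flags for tasks other than T2V are not covered here because they are not shown in this document.

```python
import shlex
import subprocess
import sys


def build_t2v_command(pipeline_path, prompts_file, output_dir="outputs", num_samples=1):
    """Build the argument list for the T2V example, using the flag values shown above."""
    return [
        sys.executable, "infer_mfm_pipeline.py",
        "--pipeline_path", str(pipeline_path),
        "--output_dir", str(output_dir),
        "--task", "t2v",
        "--crop_type", "keep_res",
        "--num_inference_steps", "30",
        "--guidance_scale", "9",
        "--motion_score", "5",
        "--num_samples", str(num_samples),
        "--upscale", "4",
        "--noise_aug_strength", "0.0",
        "--t2v_inputs", str(prompts_file),
    ]


def run_command(cmd):
    """Run the inference script and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_t2v_command("your_local_path/MfM-Pipeline-8B", "your_prompt.txt")
    # Print the equivalent shell command for inspection before launching it.
    print(shlex.join(cmd))
    # run_command(cmd)  # uncomment once the repository and weights are in place
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.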
🖼️ Visual Results
📺 Demo Video
📮 Architecture
✍️ Citation
If you find our code or model useful in your research, please cite:
@article{yang2025MfM,
title={Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks},
author={Tao Yang and Ruibin Li and Yangming Shi and Yuqi Zhang and Qide Dong and Haoran Cheng and Weiguo Feng and Shilei Wen and Bingyue Peng and Lei Zhang},
year={2025},
journal={arXiv preprint arXiv:2506.01758},
}