<div align="center">
<!-- Project Title -->
<h1>
MotionAgent: Fine-grained Controllable Video Generation via<br>
Motion Field Agent
</h1>
<!-- Conference Info -->
<p><em>International Conference on Computer Vision, ICCV 2025.</em></p>
<!-- Project Badges -->
<p>
<a href="https://arxiv.org/abs/2502.03207">
<img src="https://img.shields.io/badge/arXiv-2502.03207-b31b1b.svg" alt="arXiv"/>
</a>
<a href="https://huggingface.co/leoisufa/MotionAgent">
<img src="https://img.shields.io/badge/HuggingFace-Model-yellow.svg" alt="HuggingFace"/>
</a>
</p>
</div>
<div align="center">
<strong>Xinyao Liao<sup>1,2</sup></strong>,
<strong>Xianfang Zeng<sup>2</sup></strong>,
<strong>Liao Wang<sup>2</sup></strong>,
<strong>Gang Yu<sup>2*</sup></strong>,
<strong>Guosheng Lin<sup>1*</sup></strong>,
<strong>Chi Zhang<sup>3</sup></strong>
<br><br>
<b>
<sup>1</sup> Nanyang Technological University&emsp;
<sup>2</sup> StepFun&emsp;
<sup>3</sup> Westlake University
</b>
</div>
## 🧩 Overview
<p align="center">
<img src="assets/agent.jpg" alt="Pipeline of Motion Field Agent" width="100%">
</p>
MotionAgent is a novel framework that enables **fine-grained motion control** for text-guided image-to-video generation. At its core is a **motion field agent** that parses motion information in text prompts and converts it into explicit *object trajectories* and *camera extrinsics*. These motion representations are analytically integrated into a unified optical flow, which conditions a diffusion-based image-to-video model to generate videos with precise and flexible motion control. An optional rethinking step further refines motion alignment by iteratively correcting the agent's previous actions.
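To make the flow-composition idea concrete, below is a minimal NumPy sketch of how an object trajectory and a camera move could be turned into a single flow field. The function names, the constant-translation object model, and the simple override rule are illustrative assumptions for intuition only, not the repository's actual implementation.

```python
import numpy as np

def camera_induced_flow(depth, K, R, t):
    """Per-pixel flow implied by a camera move (R, t), given metric depth.
    Back-project each pixel with its depth, apply the new extrinsics,
    and re-project; the 2D displacement is the camera-induced flow."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x HW
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # 3D points
    proj = K @ (R @ pts + t.reshape(3, 1))
    new_uv = (proj[:2] / proj[2:]).T.reshape(H, W, 2)
    return new_uv - np.stack([u, v], axis=-1).astype(float)             # (H, W, 2)

def object_flow(mask, start_xy, end_xy):
    """Constant translation inside an object mask, from a point trajectory."""
    shift = np.asarray(end_xy, dtype=float) - np.asarray(start_xy, dtype=float)
    return mask.astype(float)[..., None] * shift                        # (H, W, 2)

def unified_flow(cam_flow, obj_flow, mask):
    """Merge the two fields; object motion overrides camera motion inside the mask."""
    return np.where(mask[..., None] > 0, obj_flow, cam_flow)
```

In the actual system, a unified flow of this kind conditions the optical-flow ControlNet on top of the image-to-video diffusion model described below.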
## 🎥 Demo
<p align="center">
<a href="https://www.youtube.com/watch?v=O9WW2UpXsAI" target="_blank">
<img src="https://img.youtube.com/vi/O9WW2UpXsAI/maxresdefault.jpg"
alt="MotionAgent Demo Video"
width="80%"
style="max-width:900px; border-radius:10px; box-shadow:0 0 10px rgba(0,0,0,0.15);">
</a>
<br>
<em>Click the image above to watch the full video on YouTube 🎬</em>
</p>
## 🛠️ Dependencies and Installation
Follow the steps below to set up **MotionAgent** and run the demo smoothly 💫
### 🔹 1. Clone the Repository
Clone the official GitHub repository and enter the project directory:
```bash
git clone https://github.com/leoisufa/MotionAgent.git
cd MotionAgent
```
### 🔹 2. Environment Setup
```bash
# Create and activate conda environment
conda create -n motionagent python=3.10 -y
conda activate motionagent
# Install PyTorch with CUDA 12.4 support
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
# Install project dependencies
pip install -r requirements.txt
```
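As an optional sanity check (not part of the project's scripts), you can confirm that the CUDA-enabled PyTorch build was picked up before moving on:

```python
# Optional sanity check for the freshly created environment.
import torch

print(torch.__version__)           # expected: 2.4.1+cu124
print(torch.cuda.is_available())   # should print True on a CUDA-capable machine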
### 🔹 3. Install Grounded-Segment-Anything Dependencies
MotionAgent relies on external segmentation and grounding models.
Follow the steps below to install [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything):
```bash
# Navigate to models directory
cd models
# Clone the Grounded-Segment-Anything repository
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
# Enter the cloned directory
cd Grounded-Segment-Anything
# Install Segment Anything
python -m pip install -e segment_anything
# Install Grounding DINO
pip install --no-build-isolation -e GroundingDINO
```
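If the editable installs succeeded, both packages should now be importable. A quick optional check (the import names below follow the upstream repositories):

```python
# Optional: verify the two editable installs are visible to Python.
import groundingdino
import segment_anything

print(groundingdino.__file__)
print(segment_anything.__file__)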
### 🔹 4. Install Metric3D Dependencies
MotionAgent relies on an external monocular depth estimation model.
Follow the steps below to install [Metric3D](https://github.com/YvanYin/Metric3D):
```bash
# From the MotionAgent repository root, navigate to the models directory
cd models
# Clone the Metric3D repository
git clone https://github.com/YvanYin/Metric3D.git
```
## 🧱 Download Models
To run **MotionAgent**, please download all pretrained and auxiliary models listed below and organize them as shown in the example structure at the end of this section (most files go under `ckpts/`; the CMP checkpoint goes under `models/cmp/`).
### 1️⃣ **Optical Flow ControlNet Weights**
Download from 👉 [Hugging Face (MotionAgent)](https://huggingface.co/leoisufa/MotionAgent) and place the files in `ckpts`.
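If you prefer scripting the download, `huggingface_hub` can mirror the repository for you; this is just one convenient option, and you may need to rearrange files afterwards so they match the layout shown at the end of this section:

```python
# Optional: fetch the MotionAgent weights programmatically.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="leoisufa/MotionAgent", local_dir="ckpts")
```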
### 2️⃣ **Stable Video Diffusion**
Download from 👉 [Hugging Face (MOFA-Video-Hybrid/stable-video-diffusion-img2vid-xt-1-1)](https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/tree/main/ckpts/mofa/stable-video-diffusion-img2vid-xt-1-1) and save the model to `ckpts`.
### 3️⃣ **Grounding DINO**
Download the grounding model checkpoint using the command below:
```bash
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```
Then place it directly under `ckpts`.
### 4️⃣ **Segment Anything**
Download the segmentation model using:
```bash
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```
Then place it under `ckpts`.
### 5️⃣ **Metric Depth Estimator**
Download from 👉 [Google Drive (Metric3D)](https://drive.google.com/file/d/1YfmvXwpWmhLg3jSxnhT7LvY0yawlXcr_/view?usp=drive_link) and place the files in `ckpts`.
### 6️⃣ **CMP**
Download from 👉 [Hugging Face (MOFA-Video-Hybrid/cmp)](https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/resolve/main/models/cmp/experiments/semiauto_annot/resnet50_vip%2Bmpii_liteflow/checkpoints/ckpt_iter_42000.pth.tar) and save the model to `models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints`.
After all downloads and installations, your `ckpts/` folder should look like this:
```shell
ckpts/
├── controlnet/
├── stable-video-diffusion-img2vid-xt-1-1/
├── groundingdino_swint_ogc.pth
├── metric_depth_vit_small_800k.pth
└── sam_vit_h_4b8939.pth
```
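Before running the demo, you can optionally verify that everything landed in the expected places with a small script (paths taken from the structure above and the CMP step):

```python
# Optional: check that all downloaded checkpoints are where MotionAgent expects them.
from pathlib import Path

expected = [
    "ckpts/controlnet",
    "ckpts/stable-video-diffusion-img2vid-xt-1-1",
    "ckpts/groundingdino_swint_ogc.pth",
    "ckpts/metric_depth_vit_small_800k.pth",
    "ckpts/sam_vit_h_4b8939.pth",
    "models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints/ckpt_iter_42000.pth.tar",
]
for path in expected:
    status = "OK" if Path(path).exists() else "MISSING"
    print(f"{status:10s}{path}")
```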
## 🚀 Running the Demos
```bash
python run_agent.py
```
## 🔗 BibTeX
If you find [MotionAgent](https://arxiv.org/abs/2502.03207) useful for your research and applications, please cite using this BibTeX:
```BibTeX
@article{liao2025motionagent,
  title={MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent},
  author={Liao, Xinyao and Zeng, Xianfang and Wang, Liao and Yu, Gang and Lin, Guosheng and Zhang, Chi},
  journal={arXiv preprint arXiv:2502.03207},
  year={2025}
}
```
## 🙏 Acknowledgements
We thank the following projects for their excellent open-source work:
- [MOFA-Video](https://github.com/MyNiuuu/MOFA-Video)
- [AppAgent](https://github.com/TencentQQGYLab/AppAgent)
- [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
- [Metric3D](https://github.com/YvanYin/Metric3D)