---
license: mit
pipeline_tag: image-to-video
---

# MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

<a href="https://huggingface.co/papers/2503.16421"><img src="https://img.shields.io/static/v1?label=Paper&message=2503.16421&color=red&logo=arxiv"></a>
<a href="https://quanhaol.github.io/magicmotion-site/"><img src="https://img.shields.io/static/v1?label=Project&message=Page&color=green&logo=github-pages"></a>
<a href="https://huggingface.co/quanhaol/MagicMotion"><img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace Model"></a>
<a href="https://huggingface.co/datasets/quanhaol/MagicData"><img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Dataset-ffbd45.svg" alt="HuggingFace Dataset"></a>

<p align="center">
  <img src="https://huggingface.co/quanhaol/MagicMotion/resolve/main/assets/teaser2.webp" width="100%" alt="MagicMotion Teaser Image">
</p>

MagicMotion is a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality.

## Abstract

Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce **MagicMotion**, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present **MagicData**, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce **MagicBench**, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at https://quanhaol.github.io/magicmotion-site/.

<p align="center">
  <img src="https://huggingface.co/quanhaol/MagicMotion/resolve/main/assets/teaser.webp" width="100%" alt="MagicMotion Demo Image">
</p>

## News

-   `2025/07/28` 🔥🔥MagicData has been released [`here`](https://huggingface.co/datasets/quanhaol/MagicData). You are welcome to use our dataset!
-   `2025/06/26` 🔥🔥MagicMotion has been accepted to ICCV 2025!🎉🎉🎉
-   `2025/03/28` 🔥🔥We released an interactive Gradio demo for MagicMotion.
-   `2025/03/27` MagicMotion can now perform inference on a single 4090 GPU (with less than 24GB of GPU memory).
-   `2025/03/21` 🔥🔥We released MagicMotion, including inference code and model weights.

## Installation

To get started with MagicMotion, clone the repository and install the required dependencies:

```bash
# Clone this repository.
git clone https://github.com/quanhaol/MagicMotion
cd MagicMotion

# Install requirements
conda env create -n magicmotion --file environment.yml
conda activate magicmotion
pip install git+https://github.com/huggingface/diffusers

# Install Grounded_SAM2 for trajectory construction
cd trajectory_construction/Grounded_SAM2
pip install -e .
pip install --no-build-isolation -e grounding_dino

# Optional: For image editing
pip install git+https://github.com/huggingface/image_gen_aux
```

## Model Weights

The model weights are organized by stage within the `ckpts` folder and can be downloaded with `huggingface-cli`.

### Folder Structure

```
MagicMotion
└── ckpts
    ├── stage1
    │   └── mask.pt
    ├── stage2
    │   ├── box.pt
    │   └── box_perception_head.pt
    └── stage3
        ├── sparse_box.pt
        └── sparse_box_perception_head.pt
```

### Download Links

```bash
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download quanhaol/MagicMotion --local-dir ckpts
```
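After downloading, it can help to confirm that every expected file is in place before running inference. The following is a minimal sketch, not part of the official repository: the helper name `missing_checkpoints` is hypothetical, and it assumes the stage folders end up directly under `ckpts` exactly as in the folder structure shown earlier.

```python
from pathlib import Path

# Expected files, copied from the folder structure listed in this README.
EXPECTED_CHECKPOINTS = [
    "stage1/mask.pt",
    "stage2/box.pt",
    "stage2/box_perception_head.pt",
    "stage3/sparse_box.pt",
    "stage3/sparse_box_perception_head.pt",
]


def missing_checkpoints(ckpt_dir):
    """Return the expected checkpoint paths that are absent under ckpt_dir.

    Hypothetical helper: it only checks that the files exist, not their contents.
    """
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints("ckpts")
    if missing:
        print("Missing checkpoint files:", ", ".join(missing))
    else:
        print("All expected checkpoint files are present.")
```

If the download produced a different layout, adjust the directory passed to `missing_checkpoints` accordingly.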

## Inference

Inference requires **only 23GB of GPU memory** (tested on a single 24GB NVIDIA GeForce RTX 4090 GPU).

If you have sufficient GPU memory, you can modify `magicmotion/inference.py` to improve runtime performance:

```python
# Optimized setting (for GPUs with sufficient memory)
pipe.to("cuda")
# pipe.enable_sequential_cpu_offload()
```
> **Note**: Using the optimized setting can reduce runtime by up to 2x.

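The choice between the two settings can also be made at runtime. The sketch below is illustrative only: `place_pipeline` and its parameters are hypothetical names, and the memory threshold for keeping the whole pipeline on the GPU is left as an argument because this README does not state how much memory the optimized setting needs. It relies only on the `to("cuda")` and `enable_sequential_cpu_offload()` pipeline methods shown above.

```python
def place_pipeline(pipe, gpu_total_gib, full_gpu_min_gib):
    """Apply a placement strategy to an already-loaded pipeline.

    gpu_total_gib:    total memory of the target GPU, in GiB.
    full_gpu_min_gib: assumed minimum memory for keeping the whole pipeline
                      on the GPU (a value you must determine yourself).
    Returns the name of the strategy that was applied.
    """
    if gpu_total_gib >= full_gpu_min_gib:
        # Optimized setting: keep every submodule on the GPU.
        pipe.to("cuda")
        return "full_gpu"
    # Low-memory setting: move submodules to the GPU only when they are needed.
    pipe.enable_sequential_cpu_offload()
    return "sequential_cpu_offload"
```

With PyTorch installed, the total memory of the first GPU can be read as `torch.cuda.get_device_properties(0).total_memory / 2**30` and passed as `gpu_total_gib`.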
### Python Sample Usage (Conceptual)

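MagicMotion integrates with the `diffusers` library. The full pipeline involves custom trajectory construction, so the example below is conceptual only: it sketches how loading might look through `AutoPipelineForImage2Video`. Check the class name, loading path, and call arguments against your installed `diffusers` version and the repository's inference code before relying on it.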

```python
import torch
from diffusers import AutoPipelineForImage2Video
from PIL import Image
import os

# Ensure you have cloned the MagicMotion repository and downloaded the weights
# as per the "Installation" and "Model Weights" sections above.
# Example: If your MagicMotion folder is at './MagicMotion'
magicmotion_root = "./MagicMotion"
ckpt_path = os.path.join(magicmotion_root, "ckpts")

# Load the pipeline for a specific stage (e.g., stage 2 for box control)
# You might need to adjust `subfolder` based on the specific pipeline configuration
# in the MagicMotion project's inference logic.
# The `AutoPipelineForImage2Video` might require a specific structure if loading locally.
# Refer to the official GitHub repository for precise loading of the custom pipeline.
try:
    pipe = AutoPipelineForImage2Video.from_pretrained(
        magicmotion_root, # or a specific subfolder if a pipeline is defined there
        torch_dtype=torch.float16,
        local_files_only=True # Assumes checkpoints are downloaded locally
    )
    pipe.to("cuda") # Move to GPU if memory allows

    # Placeholder for actual inputs
    # You would load your input image (PIL Image) and generate/load trajectory conditions.
    # For example:
    # input_image = Image.open("your_input_image.png").convert("RGB")
    # trajectory_conditions = {
    #     "bboxes": [[(x1, y1, x2, y2), ...], ...] # list of bboxes per frame for each object
    # }

    # Example inference call (conceptual, exact arguments depend on MagicMotion's pipeline)
    # generated_video_frames = pipe(
    #     image=input_image,
    #     trajectory_conditions=trajectory_conditions,
    #     num_frames=25,
    #     guidance_scale=7.5,
    #     num_inference_steps=50,
    # ).frames

    # print("Pipeline loaded. Please replace placeholder inputs with actual data.")

except Exception as e:
    print(f"Failed to load pipeline directly. Please refer to the official GitHub repository's `magicmotion/scripts/inference/` for detailed usage instructions and specific model loading logic: {e}")

```
For complete inference scripts and instructions on constructing each trajectory type (mask, bounding box, sparse box), see the `magicmotion/scripts/inference` and `trajectory_construction` directories of the [official GitHub repository](https://github.com/quanhaol/MagicMotion).

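To make the dense-to-sparse idea concrete, here is a minimal sketch of a box trajectory as plain Python data. Everything in it is an assumption for illustration: the function names are hypothetical, boxes are taken to be `(x1, y1, x2, y2)` pixel tuples, and "sparse" is taken to mean that boxes are given on only some frames. The format MagicMotion actually expects is defined by the repository's `trajectory_construction` code and may differ.

```python
def linear_box_trajectory(start_box, end_box, num_frames):
    """Interpolate one (x1, y1, x2, y2) box per frame from start_box to end_box."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 on the first frame, 1.0 on the last
        boxes.append(tuple(round(a + (b - a) * t) for a, b in zip(start_box, end_box)))
    return boxes


def sparsify(boxes, stride):
    """Keep every stride-th box plus the final one; all other frames become None."""
    last = len(boxes) - 1
    return [box if (i % stride == 0 or i == last) else None
            for i, box in enumerate(boxes)]


# A 100x100 box sliding 200 pixels to the right over 5 frames.
dense = linear_box_trajectory((0, 0, 100, 100), (200, 0, 300, 100), num_frames=5)
sparse = sparsify(dense, stride=2)
print(dense)
print(sparse)
```

The dense list carries a box for every frame, while the sparse list keeps boxes only on frames 0, 2, and 4 and leaves the remaining frames unconstrained.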
## Gradio Demo

An interactive Gradio demo is available, which you can run locally:

```bash
bash magicmotion/scripts/app/app.sh
```

<img src="https://huggingface.co/quanhaol/MagicMotion/resolve/main/assets/images/gradio/1.png" alt="Gradio Demo Screenshot 1" style="width: 60%; border: 1px solid #ddd; border-radius: 4px; padding: 5px;"> <img src="https://huggingface.co/quanhaol/MagicMotion/resolve/main/assets/images/gradio/2.png" alt="Gradio Demo Screenshot 2" style="width: 60%; border: 1px solid #ddd; border-radius: 4px; padding: 5px;">

## Acknowledgements

We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:

-   [CogVideo](https://github.com/THUDM/CogVideo): An open source video generation framework by THUKEG.
-   [Open-Sora](https://github.com/hpcaitech/Open-Sora): An open source video generation framework by HPC-AI Tech.
-   [finetrainers](https://github.com/a-r-r-o-w/finetrainers): A memory-optimized training library for diffusion models.

Special thanks to the contributors of these libraries for their hard work and dedication!

## Citation

If you find our work useful, **please consider giving a star to this GitHub repository and citing it**:

```bibtex
@article{li2025magicmotion,
  title={MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance},
  author={Li, Quanhao and Xing, Zhen and Wang, Rui and Zhang, Hui and Dai, Qi and Wu, Zuxuan},
  journal={arXiv preprint arXiv:2503.16421},
  year={2025}
}
```

## Contact

If you have any suggestions or find our work helpful, feel free to contact us:

Email: liqh24@m.fudan.edu.cn or zhenxingfd@gmail.com