---
license: mit
pipeline_tag: image-to-video
---

# MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

<a href="https://huggingface.co/papers/2503.16421"><img src="https://img.shields.io/static/v1?label=Paper&message=2503.16421&color=red&logo=arxiv"></a>
<a href="https://quanhaol.github.io/magicmotion-site/"><img src="https://img.shields.io/static/v1?label=Project&message=Page&color=green&logo=github-pages"></a>
<a href="https://huggingface.co/quanhaol/MagicMotion"><img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace Model"></a>
<a href="https://huggingface.co/datasets/quanhaol/MagicData"><img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Dataset-ffbd45.svg" alt="HuggingFace Dataset"></a>

<p align="center">
  <img src="https://huggingface.co/quanhaol/MagicMotion/resolve/main/assets/teaser2.webp" width="100%" alt="MagicMotion Teaser Image">
</p>

MagicMotion is a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality.

## Abstract

Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce **MagicMotion**, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present **MagicData**, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce **MagicBench**, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at https://quanhaol.github.io/magicmotion-site/.

<p align="center">
  <img src="https://huggingface.co/quanhaol/MagicMotion/resolve/main/assets/teaser.webp" width="100%" alt="MagicMotion Demo Image">
</p>

## News

-   `2025/07/28` 🔥🔥MagicData has been released [`here`](https://huggingface.co/datasets/quanhaol/MagicData). You are welcome to use our dataset!
-   `2025/06/26` 🔥🔥MagicMotion has been accepted by ICCV 2025!🎉🎉🎉
-   `2025/03/28` 🔥🔥We released an interactive Gradio demo for MagicMotion.
-   `2025/03/27` MagicMotion can now perform inference on a single 4090 GPU (with less than 24GB of GPU memory).
-   `2025/03/21` 🔥🔥We released MagicMotion, including inference code and model weights.

## Installation

To get started with MagicMotion, clone the repository and install the required dependencies:

```bash
# Clone this repository.
git clone https://github.com/quanhaol/MagicMotion
cd MagicMotion

# Install requirements
conda env create -n magicmotion --file environment.yml
conda activate magicmotion
pip install git+https://github.com/huggingface/diffusers

# Install Grounded_SAM2 for trajectory construction
cd trajectory_construction/Grounded_SAM2
pip install -e .
pip install --no-build-isolation -e grounding_dino

# Optional: For image editing
pip install git+https://github.com/huggingface/image_gen_aux
```
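After installation, a quick import check can confirm that the core Python packages resolved inside the `magicmotion` environment. The sketch below is a minimal example: the helper name `missing_modules` is hypothetical, and the package list (`torch`, `diffusers`) is an assumption based on the commands above, not the full contents of `environment.yml`.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed core dependencies; extend this list to match environment.yml.
    missing = missing_modules(["torch", "diffusers"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("Core packages found.")
```

Run it with the `magicmotion` environment activated; an empty result only shows that the packages are importable, not that their versions match.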

## Model Weights

The model weights are organized by stage within the `ckpts` folder and can be downloaded using `huggingface-cli`.

### Folder Structure

```
MagicMotion
└── ckpts
    ├── stage1
    │   └── mask.pt
    ├── stage2
    │   ├── box.pt
    │   └── box_perception_head.pt
    └── stage3
        ├── sparse_box.pt
        └── sparse_box_perception_head.pt
```

### Download Links

```bash
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download quanhaol/MagicMotion --local-dir ckpts
```
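Once the download finishes, it can help to confirm that every file from the folder structure above is in place before running inference. The following is a minimal sketch, not part of the official repository: the helper name `missing_checkpoints` is hypothetical, and the file list is copied from the folder structure shown in this README.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the folder structure shown in this README.
EXPECTED_CHECKPOINTS = [
    "stage1/mask.pt",
    "stage2/box.pt",
    "stage2/box_perception_head.pt",
    "stage3/sparse_box.pt",
    "stage3/sparse_box_perception_head.pt",
]


def missing_checkpoints(ckpt_dir: str) -> List[str]:
    """Return the expected checkpoint files that are absent from `ckpt_dir`."""
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the MagicMotion repository root.
    missing = missing_checkpoints("ckpts")
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  -", rel)
    else:
        print("All expected checkpoint files are present.")
```

The check only tests for file presence; it does not validate file contents or sizes.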

## Inference

Inference requires **only 23GB of GPU memory** (tested on a single 24GB NVIDIA GeForce RTX 4090 GPU).

If you have sufficient GPU memory, you can modify `magicmotion/inference.py` to improve runtime performance:

```python
# Optimized setting (for GPUs with sufficient memory)
pipe.to("cuda")
# pipe.enable_sequential_cpu_offload()
```
> **Note**: Using the optimized setting can reduce runtime by up to 2x.

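The choice between the two settings can also be made at runtime. The sketch below assumes a pipeline object that exposes `to(...)` and `enable_sequential_cpu_offload()`, as in the snippet above. The helper name `place_pipeline` and the 40 GiB cutoff are hypothetical: this README does not state how much memory the full-GPU setting needs, so adjust the threshold to your own measurements.

```python
from typing import Any


def place_pipeline(
    pipe: Any,
    total_gpu_gib: float,
    full_gpu_threshold_gib: float = 40.0,  # hypothetical cutoff, not a measured requirement
) -> str:
    """Apply a memory strategy to `pipe` and return the name of the strategy used."""
    if total_gpu_gib >= full_gpu_threshold_gib:
        # Faster: keep the whole pipeline resident on the GPU.
        pipe.to("cuda")
        return "cuda"
    # Slower, but fits in less GPU memory by offloading submodules to the CPU.
    pipe.enable_sequential_cpu_offload()
    return "sequential_cpu_offload"
```

With PyTorch, the total memory of the first GPU in GiB can be read as `torch.cuda.get_device_properties(0).total_memory / 1024**3` and passed in as `total_gpu_gib`.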
### Python Sample Usage (Conceptual)

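MagicMotion integrates with the `diffusers` library. The full pipeline involves custom trajectory construction and project-specific loading code, so the example below is conceptual: it sketches how loading through `AutoPipelineForImage2Video` might look with the downloaded checkpoints. Treat the class name and call arguments as assumptions, not a confirmed API.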

```python
import os

import torch

# Ensure you have cloned the MagicMotion repository and downloaded the weights
# as described in the "Installation" and "Model Weights" sections above.
# Example: the MagicMotion folder is assumed to be at './MagicMotion'.
magicmotion_root = "./MagicMotion"
ckpt_path = os.path.join(magicmotion_root, "ckpts")  # stage checkpoints live here

# NOTE: `AutoPipelineForImage2Video` is an assumption of this conceptual example;
# it may not exist in your installed `diffusers` version, and the released
# checkpoints are plain `.pt` files, not a diffusers pipeline folder.
# The import sits inside the `try` block so that a missing class is reported by
# the fallback message below instead of crashing the script at import time.
# Refer to the official GitHub repository for precise loading of the custom pipeline.
try:
    from diffusers import AutoPipelineForImage2Video

    pipe = AutoPipelineForImage2Video.from_pretrained(
        magicmotion_root,  # or a specific subfolder if a pipeline is defined there
        torch_dtype=torch.float16,
        local_files_only=True,  # assumes checkpoints are downloaded locally
    )
    pipe.to("cuda")  # move to GPU if memory allows

    # Placeholder for actual inputs.
    # You would load your input image (PIL Image) and generate/load trajectory conditions.
    # For example:
    # from PIL import Image
    # input_image = Image.open("your_input_image.png").convert("RGB")
    # trajectory_conditions = {
    #     "bboxes": [[(x1, y1, x2, y2), ...], ...]  # list of bboxes per frame for each object
    # }

    # Example inference call (conceptual, exact arguments depend on MagicMotion's pipeline)
    # generated_video_frames = pipe(
    #     image=input_image,
    #     trajectory_conditions=trajectory_conditions,
    #     num_frames=25,
    #     guidance_scale=7.5,
    #     num_inference_steps=50,
    # ).frames  # the output attribute name is also an assumption

    print("Pipeline loaded. Please replace placeholder inputs with actual data.")

except Exception as e:
    print(f"Failed to load pipeline directly. Please refer to the official GitHub repository's `magicmotion/scripts/inference/` for detailed usage instructions and specific model loading logic: {e}")

```
For complete inference scripts and how to construct various trajectories (mask, bounding box, sparse box), please refer to the [official GitHub repository](https://github.com/quanhaol/MagicMotion) in the `magicmotion/scripts/inference` and `trajectory_construction` directories.

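To make the three condition levels concrete, the sketch below shows one way dense masks could be reduced to bounding boxes and then to sparse boxes. It is an illustration under stated assumptions, not the repository's `trajectory_construction` code: the helper names `mask_to_box` and `sparsify_boxes` are hypothetical, and keeping every `stride`-th frame plus the last frame is an assumed sparsification rule.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), inclusive pixel coordinates


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest box around nonzero pixels, or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def sparsify_boxes(boxes: List[Optional[Box]], stride: int) -> List[Optional[Box]]:
    """Keep boxes on every `stride`-th frame and the last frame; drop the rest."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    last = len(boxes) - 1
    return [
        box if (i % stride == 0 or i == last) else None
        for i, box in enumerate(boxes)
    ]


if __name__ == "__main__":
    # A 2x2 object moving one pixel to the right per frame on a 4x6 canvas.
    masks = []
    for shift in range(5):
        frame = [[0] * 6 for _ in range(4)]
        for y in (1, 2):
            for x in (shift, shift + 1):
                frame[y][x] = 1
        masks.append(frame)

    dense_boxes = [mask_to_box(m) for m in masks]  # one box per frame
    sparse_boxes = sparsify_boxes(dense_boxes, stride=2)  # boxes on frames 0, 2, 4
    print(dense_boxes)
    print(sparse_boxes)
```

Each step discards information (pixel-level shape, then per-frame position), which is why masks are the densest condition and sparse boxes the sparsest.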
## Gradio Demo

An interactive Gradio demo is available, which you can run locally:

```bash
bash magicmotion/scripts/app/app.sh
```

<img src="https://huggingface.co/quanhaol/MagicMotion/resolve/main/assets/images/gradio/1.png" alt="Gradio Demo Screenshot 1" style="width: 60%; border: 1px solid #ddd; border-radius: 4px; padding: 5px;"> <img src="https://huggingface.co/quanhaol/MagicMotion/resolve/main/assets/images/gradio/2.png" alt="Gradio Demo Screenshot 2" style="width: 60%; border: 1px solid #ddd; border-radius: 4px; padding: 5px;">

## Acknowledgements

We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:

-   [CogVideo](https://github.com/THUDM/CogVideo): An open-source video generation framework by THUKEG.
-   [Open-Sora](https://github.com/hpcaitech/Open-Sora): An open-source video generation framework by HPC-AI Tech.
-   [finetrainers](https://github.com/a-r-r-o-w/finetrainers): A memory-optimized training library for diffusion models.

Special thanks to the contributors of these libraries for their hard work and dedication!

## Citation

If you find our work useful, **please consider giving a star to this GitHub repository and citing it**:

```bibtex
@article{li2025magicmotion,
  title={MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance},
  author={Li, Quanhao and Xing, Zhen and Wang, Rui and Zhang, Hui and Dai, Qi and Wu, Zuxuan},
  journal={arXiv preprint arXiv:2503.16421},
  year={2025}
}
```

## Contact

If you have any suggestions or find our work helpful, feel free to contact us:

Email: liqh24@m.fudan.edu.cn or zhenxingfd@gmail.com