---
license: mit
datasets:
- quanhaol/MagicData
base_model:
- quanhaol/Wan2.2-TI2V-5B-Turbo
- Wan-AI/Wan2.2-TI2V-5B
tags:
- image-to-video
- Trajectory-Control
- Fewstep-video-gen
---

<br>
<a href="https://arxiv.org/pdf/2603.12146"><img src="https://img.shields.io/static/v1?label=Paper&message=2603.12146&color=red&logo=arxiv"></a>
<a href="https://quanhaol.github.io/flashmotion-site/"><img src="https://img.shields.io/static/v1?label=Project&message=Page&color=green&logo=github-pages"></a>
<a href="https://huggingface.co/quanhaol/FlashMotion"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>
<a href="https://huggingface.co/datasets/quanhaol/FlashBench"><img src="https://img.shields.io/badge/🤗_HuggingFace-Benchmark-ffbd45.svg" alt="HuggingFace"></a>


> **FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance**
> <br>
> [Quanhao Li](https://github.com/quanhaol)<sup>1</sup>, [Zhen Xing](https://chenhsing.github.io/)<sup>1</sup>, [Rui Wang](https://scholar.google.com/citations?user=116smmsAAAAJ&hl=en)<sup>1</sup>, Haidong Cao<sup>1</sup>, [Qi Dai](https://daiqi1989.github.io/)<sup>2</sup>, Daoguo Dong<sup>1</sup>, and [Zuxuan Wu](https://zxwu.azurewebsites.net/)<sup>1</sup>
>
> <sup>1</sup> Fudan University; <sup>2</sup> Microsoft Research Asia


## 💡 Abstract

Recent advances in trajectory-controllable video generation have achieved remarkable progress, with most prior methods using adapter-based architectures for precise motion control along predefined trajectories.
However, all of these methods rely on a multi-step denoising process, which introduces substantial time redundancy and computational overhead.
While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying these approaches to trajectory-controllable video generation causes noticeable degradation in both video quality and trajectory accuracy.
To bridge this gap, we introduce **FlashMotion**, a novel training framework for few-step trajectory-controllable video generation.
We first train a trajectory adapter on a multi-step video generator for precise trajectory control.
We then distill the generator into a few-step version to accelerate video generation.
Finally, we finetune the adapter with a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos.
For evaluation, we introduce **FlashBench**, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects.
Experiments on two adapter architectures show that FlashMotion surpasses both existing video distillation methods and previous multi-step models in visual quality and trajectory consistency.


## 📣 Updates
- `2026/03/13` 🔥🔥 We released FlashMotion, including the training code, inference code, model weights, and the evaluation benchmark.
- `2026/02` 🔥🔥🔥 FlashMotion has been accepted to CVPR 2026!


## 📑 Table of Contents

- [💡 Abstract](#-abstract)
- [📣 Updates](#-updates)
- [📑 Table of Contents](#-table-of-contents)
- [✅ TODO List](#-todo-list)
- [🛠 Installation](#-installation)
- [📦 Model Weights](#-model-weights)
  - [Folder Structure](#folder-structure)
  - [Download Links](#download-links)
- [⚽️ Dataset Preparation](#️-dataset-preparation)
- [🚀 Inference](#-inference)
  - [Scripts](#scripts)
- [🏋️ Train](#️-train)
  - [SlowAdapter Training](#slowadapter-training)
  - [FastGenerator Training](#fastgenerator-training)
  - [FastAdapter Training](#fastadapter-training)
- [🤝 Acknowledgements](#-acknowledgements)
- [📞 Contact](#-contact)


## ✅ TODO List

- [x] Release our inference code and model weights
- [x] Release our training code
- [x] Release our evaluation benchmark


## 🛠 Installation

```bash
# Clone this repository.
git clone https://github.com/quanhaol/FlashMotion
cd FlashMotion

# Install the requirements.
conda create -n flashmotion python=3.10 -y
conda activate flashmotion
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
python setup.py develop
```
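
As a quick sanity check (a suggestion of ours rather than an official setup step; it assumes PyTorch is pulled in by `requirements.txt`), you can confirm that the environment sees your GPU and that `flash-attn` imports cleanly:

```bash
# Optional environment check: print the torch version, CUDA availability, and flash-attn version.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
```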


## 📦 Model Weights


### Folder Structure


```
FlashMotion
└── ckpts
    ├── FastGenerator
    │   └── model.pt
    ├── SlowAdapter
    │   ├── ResNet
    │   │   └── model.pt
    │   └── ControlNet
    │       └── model.pt
    └── FastAdapter
        ├── ResNet
        │   └── model.pt
        └── ControlNet
            └── model.pt
```


### Download Links


Please use the following commands to download the model weights:


```bash
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download quanhaol/FlashMotion --local-dir ckpts
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir wan_models/Wan2.2-TI2V-5B
```
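
After the downloads finish, you can verify that the checkpoints match the layout shown in [Folder Structure](#folder-structure) (again a quick check we suggest, not an official step):

```bash
# List the downloaded FlashMotion checkpoints and the Wan2.2-TI2V-5B base model files.
find ckpts -name "model.pt"
ls wan_models/Wan2.2-TI2V-5B
```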


## ⚽️ Dataset Preparation
All three training stages of FlashMotion use [MagicData](https://huggingface.co/datasets/quanhaol/MagicData), an open-source dataset built for trajectory-controllable video generation.
Please follow [this README](https://huggingface.co/datasets/quanhaol/MagicData) to download and extract the data to a suitable path on your machine.


The dataset should be organized as follows:
```
MagicData
├── videos
│   ├── videoid_1.mp4
│   ├── videoid_2.mp4
│   └── ...
├── masks
│   ├── videoid_1
│   │   ├── annotated_frame_00000.png
│   │   ├── annotated_frame_00001.png
│   │   └── ...
│   └── videoid_2
│       └── ...
├── boxs
│   ├── videoid_1
│   │   ├── annotated_frame_00000.png
│   │   ├── annotated_frame_00001.png
│   │   └── ...
│   └── videoid_2
│       └── ...
└── MagicData.csv  # detailed information about each video
```
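
Once extracted, a few shell commands can sanity-check the layout (the paths below assume MagicData sits in the repository root; adjust them to wherever you placed the data):

```bash
# Count the videos, peek at the CSV header, and confirm per-video mask folders exist.
ls MagicData/videos | wc -l
head -n 1 MagicData/MagicData.csv
ls MagicData/masks | head -n 3
```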


## 🚀 Inference
Inference requires around 42 GiB of GPU memory with the ResNet FastAdapter and around 50 GiB with the ControlNet FastAdapter, both measured on a single NVIDIA A100 GPU.


⚡️⚡️⚡️ Denoising a video takes only 11 seconds with the ResNet adapter and around 24 seconds with the ControlNet adapter.


### Scripts


We provide demo scripts to run both types of trajectory adapter.
```bash
# Demo inference script for each adapter type
bash running_scripts/inference/i2v_control_fewstep_controlnet.sh
bash running_scripts/inference/i2v_control_fewstep_resnet.sh
```
We also provide a sample input image and trajectory maps in `./assets`.


Feel free to replace `--prompt`, `--image`, and `--trajectory` with your own prompt, input image, and trajectory maps.
> **Note**: If you want to build your own trajectory maps, please refer to the box trajectory construction pipeline introduced in [MagicMotion](https://github.com/quanhaol/MagicMotion/tree/main/trajectory_construction#box-trajectory).
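
For illustration, a customized run might look like the sketch below. This assumes the demo script forwards these flags to the underlying inference entry point; the prompt and paths are placeholders:

```bash
# Hypothetical customized invocation; replace the placeholder prompt and paths with your own inputs.
bash running_scripts/inference/i2v_control_fewstep_resnet.sh \
    --prompt "A red balloon drifting across a city skyline" \
    --image ./assets/your_image.png \
    --trajectory ./assets/your_trajectory_maps
```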


## 🏋️ Train


We provide scripts for all three training stages of FlashMotion: training the SlowAdapter, the FastGenerator, and the FastAdapter.


### SlowAdapter Training
In this stage, we first train the SlowAdapter using the mask annotations in MagicData, and then finetune it using bounding boxes as the trajectory-map conditions.
```bash
# Demo training scripts for the SlowAdapter
bash running_scripts/train/stage1_mask.sh
bash running_scripts/train/stage1_box.sh
```


### FastGenerator Training
In this stage, we distill the Wan2.2-TI2V-5B model into a 4-step image-to-video generation model, named the FastGenerator.
```bash
# Demo training script for the FastGenerator
bash running_scripts/train/stage2.sh
```


### FastAdapter Training
In this stage, we train the FastAdapter to fit the FastGenerator and enable few-step trajectory-controllable video generation.
```bash
# Demo training script for the FastAdapter
bash running_scripts/train/stage3.sh
```


## 🤝 Acknowledgements


We would like to express our gratitude to the following open-source projects, which have been instrumental in the development of our work:


- [Wan](https://github.com/Wan-Video/Wan2.2): An open-source base video generation model.
- [Self-Forcing](https://github.com/guandeh17/Self-Forcing) and [CausVid](https://github.com/tianweiy/CausVid): Two frameworks that pioneered the distillation of video generation models.
- [MagicMotion](https://github.com/quanhaol/MagicMotion): An open-source trajectory-controllable video generation framework.
- [Wan2.2-TI2V-5B-Turbo](https://github.com/quanhaol/Wan2.2-TI2V-5B-Turbo): An open-source step-distillation framework that distills the Wan2.2-TI2V-5B image-to-video model into 4 steps.


Special thanks to the contributors of these libraries for their hard work and dedication!


## 📞 Contact


If you have any suggestions or find our work helpful, feel free to contact us.


Email: liqh24@m.fudan.edu.cn


If you find our work useful, <b>please consider giving this GitHub repository a star and citing our paper</b>:


```bibtex
@misc{li2026flashmotionfewstepcontrollablevideo,
      title={FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance},
      author={Quanhao Li and Zhen Xing and Rui Wang and Haidong Cao and Qi Dai and Daoguo Dong and Zuxuan Wu},
      year={2026},
      eprint={2603.12146},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.12146},
}
```