---
license: mit
datasets:
- quanhaol/MagicData
base_model:
- quanhaol/Wan2.2-TI2V-5B-Turbo
- Wan-AI/Wan2.2-TI2V-5B
tags:
- image-to-video
- Trajectory-Control
- Fewstep-video-gen
---
<br>
<a href="https://arxiv.org/pdf/2603.12146"><img src="https://img.shields.io/static/v1?label=Paper&message=2603.12146&color=red&logo=arxiv"></a>
<a href="https://quanhaol.github.io/flashmotion-site/"><img src="https://img.shields.io/static/v1?label=Project&message=Page&color=green&logo=github-pages"></a>
<a href="https://huggingface.co/quanhaol/FlashMotion"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>
<a href="https://huggingface.co/datasets/quanhaol/FlashBench"><img src="https://img.shields.io/badge/🤗_HuggingFace-Benchmark-ffbd45.svg" alt="HuggingFace"></a>
> **FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance**
> <br>
> [Quanhao Li](https://github.com/quanhaol)<sup>1</sup>, [Zhen Xing](https://chenhsing.github.io/)<sup>1</sup>, [Rui Wang](https://scholar.google.com/citations?user=116smmsAAAAJ&hl=en)<sup>1</sup>, Haidong Cao<sup>1</sup>, [Qi Dai](https://daiqi1989.github.io/)<sup>2</sup>, Daoguo Dong<sup>1</sup> and [Zuxuan Wu](https://zxwu.azurewebsites.net/)<sup>1</sup>
>
> <sup>1</sup> Fudan University; <sup>2</sup> Microsoft Research Asia
## 💡 Abstract
Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories.
However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead.
While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy.
To bridge this gap, we introduce **FlashMotion**, a novel training framework designed for few-step trajectory-controllable video generation.
We first train a trajectory adapter on a multi-step video generator for precise trajectory control.
Then, we distill the generator into a few-step version to accelerate video generation.
Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos.
For evaluation, we introduce **FlashBench**, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects.
Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
## 📣 Updates
- `2026/03/13` 🔥🔥 We released FlashMotion, including its training code, inference code, model weights, and evaluation benchmark.
- `2026/02` 🔥🔥🔥 FlashMotion has been accepted to CVPR 2026!
## 📑 Table of Contents
- [💡 Abstract](#-abstract)
- [📣 Updates](#-updates)
- [📑 Table of Contents](#-table-of-contents)
- [✅ TODO List](#-todo-list)
- [🐍 Installation](#-installation)
- [📦 Model Weights](#-model-weights)
  - [Folder Structure](#folder-structure)
  - [Download Links](#download-links)
- [⚽️ Dataset Prepare](#️-dataset-prepare)
- [🔄 Inference](#-inference)
  - [Scripts](#scripts)
- [🏋️ Train](#️-train)
  - [SlowAdapter Training](#slowadapter-training)
  - [FastGenerator Training](#fastgenerator-training)
  - [FastAdapter Training](#fastadapter-training)
- [🤝 Acknowledgements](#-acknowledgements)
- [📚 Contact](#-contact)
## ✅ TODO List
- [x] Release our inference code and model weights
- [x] Release our training code
- [x] Release our evaluation benchmark
## 🐍 Installation
```bash
# Clone this repository.
git clone https://github.com/quanhaol/FlashMotion
cd FlashMotion
# Install requirements
conda create -n flashmotion python=3.10 -y
conda activate flashmotion
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
python setup.py develop
```
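After installation, a quick import check can catch a missing dependency before a long job is launched. The sketch below is not part of the repository; it assumes that `torch` is among the packages in `requirements.txt` and that `flash-attn` installs under the module name `flash_attn`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names; adjust them to match requirements.txt.
    missing = missing_modules(["torch", "flash_attn"])
    print("missing:", missing if missing else "none")
```

Run it inside the activated `flashmotion` environment; an empty result only means the modules are discoverable, not that they were built against the right CUDA version.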
## 📦 Model Weights
### Folder Structure
```
FlashMotion
└── ckpts
    ├── FastGenerator
    │   └── model.pt
    ├── SlowAdapter
    │   ├── ResNet
    │   │   └── model.pt
    │   └── ControlNet
    │       └── model.pt
    └── FastAdapter
        ├── ResNet
        │   └── model.pt
        └── ControlNet
            └── model.pt
### Download Links
Please use the following commands to download the model weights:
```bash
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download quanhaol/FlashMotion --local-dir ckpts
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir wan_models/Wan2.2-TI2V-5B
```
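Once the download finishes, it can help to confirm that every checkpoint listed in the folder structure is in place. The following is a minimal sketch rather than a script shipped with the repository; the helper name `missing_checkpoints` is hypothetical, and the relative paths are copied from the folder structure above.

```python
from pathlib import Path

# Relative paths taken from the folder structure shown above.
EXPECTED_CHECKPOINTS = [
    "FastGenerator/model.pt",
    "SlowAdapter/ResNet/model.pt",
    "SlowAdapter/ControlNet/model.pt",
    "FastAdapter/ResNet/model.pt",
    "FastAdapter/ControlNet/model.pt",
]


def missing_checkpoints(ckpt_root="ckpts"):
    """Return the expected checkpoint files that are absent under ckpt_root."""
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all checkpoints found" if not missing else f"missing: {missing}")
```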
## ⚽️ Dataset Prepare
All three training stages of FlashMotion use [MagicData](https://huggingface.co/datasets/quanhaol/MagicData), an open-source dataset built for trajectory-controllable video generation.
Please follow [this README](https://huggingface.co/datasets/quanhaol/MagicData) to download the data and extract it to a suitable path on your machine.
The dataset should be organized as follows:
```
MagicData
├── videos
│   ├── videoid_1.mp4
│   ├── videoid_2.mp4
│   ├── ...
├── masks
│   ├── videoid_1
│   │   ├── annotated_frame_00000.png
│   │   ├── annotated_frame_00001.png
│   │   ├── ...
│   ├── videoid_2
│   │   ├── ...
├── boxs
│   ├── videoid_1
│   │   ├── annotated_frame_00000.png
│   │   ├── annotated_frame_00001.png
│   │   ├── ...
│   ├── videoid_2
│   │   ├── ...
├── MagicData.csv # detailed information of each video
```
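Before training, the extracted data can be checked against this layout. The sketch below is an illustration only: it assumes that each annotation folder under `masks` and `boxs` is named after the video file's stem, as the tree suggests, and the function name `find_incomplete_videos` is hypothetical. It does not read `MagicData.csv`, whose columns are not described here.

```python
from pathlib import Path


def _has_frames(folder):
    """True if folder exists and holds at least one annotated frame image."""
    return folder.is_dir() and any(folder.glob("annotated_frame_*.png"))


def find_incomplete_videos(data_root):
    """Map each video id to the annotation kinds ("masks", "boxs") it lacks."""
    root = Path(data_root)
    videos_dir = root / "videos"
    videos = sorted(videos_dir.glob("*.mp4")) if videos_dir.is_dir() else []
    problems = {}
    for video in videos:
        video_id = video.stem  # assumed to match the annotation folder name
        missing = [kind for kind in ("masks", "boxs")
                   if not _has_frames(root / kind / video_id)]
        if missing:
            problems[video_id] = missing
    return problems
```

An empty dictionary means every video found under `videos` has at least one mask frame and one box frame; it does not verify that frame counts match the video length.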
## 🔄 Inference
Inference requires around 42 GiB of GPU memory with the ResNet FastAdapter and around 50 GiB with the ControlNet FastAdapter, both tested on a single NVIDIA A100 GPU.
⚡️⚡️⚡️ Denoising a video takes only 11 seconds with the ResNet adapter and around 24 seconds with the ControlNet adapter.
### Scripts
We provide demo scripts for running both types of trajectory adapter.
```bash
# Demo inference script of each adapter type
bash running_scripts/inference/i2v_control_fewstep_controlnet.sh
bash running_scripts/inference/i2v_control_fewstep_resnet.sh
```
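Given the memory figures quoted above, a small helper can report which demo script fits on the current GPU. This is a hedged sketch, not repository code: the thresholds are the approximate A100 measurements from this README, and the function name `adapters_that_fit` is hypothetical.

```python
# Approximate GPU memory (GiB) reported above for a single NVIDIA A100.
ADAPTER_REQUIREMENTS = [
    ("ResNet", 42, "running_scripts/inference/i2v_control_fewstep_resnet.sh"),
    ("ControlNet", 50, "running_scripts/inference/i2v_control_fewstep_controlnet.sh"),
]


def adapters_that_fit(free_gib):
    """Return (adapter_name, script) pairs whose reported memory need fits in free_gib."""
    return [(name, script)
            for name, required_gib, script in ADAPTER_REQUIREMENTS
            if free_gib >= required_gib]


if __name__ == "__main__":
    for name, script in adapters_that_fit(48):
        print(f"{name}: bash {script}")
```

With PyTorch installed, the free memory can be read as `torch.cuda.mem_get_info()[0] / 2**30`; actual usage may differ on other GPUs, so treat the result as a rough guide.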
We also provide a sample input image and trajectory maps in `./assets`.
Feel free to replace `--prompt`, `--image`, and `--trajectory` with your own prompt, input image, and trajectory maps.
> **Note**: If you want to build your own trajectory maps, please refer to the box trajectory construction pipeline introduced in [MagicMotion](https://github.com/quanhaol/MagicMotion/tree/main/trajectory_construction#box-trajectory).
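As a rough illustration of the geometry behind a box trajectory, the sketch below linearly interpolates a bounding box between a start and an end position over a number of frames. It is an assumption-laden example, not the MagicMotion pipeline: the function name, the `(x1, y1, x2, y2)` pixel convention, and the straight-line motion are all hypothetical, and rendering the boxes into trajectory maps should follow the linked pipeline.

```python
def interpolate_boxes(start_box, end_box, num_frames):
    """Linearly interpolate (x1, y1, x2, y2) boxes from start_box to end_box."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        boxes.append(tuple(round(s + t * (e - s))
                           for s, e in zip(start_box, end_box)))
    return boxes


if __name__ == "__main__":
    # Hypothetical boxes: the object moves right and down over 5 frames.
    for box in interpolate_boxes((0, 0, 100, 100), (200, 80, 300, 180), 5):
        print(box)
```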
## 🏋️ Train
We provide scripts for all three training stages of FlashMotion: training the SlowAdapter, the FastGenerator, and the FastAdapter.
### SlowAdapter Training
In this stage, we first train the SlowAdapter using the mask annotations in MagicData, and then finetune it using bounding boxes as the trajectory map conditions.
```bash
# Demo training script of SlowAdapter
bash running_scripts/train/stage1_mask.sh
bash running_scripts/train/stage1_box.sh
```
### FastGenerator Training
In this stage, we distill the Wan2.2-TI2V-5B model into a 4-step image-to-video generation model, named the FastGenerator.
```bash
# Demo training script of FastGenerator
bash running_scripts/train/stage2.sh
```
### FastAdapter Training
In this stage, we train the FastAdapter to fit the FastGenerator, enabling few-step trajectory-controllable video generation.
```bash
# Demo training script of FastAdapter
bash running_scripts/train/stage3.sh
```
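The three stages above run in a fixed order, which can be sketched as a small driver. This is not a script from the repository: it assumes each stage script is self-contained and that any checkpoint paths handed from one stage to the next are configured inside the scripts themselves, which this README does not specify. The name `run_stages` is hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess

# Stage scripts in the order described above.
TRAINING_STAGES = [
    ("SlowAdapter (mask)", "running_scripts/train/stage1_mask.sh"),
    ("SlowAdapter (box)", "running_scripts/train/stage1_box.sh"),
    ("FastGenerator", "running_scripts/train/stage2.sh"),
    ("FastAdapter", "running_scripts/train/stage3.sh"),
]


def run_stages(stages=TRAINING_STAGES, dry_run=True):
    """Print each stage command in order; execute it too when dry_run is False."""
    commands = []
    for name, script in stages:
        command = ["bash", script]
        commands.append(command)
        print(f"[{name}] {' '.join(command)}")
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failed stage.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    run_stages(dry_run=True)
```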
## 🤝 Acknowledgements
We would like to express our gratitude to the following open-source projects, which have been instrumental in the development of our project:
- [Wan](https://github.com/Wan-Video/Wan2.2): An open-source base video generation model.
- [Self-Forcing](https://github.com/guandeh17/Self-Forcing) and [CausVid](https://github.com/tianweiy/CausVid): Two pioneering frameworks for distilling video generation models.
- [MagicMotion](https://github.com/quanhaol/MagicMotion): An open-source trajectory-controllable video generation framework.
- [Wan2.2-TI2V-5B-Turbo](https://github.com/quanhaol/Wan2.2-TI2V-5B-Turbo): An open-source step-distillation framework that distills the Wan2.2-TI2V-5B image-to-video model into 4 steps.
Special thanks to the contributors of these libraries for their hard work and dedication!
## 📚 Contact
If you have any suggestions or find our work helpful, feel free to contact us.
Email: liqh24@m.fudan.edu.cn
If you find our work useful, <b>please consider giving a star to this GitHub repository and citing it</b>:
```bibtex
@misc{li2026flashmotionfewstepcontrollablevideo,
title={FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance},
author={Quanhao Li and Zhen Xing and Rui Wang and Haidong Cao and Qi Dai and Daoguo Dong and Zuxuan Wu},
year={2026},
eprint={2603.12146},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.12146},
}
```