---
license: mit
datasets:
- quanhaol/MagicData
base_model:
- quanhaol/Wan2.2-TI2V-5B-Turbo
- Wan-AI/Wan2.2-TI2V-5B
tags:
- image-to-video
- Trajectory-Control
- Fewstep-video-gen
---

<br>
<a href="https://arxiv.org/pdf/2603.12146"><img src="https://img.shields.io/static/v1?label=Paper&message=2603.12146&color=red&logo=arxiv"></a>
<a href="https://quanhaol.github.io/flashmotion-site/"><img src="https://img.shields.io/static/v1?label=Project&message=Page&color=green&logo=github-pages"></a>
<a href="https://huggingface.co/quanhaol/FlashMotion"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>
<a href="https://huggingface.co/datasets/quanhaol/FlashBench"><img src="https://img.shields.io/badge/🤗_HuggingFace-Benchmark-ffbd45.svg" alt="HuggingFace"></a>


> **FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance**
> <br>
> [Quanhao Li](https://github.com/quanhaol)<sup>1</sup>, [Zhen Xing](https://chenhsing.github.io/)<sup>1</sup>, [Rui Wang](https://scholar.google.com/citations?user=116smmsAAAAJ&hl=en)<sup>1</sup>, Haidong Cao<sup>1</sup>, [Qi Dai](https://daiqi1989.github.io/)<sup>2</sup>, Daoguo Dong<sup>1</sup>, and [Zuxuan Wu](https://zxwu.azurewebsites.net/)<sup>1</sup>
>
> <sup>1</sup> Fudan University; <sup>2</sup> Microsoft Research Asia


## 💡 Abstract

Recent advances in trajectory-controllable video generation have achieved remarkable progress, with most prior methods using adapter-based architectures for precise motion control along predefined trajectories.
However, all of these methods rely on a multi-step denoising process, which introduces substantial time redundancy and computational overhead.
While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying these approaches to trajectory-controllable video generation causes noticeable degradation in both video quality and trajectory accuracy.
To bridge this gap, we introduce **FlashMotion**, a novel training framework for few-step trajectory-controllable video generation.
We first train a trajectory adapter on a multi-step video generator for precise trajectory control.
We then distill the generator into a few-step version to accelerate video generation.
Finally, we finetune the adapter with a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos.
For evaluation, we introduce **FlashBench**, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects.
Experiments on two adapter architectures show that FlashMotion surpasses both existing video distillation methods and previous multi-step models in visual quality and trajectory consistency.


## 📣 Updates
- `2026/03/13` 🔥🔥 We released FlashMotion, including the training code, inference code, model weights, and the evaluation benchmark.
- `2026/02` 🔥🔥🔥 FlashMotion has been accepted to CVPR 2026!


## 📑 Table of Contents

- [💡 Abstract](#-abstract)
- [📣 Updates](#-updates)
- [📑 Table of Contents](#-table-of-contents)
- [✅ TODO List](#-todo-list)
- [🛠 Installation](#-installation)
- [📦 Model Weights](#-model-weights)
  - [Folder Structure](#folder-structure)
  - [Download Links](#download-links)
- [⚽️ Dataset Preparation](#️-dataset-preparation)
- [🚀 Inference](#-inference)
  - [Scripts](#scripts)
- [🏋️ Train](#️-train)
  - [SlowAdapter Training](#slowadapter-training)
  - [FastGenerator Training](#fastgenerator-training)
  - [FastAdapter Training](#fastadapter-training)
- [🤝 Acknowledgements](#-acknowledgements)
- [📞 Contact](#-contact)


## ✅ TODO List

- [x] Release our inference code and model weights
- [x] Release our training code
- [x] Release our evaluation benchmark


## 🛠 Installation

```bash
# Clone this repository.
git clone https://github.com/quanhaol/FlashMotion
cd FlashMotion

# Install the requirements.
conda create -n flashmotion python=3.10 -y
conda activate flashmotion
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
python setup.py develop
```
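
As a quick sanity check (a suggestion of ours rather than an official setup step; it assumes PyTorch is pulled in by `requirements.txt`), you can confirm that the environment sees your GPU and that `flash-attn` imports cleanly:

```bash
# Optional environment check: print the torch version, CUDA availability, and flash-attn version.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
```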


## 📦 Model Weights


### Folder Structure


```
FlashMotion
└── ckpts
    ├── FastGenerator
    │   └── model.pt
    ├── SlowAdapter
    │   ├── ResNet
    │   │   └── model.pt
    │   └── ControlNet
    │       └── model.pt
    └── FastAdapter
        ├── ResNet
        │   └── model.pt
        └── ControlNet
            └── model.pt
```


### Download Links


Please use the following commands to download the model weights:


```bash
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download quanhaol/FlashMotion --local-dir ckpts
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir wan_models/Wan2.2-TI2V-5B
```
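
After the downloads finish, you can verify that the checkpoints match the layout shown in [Folder Structure](#folder-structure) (again a quick check we suggest, not an official step):

```bash
# List the downloaded FlashMotion checkpoints and the Wan2.2-TI2V-5B base model files.
find ckpts -name "model.pt"
ls wan_models/Wan2.2-TI2V-5B
```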


## ⚽️ Dataset Preparation
All three training stages of FlashMotion use [MagicData](https://huggingface.co/datasets/quanhaol/MagicData), an open-source dataset built for trajectory-controllable video generation.
Please follow [this README](https://huggingface.co/datasets/quanhaol/MagicData) to download and extract the data to a suitable path on your machine.


The dataset should be organized as follows:
```
MagicData
├── videos
│   ├── videoid_1.mp4
│   ├── videoid_2.mp4
│   └── ...
├── masks
│   ├── videoid_1
│   │   ├── annotated_frame_00000.png
│   │   ├── annotated_frame_00001.png
│   │   └── ...
│   └── videoid_2
│       └── ...
├── boxs
│   ├── videoid_1
│   │   ├── annotated_frame_00000.png
│   │   ├── annotated_frame_00001.png
│   │   └── ...
│   └── videoid_2
│       └── ...
└── MagicData.csv  # detailed information about each video
```
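
Once extracted, a few shell commands can sanity-check the layout (the paths below assume MagicData sits in the repository root; adjust them to wherever you placed the data):

```bash
# Count the videos, peek at the CSV header, and confirm per-video mask folders exist.
ls MagicData/videos | wc -l
head -n 1 MagicData/MagicData.csv
ls MagicData/masks | head -n 3
```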


## 🚀 Inference
Inference requires around 42 GiB of GPU memory with the ResNet FastAdapter and around 50 GiB with the ControlNet FastAdapter, both measured on a single NVIDIA A100 GPU.


⚡️⚡️⚡️ Denoising a video takes only 11 seconds with the ResNet adapter and around 24 seconds with the ControlNet adapter.


### Scripts


We provide demo scripts to run both types of trajectory adapter.
```bash
# Demo inference script for each adapter type
bash running_scripts/inference/i2v_control_fewstep_controlnet.sh
bash running_scripts/inference/i2v_control_fewstep_resnet.sh
```
We also provide a sample input image and trajectory maps in `./assets`.


Feel free to replace `--prompt`, `--image`, and `--trajectory` with your own prompt, input image, and trajectory maps.
> **Note**: If you want to build your own trajectory maps, please refer to the box trajectory construction pipeline introduced in [MagicMotion](https://github.com/quanhaol/MagicMotion/tree/main/trajectory_construction#box-trajectory).
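
For illustration, a customized run might look like the sketch below. This assumes the demo script forwards these flags to the underlying inference entry point; the prompt and paths are placeholders:

```bash
# Hypothetical customized invocation; replace the placeholder prompt and paths with your own inputs.
bash running_scripts/inference/i2v_control_fewstep_resnet.sh \
    --prompt "A red balloon drifting across a city skyline" \
    --image ./assets/your_image.png \
    --trajectory ./assets/your_trajectory_maps
```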


## 🏋️ Train


We provide scripts for all three training stages of FlashMotion: training the SlowAdapter, the FastGenerator, and the FastAdapter.


### SlowAdapter Training
In this stage, we first train the SlowAdapter using the mask annotations in MagicData, and then finetune it using bounding boxes as the trajectory-map conditions.
```bash
# Demo training scripts for the SlowAdapter
bash running_scripts/train/stage1_mask.sh
bash running_scripts/train/stage1_box.sh
```


### FastGenerator Training
In this stage, we distill the Wan2.2-TI2V-5B model into a 4-step image-to-video generation model, named the FastGenerator.
```bash
# Demo training script for the FastGenerator
bash running_scripts/train/stage2.sh
```


### FastAdapter Training
In this stage, we train the FastAdapter to fit the FastGenerator and enable few-step trajectory-controllable video generation.
```bash
# Demo training script for the FastAdapter
bash running_scripts/train/stage3.sh
```


## 🤝 Acknowledgements


We would like to express our gratitude to the following open-source projects, which have been instrumental in the development of our work:


- [Wan](https://github.com/Wan-Video/Wan2.2): An open-source base video generation model.
- [Self-Forcing](https://github.com/guandeh17/Self-Forcing) and [CausVid](https://github.com/tianweiy/CausVid): Two frameworks that pioneered the distillation of video generation models.
- [MagicMotion](https://github.com/quanhaol/MagicMotion): An open-source trajectory-controllable video generation framework.
- [Wan2.2-TI2V-5B-Turbo](https://github.com/quanhaol/Wan2.2-TI2V-5B-Turbo): An open-source step-distillation framework that distills the Wan2.2-TI2V-5B image-to-video model into 4 steps.


Special thanks to the contributors of these libraries for their hard work and dedication!


## 📞 Contact


If you have any suggestions or find our work helpful, feel free to contact us.


Email: liqh24@m.fudan.edu.cn


If you find our work useful, <b>please consider giving this GitHub repository a star and citing our paper</b>:


```bibtex
@misc{li2026flashmotionfewstepcontrollablevideo,
      title={FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance},
      author={Quanhao Li and Zhen Xing and Rui Wang and Haidong Cao and Qi Dai and Daoguo Dong and Zuxuan Wu},
      year={2026},
      eprint={2603.12146},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.12146},
}
```