	---
	license: cc-by-nc-sa-4.0
	library_name: pytorch
	pipeline_tag: image-to-video
	tags:
	- image-to-video
	- video-generation
	- autoregressive-video-generation
	- one-step-generation
	- adversarial-distillation
	- wan
	base_model:
	- Wan-AI/Wan2.1-T2V-14B
	---

	# AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

	<p align="center">
	<a href="https://github.com/HaobroLi">Haobo Li</a><sup>1,2</sup> ·
	<a href="https://zengyh1900.github.io/">Yanhong Zeng</a><sup>2,3,✉</sup> ·
	<a href="https://github.com/JaydenLyh">Yunhong Lu</a><sup>4,2</sup> ·
	<a href="https://github.com/zhujiapeng">Jiapeng Zhu</a><sup>2</sup> ·
	<a href="https://ken-ouyang.github.io/">Hao Ouyang</a><sup>2</sup> ·
	<a href="https://github.com/qiuyu96">Qiuyu Wang</a><sup>2</sup> ·
	<a href="https://felixcheng97.github.io/">Ka Leong Cheng</a><sup>2</sup> ·
	<a href="https://shenyujun.github.io/">Yujun Shen</a><sup>2</sup> ·
	<a href="https://zhipengzhang.cn/">Zhipeng Zhang</a><sup>1,5,✉</sup>
	</p>

	<p align="center">
	<sup>1</sup>AutoLab, SAI, SJTU
	<sup>2</sup>Ant Group
	<sup>3</sup>Department of Automation, Tsinghua University
	<sup>4</sup>Zhejiang University
	<sup>5</sup>Anyverse Dynamics
	</p>

	<h2 align="center">
	<a href="https://arxiv.org/abs/2606.03972">📄 Paper</a> |
	<a href="https://aad-1.github.io/">🌐 Website</a> |
	<a href="https://huggingface.co/Watay/AAD-1">🤗 Models</a>
	</h2>

	We present AAD-1, an Asymmetric Adversarial Distillation framework for one-step autoregressive video world model generation. AAD-1 addresses motion collapse and training instability by combining an asymmetric generator-discriminator design with phased training: the generator remains causal for autoregressive sampling, while a bidirectional video-level discriminator scores full spatiotemporal sequences to detect global temporal failures and long-range drift. A distribution-matching warmup first bootstraps a stable one-step generator before adversarial distillation, enabling state-of-the-art one-step autoregressive video generation on VBench.

	![AAD-1 training pipeline](assets/training_pipeline.png)

	AAD-1 trains a one-step autoregressive generator in three stages. Stage I adapts a pretrained bidirectional video model into a causal generator with ODE initialization. Stage II performs one-step DMD warmup under self-rollout training. Stage III applies asymmetric adversarial refinement: the generator remains causal, while a bidirectional video-level discriminator observes full-video context to penalize temporal drift and motion collapse.
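
The asymmetry in Stage III can be pictured with frame-level attention masks. The sketch below is only an illustration under simplifying assumptions, not the AAD-1 implementation: it works per frame rather than per token or chunk, and `frame_attention_mask` and `show` are hypothetical helpers.

```python
def frame_attention_mask(num_frames: int, causal: bool) -> list[list[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k."""
    return [
        [(k <= q) if causal else True for k in range(num_frames)]
        for q in range(num_frames)
    ]


def show(mask: list[list[bool]]) -> list[str]:
    """Render each mask row as a string: 'x' = visible, '.' = hidden."""
    return ["".join("x" if allowed else "." for allowed in row) for row in mask]


# Causal generator: each frame sees only itself and earlier frames.
generator_mask = frame_attention_mask(4, causal=True)
# Bidirectional discriminator: every frame sees the full video.
discriminator_mask = frame_attention_mask(4, causal=False)

print(show(generator_mask))      # ['x...', 'xx..', 'xxx.', 'xxxx']
print(show(discriminator_mask))  # ['xxxx', 'xxxx', 'xxxx', 'xxxx']
```

Each row is a query frame. The lower-triangular pattern is what allows frame-by-frame autoregressive sampling, while the full pattern gives the discriminator access to the whole clip when it scores temporal consistency.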

	## Progress

	- [x] 📝 Technical Report / Paper
	- [x] 🌐 Project Homepage
	- [x] 💻 Inference Code
	- [x] 🤗 Pretrained Checkpoints

	## Setup

	Clone the repository:

	```bash
	git clone https://github.com/AutoLab-SAI-SJTU/AAD-1.git
	cd AAD-1
	```

	Install with `uv`:

	```bash
	uv venv --python 3.10
	source .venv/bin/activate
	uv pip install -r requirements.txt
	uv pip install flash-attn --no-build-isolation
	uv pip install -e .
	```

	Alternatively, use `conda`:

	```bash
	conda create -n AAD-1 python=3.10 -y
	conda activate AAD-1
	pip install -r requirements.txt
	pip install flash-attn --no-build-isolation
	pip install -e .
	```
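
After installation, a quick check can confirm that the interpreter and key packages are visible. This is a minimal sketch, assuming the environment should run Python 3.10 (as created above) and expose the `torch` and `flash_attn` modules; `missing_modules` is a hypothetical helper, not part of the AAD-1 codebase.

```python
import importlib.util
import sys


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The setup commands above create a Python 3.10 environment.
print("Python version:", ".".join(map(str, sys.version_info[:2])))
# An empty list means both packages are importable (assumed module names).
print("Missing modules:", missing_modules(["torch", "flash_attn"]))
```

Run it inside the activated environment. A non-empty list suggests that one of the install steps did not complete.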

	## Checkpoints

	Inference with the public release needs only two downloads:

	1. 🤗 [Official shared Wan model: Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
	2. 🤗 [Released AAD-1 sharded generator checkpoint](https://huggingface.co/Watay/AAD-1)

	Download the shared Wan components:

	```bash
	huggingface-cli download \
	Wan-AI/Wan2.1-T2V-14B \
	--local-dir-use-symlinks False \
	--local-dir wan_models/Wan2.1-T2V-14B
	```

	If you store the shared Wan components in a different directory, pass that directory to the inference script with `--wan_model_dir`.

	Download the AAD-1 sharded generator checkpoint:

	```bash
	huggingface-cli download \
	Watay/AAD-1 \
	--include "14b_i2v_1step_transformer/*" \
	--local-dir-use-symlinks False \
	--local-dir checkpoints
	```

	Optionally, download the 2-step checkpoint as well:

	```bash
	huggingface-cli download \
	Watay/AAD-1 \
	--include "14b_i2v_2step_transformer/*" \
	--local-dir-use-symlinks False \
	--local-dir checkpoints
	```
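
Before running inference, it may help to confirm that the downloads landed where the commands in this README expect them. The sketch below assumes it is run from the repository root and checks only the two paths that the 1-step Quick Start command uses; `missing_paths` is a hypothetical helper, and the check does not validate the contents of any file.

```python
from pathlib import Path


def missing_paths(root: Path, relative_paths: list[str]) -> list[str]:
    """Return the entries of relative_paths that do not exist under root."""
    return [p for p in relative_paths if not (root / p).exists()]


# Paths taken from the download and Quick Start commands in this README.
EXPECTED = [
    "wan_models/Wan2.1-T2V-14B",
    "checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json",
]

missing = missing_paths(Path("."), EXPECTED)
print("All expected paths found." if not missing else f"Missing: {missing}")
```

If you downloaded the optional 2-step checkpoint, add its `14b_i2v_2step_transformer` index path to `EXPECTED`.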

	## Quick Start

	Run from the repository root. This command generates an 81-frame video from an input image with the `1step` checkpoint on a single GPU.

	```bash
	python aad1/inference.py \
	--prompt "two people scuba diving in the ocean" \
	--image_path assets/examples/scuba_diving_ocean.jpg \
	--output_path outputs/aad1_scuba_1step.mp4 \
	--checkpoint_path checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json \
	--wan_model_dir wan_models/Wan2.1-T2V-14B \
	--num_frames 81 \
	--seed 1000 \
	--denoising_timestep_list 1000
	```

	Example `2step` command:

	```bash
	python aad1/inference.py \
	--prompt "two people scuba diving in the ocean" \
	--image_path assets/examples/scuba_diving_ocean.jpg \
	--output_path outputs/aad1_scuba_2step.mp4 \
	--checkpoint_path checkpoints/14b_i2v_2step_transformer/self_forcing_generator_bf16.index.json \
	--wan_model_dir wan_models/Wan2.1-T2V-14B \
	--num_frames 81 \
	--seed 1000 \
	--denoising_timestep_list 1000,500
	```

	More examples, including 20s generation, are in [docs/inference-examples.md](docs/inference-examples.md).
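
The two commands above differ only in the checkpoint directory, the output name, and the timestep list, so they can be assembled programmatically, for example to sweep over seeds. This is a minimal sketch: `build_inference_command` is a hypothetical helper, every flag and path is copied from the commands above, and the seed suffix in the output name is an added assumption. That the number of timesteps should match the checkpoint's step count is inferred from these two examples, not from the inference code.

```python
import shlex


def build_inference_command(steps: str, timesteps: list[int], seed: int = 1000) -> list[str]:
    """Assemble the README inference command for the "1step" or "2step" checkpoint."""
    return [
        "python", "aad1/inference.py",
        "--prompt", "two people scuba diving in the ocean",
        "--image_path", "assets/examples/scuba_diving_ocean.jpg",
        "--output_path", f"outputs/aad1_scuba_{steps}_seed{seed}.mp4",
        "--checkpoint_path",
        f"checkpoints/14b_i2v_{steps}_transformer/self_forcing_generator_bf16.index.json",
        "--wan_model_dir", "wan_models/Wan2.1-T2V-14B",
        "--num_frames", "81",
        "--seed", str(seed),
        # Comma-separated, as in the 2step example: 1000,500
        "--denoising_timestep_list", ",".join(str(t) for t in timesteps),
    ]


# Print the commands instead of running them (dry run).
for seed in (1000, 1001):
    print(shlex.join(build_inference_command("2step", [1000, 500], seed)))
```

To execute a command instead of printing it, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root.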

	## Acknowledgements

	We thank the authors and contributors of [Wan2.1](https://github.com/Wan-Video/Wan2.1), [CausVid](https://github.com/tianweiy/CausVid), [Self Forcing](https://github.com/guandeh17/Self-Forcing), and [FastVideo](https://github.com/hao-ai-lab/FastVideo) for their open research and codebases. AAD-1 builds on these foundations for causal video generation, distillation, and efficient inference.