---
license: cc-by-nc-sa-4.0
library_name: pytorch
pipeline_tag: image-to-video
tags:
- image-to-video
- video-generation
- autoregressive-video-generation
- one-step-generation
- adversarial-distillation
- wan
base_model:
- Wan-AI/Wan2.1-T2V-14B
---
# AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation
<p align="center">
<a href="https://github.com/HaobroLi">Haobo Li</a><sup>1,2</sup> ·
<a href="https://zengyh1900.github.io/">Yanhong Zeng</a><sup>2,3,✉</sup> ·
<a href="https://github.com/JaydenLyh">Yunhong Lu</a><sup>4,2</sup> ·
<a href="https://github.com/zhujiapeng">Jiapeng Zhu</a><sup>2</sup> ·
<a href="https://ken-ouyang.github.io/">Hao Ouyang</a><sup>2</sup> ·
<a href="https://github.com/qiuyu96">Qiuyu Wang</a><sup>2</sup> ·
<a href="https://felixcheng97.github.io/">Ka Leong Cheng</a><sup>2</sup> ·
<a href="https://shenyujun.github.io/">Yujun Shen</a><sup>2</sup> ·
<a href="https://zhipengzhang.cn/">Zhipeng Zhang</a><sup>1,5,✉</sup>
</p>
<p align="center">
<sup>1</sup>AutoLab, SAI, SJTU
<sup>2</sup>Ant Group
<sup>3</sup>Department of Automation, Tsinghua University
<sup>4</sup>Zhejiang University
<sup>5</sup>Anyverse Dynamics
</p>
<h2 align="center">
<a href="https://arxiv.org/abs/2606.03972">📄 Paper</a> |
<a href="https://aad-1.github.io/">🌐 Website</a> |
<a href="https://huggingface.co/Watay/AAD-1">🤗 Models</a>
</h2>
We present **AAD-1**, an Asymmetric Adversarial Distillation framework for one-step autoregressive generation with video world models. AAD-1 addresses motion collapse and training instability by combining an asymmetric generator-discriminator design with phased training. The generator remains causal for autoregressive sampling, while a bidirectional video-level discriminator scores full spatiotemporal sequences to detect global temporal failures and long-range drift. A distribution-matching warmup first bootstraps a stable one-step generator before adversarial distillation; with this recipe, AAD-1 achieves state-of-the-art one-step autoregressive video generation results on VBench.

AAD-1 trains a one-step autoregressive generator in three stages. Stage I adapts a pretrained bidirectional video model into a causal generator with ODE initialization. Stage II performs one-step DMD warmup under self-rollout training. Stage III applies asymmetric adversarial refinement: the generator remains causal, while a bidirectional video-level discriminator observes full-video context to penalize temporal drift and motion collapse.
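
The asymmetry in Stage III comes down to which frames each network may attend to. The sketch below illustrates this with boolean attention masks in plain Python. It is not the repository's implementation: the function names and the `frames_per_block` granularity are assumptions made for illustration (setting `frames_per_block=1` gives strict frame-by-frame causality).

```python
def block_causal_mask(num_frames: int, frames_per_block: int = 1) -> list[list[bool]]:
    """Generator-side mask: mask[i][j] is True when frame i may attend to frame j.

    A frame sees every frame in its own block and in all earlier blocks,
    but nothing from later blocks. (Hypothetical helper, for illustration only.)
    """
    return [
        [(j // frames_per_block) <= (i // frames_per_block) for j in range(num_frames)]
        for i in range(num_frames)
    ]


def bidirectional_mask(num_frames: int) -> list[list[bool]]:
    """Discriminator-side mask: every frame may attend to every other frame."""
    return [[True] * num_frames for _ in range(num_frames)]


if __name__ == "__main__":
    # Print both masks for 4 frames grouped into blocks of 2 (1 = visible).
    for name, mask in [
        ("causal generator", block_causal_mask(4, frames_per_block=2)),
        ("bidirectional discriminator", bidirectional_mask(4)),
    ]:
        print(name)
        for row in mask:
            print(" ".join("1" if visible else "0" for visible in row))
```

Under these assumptions the generator never sees future blocks, so it can be sampled autoregressively, while the discriminator sees the whole clip and can therefore penalize failures that only show up across many frames.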
## Progress
- [x] 📝 Technical Report / Paper
- [x] 🌐 Project Homepage
- [x] 💻 Inference Code
- [x] 🤗 Pretrained Checkpoints
## Setup
Clone the repository:
```bash
git clone https://github.com/AutoLab-SAI-SJTU/AAD-1.git
cd AAD-1
```
Install with `uv`:
```bash
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install flash-attn --no-build-isolation
uv pip install -e .
```
Alternatively, use `conda`:
```bash
conda create -n AAD-1 python=3.10 -y
conda activate AAD-1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
python setup.py develop
```
## Checkpoints
Inference with the public release needs only two downloads:
1. 🤗 [Official shared Wan model: Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
2. 🤗 [Released AAD-1 sharded generator checkpoint](https://huggingface.co/Watay/AAD-1)
Download the shared Wan components:
```bash
huggingface-cli download \
Wan-AI/Wan2.1-T2V-14B \
--local-dir-use-symlinks False \
--local-dir wan_models/Wan2.1-T2V-14B
```
If you store the shared Wan components in a different directory, pass that path explicitly with `--wan_model_dir`.
Download the AAD-1 sharded generator checkpoint:
```bash
huggingface-cli download \
Watay/AAD-1 \
--include "14b_i2v_1step_transformer/*" \
--local-dir-use-symlinks False \
--local-dir checkpoints
```
Optional 2-step checkpoint:
```bash
huggingface-cli download \
Watay/AAD-1 \
--include "14b_i2v_2step_transformer/*" \
--local-dir-use-symlinks False \
--local-dir checkpoints
```
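
Before running inference, it can help to confirm that every shard referenced by the index file was actually downloaded. The sketch below assumes the `.index.json` file follows the common Hugging Face sharded-checkpoint layout, with a `weight_map` object that maps parameter names to shard file names. That layout and the helper name `missing_shards` are assumptions; this README does not confirm the file format.

```python
import json
from pathlib import Path


def missing_shards(index_path: str) -> list[str]:
    """Return shard file names listed in the index but absent on disk.

    Assumes (not confirmed by the README) that the index JSON has a
    "weight_map" object mapping parameter names to shard file names
    located next to the index file.
    """
    index_file = Path(index_path)
    with index_file.open(encoding="utf-8") as f:
        index = json.load(f)
    shard_names = sorted(set(index.get("weight_map", {}).values()))
    return [name for name in shard_names if not (index_file.parent / name).is_file()]
```

For example, `missing_shards("checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json")` returns an empty list when all referenced shards are present, under the layout assumed above.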
## Quick Start
Run from the repository root. This command generates an 81-frame video from an input image with the `1step` checkpoint on a single GPU.
```bash
python aad1/inference.py \
--prompt "two people scuba diving in the ocean" \
--image_path assets/examples/scuba_diving_ocean.jpg \
--output_path outputs/aad1_scuba_1step.mp4 \
--checkpoint_path checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json \
--wan_model_dir wan_models/Wan2.1-T2V-14B \
--num_frames 81 \
--seed 1000 \
--denoising_timestep_list 1000
```
Example `2step` command:
```bash
python aad1/inference.py \
--prompt "two people scuba diving in the ocean" \
--image_path assets/examples/scuba_diving_ocean.jpg \
--output_path outputs/aad1_scuba_2step.mp4 \
--checkpoint_path checkpoints/14b_i2v_2step_transformer/self_forcing_generator_bf16.index.json \
--wan_model_dir wan_models/Wan2.1-T2V-14B \
--num_frames 81 \
--seed 1000 \
--denoising_timestep_list 1000,500
```
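
The two commands differ only in the checkpoint and in `--denoising_timestep_list`: one entry (`1000`) for the `1step` checkpoint, two entries (`1000,500`) for the `2step` checkpoint. The sketch below shows one way such a comma-separated value could be parsed and sanity-checked. `parse_timestep_list` is a hypothetical helper, not a function from the repository, and the strictly-decreasing check is an assumption based on the two examples above.

```python
def parse_timestep_list(value: str) -> list[int]:
    """Parse a comma-separated timestep string such as "1000,500".

    Hypothetical helper for illustration. It assumes timesteps run from
    high noise to low noise, so the list must be strictly decreasing.
    """
    steps = [int(part) for part in value.split(",") if part.strip()]
    if not steps:
        raise ValueError("expected at least one timestep")
    if any(a <= b for a, b in zip(steps, steps[1:])):
        raise ValueError("timesteps must be strictly decreasing")
    return steps
```

The length of the parsed list is the number of denoising steps, so it should match the checkpoint in use.
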
More examples, including 20s generation, are in [docs/inference-examples.md](docs/inference-examples.md).
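
When choosing `--num_frames` for other durations, the arithmetic below may be a useful starting point. It assumes the Wan2.1 conventions of 16 fps output and a VAE with 4x temporal compression, so valid frame counts have the form `4k + 1` (81 frames is about 5 seconds). Both defaults and both helper names are assumptions for illustration; the repository's own examples may use different values.

```python
import math


def latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Number of latent frames for a pixel-space frame count of the form stride*k + 1."""
    if num_frames < 1 or (num_frames - 1) % temporal_stride != 0:
        raise ValueError(f"num_frames must be {temporal_stride}*k + 1, got {num_frames}")
    return (num_frames - 1) // temporal_stride + 1


def frames_for_duration(seconds: float, fps: int = 16, temporal_stride: int = 4) -> int:
    """Smallest frame count of the form stride*k + 1 that covers `seconds` of video."""
    target = math.ceil(seconds * fps)
    k = math.ceil(max(target - 1, 0) / temporal_stride)
    return temporal_stride * k + 1
```

Under these assumptions, `frames_for_duration(5)` gives 81 and `latent_frames(81)` gives 21.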
## Acknowledgements
We thank the authors and contributors of [Wan2.1](https://github.com/Wan-Video/Wan2.1), [CausVid](https://github.com/tianweiy/CausVid), [Self Forcing](https://github.com/guandeh17/Self-Forcing), and [FastVideo](https://github.com/hao-ai-lab/FastVideo) for their open research and codebases. AAD-1 builds on these foundations for causal video generation, distillation, and efficient inference.