---
license: cc-by-nc-sa-4.0
library_name: pytorch
pipeline_tag: image-to-video
tags:
  - image-to-video
  - video-generation
  - autoregressive-video-generation
  - one-step-generation
  - adversarial-distillation
  - wan
base_model:
  - Wan-AI/Wan2.1-T2V-14B
---

# AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

<p align="center">
  <a href="https://github.com/HaobroLi">Haobo Li</a><sup>1,2</sup>
  <a href="https://zengyh1900.github.io/">Yanhong Zeng</a><sup>2,3,&#9993;</sup>
  <a href="https://github.com/JaydenLyh">Yunhong Lu</a><sup>4,2</sup>
  <a href="https://github.com/zhujiapeng">Jiapeng Zhu</a><sup>2</sup>
  <a href="https://ken-ouyang.github.io/">Hao Ouyang</a><sup>2</sup>
  <a href="https://github.com/qiuyu96">Qiuyu Wang</a><sup>2</sup>
  <a href="https://felixcheng97.github.io/">Ka Leong Cheng</a><sup>2</sup>
  <a href="https://shenyujun.github.io/">Yujun Shen</a><sup>2</sup>
  <a href="https://zhipengzhang.cn/">Zhipeng Zhang</a><sup>1,5,&#9993;</sup>
</p>

<p align="center">
  <sup>1</sup>AutoLab, SAI, SJTU
  <sup>2</sup>Ant Group
  <sup>3</sup>Department of Automation, Tsinghua University
  <sup>4</sup>Zhejiang University
  <sup>5</sup>Anyverse Dynamics
</p>

<h2 align="center">
  <a href="https://arxiv.org/abs/2606.03972">📄 Paper</a> |
  <a href="https://aad-1.github.io/">🌐 Website</a> |
  <a href="https://huggingface.co/Watay/AAD-1">🤗 Models</a>
</h2>

We present **AAD-1**, an Asymmetric Adversarial Distillation framework for one-step generation in autoregressive video world models. AAD-1 addresses motion collapse and training instability by combining an asymmetric generator-discriminator design with phased training. The generator remains causal for autoregressive sampling, while a bidirectional video-level discriminator scores full spatiotemporal sequences to detect global temporal failures and long-range drift. A distribution-matching warmup first bootstraps a stable one-step generator before adversarial distillation begins. With this recipe, we report state-of-the-art VBench results for one-step autoregressive video generation.

![AAD-1 training pipeline](assets/training_pipeline.png)

AAD-1 trains a one-step autoregressive generator in three stages. Stage I adapts a pretrained bidirectional video model into a causal generator with ODE initialization. Stage II performs one-step DMD warmup under self-rollout training. Stage III applies asymmetric adversarial refinement: the generator remains causal, while a bidirectional video-level discriminator observes full-video context to penalize temporal drift and motion collapse.
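The asymmetry in Stage III can be pictured as two attention patterns over the same frames. The snippet below is a minimal sketch, not the released implementation: it assumes frame-level masking (the actual model may mask at chunk or token granularity), and the function names are hypothetical.

```python
from __future__ import annotations


def causal_frame_mask(num_frames: int) -> list[list[bool]]:
    """mask[i][j] is True when frame i may attend to frame j.

    Generator side: each frame sees only itself and earlier frames.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def bidirectional_frame_mask(num_frames: int) -> list[list[bool]]:
    """Discriminator side: every frame sees the whole video."""
    return [[True] * num_frames for _ in range(num_frames)]


def show(mask: list[list[bool]]) -> str:
    """Render a mask as rows of 1s and 0s for quick inspection."""
    return "\n".join("".join("1" if v else "0" for v in row) for row in mask)


print(show(causal_frame_mask(3)))
# 100
# 110
# 111
print(show(bidirectional_frame_mask(3)))
# 111
# 111
# 111
```

Because the discriminator's mask has no zeros, it can compare late frames against early ones, which is what lets it penalize drift that a causal critic would only see one step at a time.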

## Progress

- [x] 📝 Technical Report / Paper
- [x] 🌐 Project Homepage
- [x] 💻 Inference Code
- [x] 🤗 Pretrained Checkpoints

## Setup

Clone the repository:

```bash
git clone https://github.com/AutoLab-SAI-SJTU/AAD-1.git
cd AAD-1
```

Install with `uv`:

```bash
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install flash-attn --no-build-isolation
uv pip install -e .
```

Alternatively, use `conda`:

```bash
conda create -n AAD-1 python=3.10 -y
conda activate AAD-1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .
```

## Checkpoints

Inference with the public release needs only two downloads:

1. 🤗 [Official shared Wan model: Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
2. 🤗 [Released AAD-1 sharded generator checkpoint](https://huggingface.co/Watay/AAD-1)

Download the shared Wan components:

```bash
huggingface-cli download \
  Wan-AI/Wan2.1-T2V-14B \
  --local-dir-use-symlinks False \
  --local-dir wan_models/Wan2.1-T2V-14B
```

If you store the shared Wan components in a different directory, pass that directory explicitly with `--wan_model_dir` when running inference.

Download the AAD-1 sharded generator checkpoint:

```bash
huggingface-cli download \
  Watay/AAD-1 \
  --include "14b_i2v_1step_transformer/*" \
  --local-dir-use-symlinks False \
  --local-dir checkpoints
```

Optional 2-step checkpoint:

```bash
huggingface-cli download \
  Watay/AAD-1 \
  --include "14b_i2v_2step_transformer/*" \
  --local-dir-use-symlinks False \
  --local-dir checkpoints
```
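Before running inference, it can help to confirm that every shard of a checkpoint finished downloading. The helper below is a sketch under one loudly stated assumption: that the `.index.json` file follows the common Hugging Face sharded-checkpoint layout, with a `weight_map` object mapping parameter names to shard filenames. The function name `missing_shards` and the demo file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def missing_shards(index_path):
    """Return shard filenames named in the index but absent on disk."""
    index_path = Path(index_path)
    with index_path.open() as f:
        index = json.load(f)
    # Assumption: "weight_map" maps parameter names to shard filenames.
    shard_names = sorted(set(index["weight_map"].values()))
    return [n for n in shard_names if not (index_path.parent / n).exists()]


# Self-contained demo on a throwaway directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "shard-1.safetensors").touch()
    index_file = root / "demo.index.json"
    index_file.write_text(json.dumps({
        "weight_map": {
            "a.weight": "shard-1.safetensors",
            "b.weight": "shard-2.safetensors",
        }
    }))
    print(missing_shards(index_file))  # ['shard-2.safetensors']
```

If the assumption about the index layout holds, calling `missing_shards("checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json")` on a complete download should return an empty list.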

## Quick Start

Run from the repository root. This command generates an 81-frame video from an input image with the `1step` checkpoint on a single GPU.

```bash
python aad1/inference.py \
  --prompt "two people scuba diving in the ocean" \
  --image_path assets/examples/scuba_diving_ocean.jpg \
  --output_path outputs/aad1_scuba_1step.mp4 \
  --checkpoint_path checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json \
  --wan_model_dir wan_models/Wan2.1-T2V-14B \
  --num_frames 81 \
  --seed 1000 \
  --denoising_timestep_list 1000
```
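As a rough guide to what `--num_frames 81` means in latent space, the sketch below assumes the 4x temporal compression of the Wan2.1 VAE, under which valid frame counts have the form 4k + 1. Whether `aad1/inference.py` enforces this constraint is not stated here, so treat the check as illustrative; `latent_frames` is a hypothetical helper.

```python
def latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Map a pixel-space frame count to a latent frame count.

    Assumes the first frame is encoded alone and each later latent
    frame covers `temporal_stride` pixel frames.
    """
    if num_frames < 1 or (num_frames - 1) % temporal_stride != 0:
        raise ValueError(
            f"num_frames should be {temporal_stride}*k + 1, got {num_frames}"
        )
    return (num_frames - 1) // temporal_stride + 1


print(latent_frames(81))  # 21
```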

Example `2step` command:

```bash
python aad1/inference.py \
  --prompt "two people scuba diving in the ocean" \
  --image_path assets/examples/scuba_diving_ocean.jpg \
  --output_path outputs/aad1_scuba_2step.mp4 \
  --checkpoint_path checkpoints/14b_i2v_2step_transformer/self_forcing_generator_bf16.index.json \
  --wan_model_dir wan_models/Wan2.1-T2V-14B \
  --num_frames 81 \
  --seed 1000 \
  --denoising_timestep_list 1000,500
```
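The two commands differ only in the checkpoint and in `--denoising_timestep_list`: one entry for the `1step` model, two for the `2step` model. The toy rollout below sketches how such a list could drive autoregressive sampling, with one generator call per timestep per chunk. It is an illustration under assumptions, not the logic of `aad1/inference.py`: `parse_timesteps`, `rollout`, and `fake_denoise` are hypothetical names, and strings stand in for tensors.

```python
from __future__ import annotations

from typing import Callable, List


def parse_timesteps(spec: str) -> List[int]:
    """Parse a comma-separated list such as "1000,500"."""
    steps = [int(part) for part in spec.split(",") if part.strip()]
    if not steps:
        raise ValueError("expected at least one timestep")
    if any(a <= b for a, b in zip(steps, steps[1:])):
        raise ValueError("timesteps should be strictly decreasing")
    return steps


def rollout(num_chunks: int, timesteps: List[int], denoise: Callable) -> list:
    """Generate chunks in order; each chunk is conditioned on earlier ones."""
    history: list = []
    for _ in range(num_chunks):
        sample = None  # stands in for fresh noise
        for t in timesteps:
            sample = denoise(sample, t, history)
        history.append(sample)
    return history


calls = []


def fake_denoise(sample, t, history):
    # Records each call; a real generator would run the network here.
    calls.append(t)
    return f"chunk{len(history)}@t{t}"


print(rollout(2, parse_timesteps("1000,500"), fake_denoise))
# ['chunk0@t500', 'chunk1@t500']
print(len(calls))  # 4
```

Under this reading, the `1step` setting makes one generator call per chunk and the `2step` setting makes two.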

More examples, including 20s generation, are in [docs/inference-examples.md](docs/inference-examples.md).

## Acknowledgements

We thank the authors and contributors of [Wan2.1](https://github.com/Wan-Video/Wan2.1), [CausVid](https://github.com/tianweiy/CausVid), [Self Forcing](https://github.com/guandeh17/Self-Forcing), and [FastVideo](https://github.com/hao-ai-lab/FastVideo) for their open research and codebases. AAD-1 builds on these foundations for causal video generation, distillation, and efficient inference.