license: cc-by-nc-sa-4.0
library_name: pytorch
pipeline_tag: image-to-video
tags:
- image-to-video
- video-generation
- autoregressive-video-generation
- one-step-generation
- adversarial-distillation
- wan
base_model:
- Wan-AI/Wan2.1-T2V-14B
AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation
Haobo Li1,2 · Yanhong Zeng2,3,✉ · Yunhong Lu4,2 · Jiapeng Zhu2 · Hao Ouyang2 · Qiuyu Wang2 · Ka Leong Cheng2 · Yujun Shen2 · Zhipeng Zhang1,5,✉
1AutoLab, SAI, SJTU 2Ant Group 3Department of Automation, Tsinghua University 4Zhejiang University 5Anyverse Dynamics
📄 Paper | 🌐 Website | 🤗 Models
We present AAD-1, an Asymmetric Adversarial Distillation framework for one-step autoregressive video world model generation. AAD-1 addresses motion collapse and training instability by combining an asymmetric generator-discriminator design with phased training: the generator remains causal for autoregressive sampling, while a bidirectional video-level discriminator scores full spatiotemporal sequences to detect global temporal failures and long-range drift. A distribution-matching warmup first bootstraps a stable one-step generator before adversarial distillation, enabling state-of-the-art one-step autoregressive video generation on VBench.
AAD-1 trains a one-step autoregressive generator in three stages. Stage I adapts a pretrained bidirectional video model into a causal generator with ODE initialization. Stage II performs one-step DMD warmup under self-rollout training. Stage III applies asymmetric adversarial refinement: the generator remains causal, while a bidirectional video-level discriminator observes full-video context to penalize temporal drift and motion collapse.
Progress
- 📝 Technical Report / Paper
- 🌐 Project Homepage
- 💻 Inference Code
- 🤗 Pretrained Checkpoints
Setup
Clone the repository:
git clone https://github.com/AutoLab-SAI-SJTU/AAD-1.git
cd AAD-1
Install with uv:
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install flash-attn --no-build-isolation
uv pip install -e .
Alternatively, use conda:
conda create -n AAD-1 python=3.10 -y
conda activate AAD-1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .
Checkpoints
The public release path only needs the shared Wan components and the AAD-1 generator checkpoint.
Download the shared Wan components:
huggingface-cli download \
Wan-AI/Wan2.1-T2V-14B \
--local-dir-use-symlinks False \
--local-dir wan_models/Wan2.1-T2V-14B
If you store the shared Wan components in a different directory, pass that path explicitly with --wan_model_dir.
Download the AAD-1 sharded generator checkpoint:
huggingface-cli download \
Watay/AAD-1 \
--include "14b_i2v_1step_transformer/*" \
--local-dir-use-symlinks False \
--local-dir checkpoints
Optional 2-step checkpoint:
huggingface-cli download \
Watay/AAD-1 \
--include "14b_i2v_2step_transformer/*" \
--local-dir-use-symlinks False \
--local-dir checkpoints
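Before running inference, it can help to confirm that both downloads landed where the commands above put them. The sketch below is an illustrative helper, not part of the AAD-1 repository: the function name `check_aad1_layout` is hypothetical, and the default paths assume the `--local-dir` values used above and the index file name that appears in the Quick Start command.

```shell
# Illustrative pre-flight check; check_aad1_layout is a hypothetical helper,
# not something shipped with the AAD-1 repository.
# Defaults assume the --local-dir values from the download commands above.
check_aad1_layout() {
  wan_dir="${1:-wan_models/Wan2.1-T2V-14B}"
  ckpt_index="${2:-checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json}"
  status=0
  # The shared Wan components are expected to be a directory.
  if [ ! -d "$wan_dir" ]; then
    echo "missing Wan directory: $wan_dir" >&2
    status=1
  fi
  # The sharded generator checkpoint is addressed through its index file.
  if [ ! -f "$ckpt_index" ]; then
    echo "missing checkpoint index: $ckpt_index" >&2
    status=1
  fi
  return "$status"
}

if check_aad1_layout; then
  echo "checkpoint layout looks complete"
else
  echo "checkpoint layout is incomplete; re-run the downloads above"
fi
```

To check the optional 2-step checkpoint or a custom Wan path, pass the directory and index file as the first and second arguments.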
Quick Start
Run from the repository root. This command generates an 81-frame video from an input image with the 1-step checkpoint on a single GPU.
python aad1/inference.py \
--prompt "two people scuba diving in the ocean" \
--image_path assets/examples/scuba_diving_ocean.jpg \
--output_path outputs/aad1_scuba_1step.mp4 \
--checkpoint_path checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json \
--wan_model_dir wan_models/Wan2.1-T2V-14B \
--num_frames 81 \
--seed 1000 \
--denoising_timestep_list 1000
Example 2-step command, which uses the 2-step checkpoint and two denoising timesteps:
python aad1/inference.py \
--prompt "two people scuba diving in the ocean" \
--image_path assets/examples/scuba_diving_ocean.jpg \
--output_path outputs/aad1_scuba_2step.mp4 \
--checkpoint_path checkpoints/14b_i2v_2step_transformer/self_forcing_generator_bf16.index.json \
--wan_model_dir wan_models/Wan2.1-T2V-14B \
--num_frames 81 \
--seed 1000 \
--denoising_timestep_list 1000,500
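To compare several samples of the same prompt, the single-timestep command can be wrapped in a small loop. This is a sketch under assumptions: `run_seed` and the `DRY_RUN` switch are hypothetical helpers introduced here, every flag is copied from the Quick Start command, and only `--seed` and the output file name vary. By default the block only prints the commands; setting `DRY_RUN=0` executes them and should be done from the repository root.

```shell
# Illustrative seed sweep; run_seed and DRY_RUN are hypothetical helpers,
# not part of the AAD-1 repository. All flags come from the Quick Start command.
run_seed() {
  seed="$1"
  # Store the full command in the positional parameters so the prompt stays one argument.
  set -- python aad1/inference.py \
    --prompt "two people scuba diving in the ocean" \
    --image_path assets/examples/scuba_diving_ocean.jpg \
    --output_path "outputs/aad1_scuba_1step_seed${seed}.mp4" \
    --checkpoint_path checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json \
    --wan_model_dir wan_models/Wan2.1-T2V-14B \
    --num_frames 81 \
    --seed "$seed" \
    --denoising_timestep_list 1000
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: print the command for inspection (the printout drops the quotes around the prompt).
    echo "$*"
  else
    "$@"
  fi
}

for s in 1000 1001 1002; do
  run_seed "$s"
done
```

Holding every flag except the seed fixed keeps the resulting videos directly comparable.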
More examples, including 20s generation, are in docs/inference-examples.md.
Acknowledgements
We thank the authors and contributors of Wan2.1, CausVid, Self Forcing, and FastVideo for their open research and codebases. AAD-1 builds on these foundations for causal video generation, distillation, and efficient inference.
