metadata
license: cc-by-nc-sa-4.0
library_name: pytorch
pipeline_tag: image-to-video
tags:
  - image-to-video
  - video-generation
  - autoregressive-video-generation
  - one-step-generation
  - adversarial-distillation
  - wan
base_model:
  - Wan-AI/Wan2.1-T2V-14B

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Haobo Li¹,² · Yanhong Zeng²,³,✉ · Yunhong Lu⁴,² · Jiapeng Zhu² · Hao Ouyang² · Qiuyu Wang² · Ka Leong Cheng² · Yujun Shen² · Zhipeng Zhang¹,⁵,✉

¹AutoLab, SAI, SJTU · ²Ant Group · ³Department of Automation, Tsinghua University · ⁴Zhejiang University · ⁵Anyverse Dynamics

📄 Paper | 🌐 Website | 🤗 Models

We present AAD-1, an Asymmetric Adversarial Distillation framework for one-step autoregressive video world models. AAD-1 addresses motion collapse and training instability by combining an asymmetric generator-discriminator design with phased training. The generator remains causal for autoregressive sampling, while a bidirectional video-level discriminator scores full spatiotemporal sequences to detect global temporal failures and long-range drift. A distribution-matching warmup bootstraps a stable one-step generator before adversarial distillation begins. Together, these choices yield state-of-the-art VBench results for one-step autoregressive video generation.

AAD-1 training pipeline

AAD-1 trains a one-step autoregressive generator in three stages. Stage I adapts a pretrained bidirectional video model into a causal generator with ODE initialization. Stage II runs a one-step distribution matching distillation (DMD) warmup under self-rollout training. Stage III applies asymmetric adversarial refinement: the generator remains causal, while a bidirectional video-level discriminator observes the full video to penalize temporal drift and motion collapse.

Progress

  • 📝 Technical Report / Paper
  • 🌐 Project Homepage
  • 💻 Inference Code
  • 🤗 Pretrained Checkpoints

Setup

Clone the repository:

git clone https://github.com/AutoLab-SAI-SJTU/AAD-1.git
cd AAD-1

Install with uv:

uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install flash-attn --no-build-isolation
uv pip install -e .

Alternatively, use conda:

conda create -n AAD-1 python=3.10 -y
conda activate AAD-1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .
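
Before downloading weights, it can help to confirm that the active interpreter matches the one both install paths request. The sketch below assumes Python 3.10 is the intended version only because the commands above create the environment with it. `is_py310` is a hypothetical helper name and is not part of the repository.

```shell
# Hypothetical helper: succeeds when the given `python --version` output
# names a 3.10.x interpreter, the version both install paths above request.
is_py310() {
  case "$1" in
    "Python 3.10"|"Python 3.10."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use inside the activated environment:
#   is_py310 "$(python --version 2>&1)" || echo "expected Python 3.10" >&2
```

With the environment active, running `python -c "import torch, flash_attn"` is one further quick check that both packages import, assuming `requirements.txt` installs PyTorch.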

Checkpoints

Inference with the public release needs only two sets of weights:

  1. 🤗 Official shared Wan model: Wan2.1-T2V-14B
  2. 🤗 Released AAD-1 sharded generator checkpoint

Download the shared Wan components:

huggingface-cli download \
  Wan-AI/Wan2.1-T2V-14B \
  --local-dir-use-symlinks False \
  --local-dir wan_models/Wan2.1-T2V-14B

If you use a custom shared Wan path, pass it explicitly with --wan_model_dir.

Download the AAD-1 sharded generator checkpoint:

huggingface-cli download \
  Watay/AAD-1 \
  --include "14b_i2v_1step_transformer/*" \
  --local-dir-use-symlinks False \
  --local-dir checkpoints

Optional 2-step checkpoint:

huggingface-cli download \
  Watay/AAD-1 \
  --include "14b_i2v_2step_transformer/*" \
  --local-dir-use-symlinks False \
  --local-dir checkpoints
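
A download can stop partway, so a quick layout check before inference may save a failed run. The sketch below looks only for the paths named in this README: the Wan directory from the first download command and the 1-step index file used in the Quick Start command. `check_aad1_layout` is a hypothetical helper, and the check does not confirm that every shard is present.

```shell
# Hypothetical helper: reports expected paths that are missing under the
# root directory "$1" (default: current directory) and fails if any are.
# The paths mirror the download and Quick Start commands in this README.
check_aad1_layout() (
  root="${1:-.}"
  missing=0
  for path in \
    "wan_models/Wan2.1-T2V-14B" \
    "checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json"
  do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $path" >&2
      missing=1
    fi
  done
  # Exit status of the function: 0 when nothing is missing.
  [ "$missing" -eq 0 ]
)
```

Run it from the repository root as `check_aad1_layout .`. If you use the 2-step checkpoint or a custom Wan path, adjust the listed paths to match.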

Quick Start

Run the command below from the repository root. It generates an 81-frame video from an input image with the 1-step checkpoint on a single GPU.

python aad1/inference.py \
  --prompt "two people scuba diving in the ocean" \
  --image_path assets/examples/scuba_diving_ocean.jpg \
  --output_path outputs/aad1_scuba_1step.mp4 \
  --checkpoint_path checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json \
  --wan_model_dir wan_models/Wan2.1-T2V-14B \
  --num_frames 81 \
  --seed 1000 \
  --denoising_timestep_list 1000

Example 2-step command:

python aad1/inference.py \
  --prompt "two people scuba diving in the ocean" \
  --image_path assets/examples/scuba_diving_ocean.jpg \
  --output_path outputs/aad1_scuba_2step.mp4 \
  --checkpoint_path checkpoints/14b_i2v_2step_transformer/self_forcing_generator_bf16.index.json \
  --wan_model_dir wan_models/Wan2.1-T2V-14B \
  --num_frames 81 \
  --seed 1000 \
  --denoising_timestep_list 1000,500
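
The two commands above differ only in the checkpoint directory, the output name, and the timestep list. The sketch below puts that difference in one place. It prints the command instead of running it, so it can be inspected first. Every flag and value is copied from the commands above; only the function name `aad1_cmd` is invented here.

```shell
# Hypothetical helper: prints the inference command for a step variant
# ("1step" or "2step") on a single line, without executing it.
aad1_cmd() (
  variant="$1"
  case "$variant" in
    1step) timesteps="1000" ;;
    2step) timesteps="1000,500" ;;
    *) echo "unknown variant: $variant" >&2; exit 2 ;;
  esac
  echo "python aad1/inference.py" \
    "--prompt \"two people scuba diving in the ocean\"" \
    "--image_path assets/examples/scuba_diving_ocean.jpg" \
    "--output_path outputs/aad1_scuba_${variant}.mp4" \
    "--checkpoint_path checkpoints/14b_i2v_${variant}_transformer/self_forcing_generator_bf16.index.json" \
    "--wan_model_dir wan_models/Wan2.1-T2V-14B" \
    "--num_frames 81" \
    "--seed 1000" \
    "--denoising_timestep_list ${timesteps}"
)
```

To run a variant, pipe the printed command to a shell from the repository root, for example `aad1_cmd 2step | sh`.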

More examples, including 20s generation, are in docs/inference-examples.md.
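
For other durations, a value for `--num_frames` can be estimated from the target length. The sketch below rests on two assumptions that this README does not state: output at 16 frames per second and frame counts of the form 4k + 1, both conventions of the Wan2.1 base model. With those assumptions, 5 seconds gives 81, the value used in the Quick Start. `frames_for_seconds` is a hypothetical helper, and the inference script may impose further constraints, so treat docs/inference-examples.md as the authority.

```shell
# Hypothetical helper: rounds seconds * fps down to a multiple of 4, then
# adds one for the initial frame. "$2" is the frame rate (default 16,
# an assumption based on the Wan2.1 base model).
frames_for_seconds() (
  seconds="$1"
  fps="${2:-16}"
  raw=$((seconds * fps))
  echo $(( raw / 4 * 4 + 1 ))
)

# Example: frames_for_seconds 5 prints 81
```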

Acknowledgements

We thank the authors and contributors of Wan2.1, CausVid, Self Forcing, and FastVideo for their open research and codebases. AAD-1 builds on these foundations for causal video generation, distillation, and efficient inference.