---
license: cc-by-nc-sa-4.0
library_name: pytorch
pipeline_tag: image-to-video
tags:
- image-to-video
- video-generation
- autoregressive-video-generation
- one-step-generation
- adversarial-distillation
- wan
base_model:
- Wan-AI/Wan2.1-T2V-14B
---

# AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Haobo Li¹,² · Yanhong Zeng²,³,✉ · Yunhong Lu⁴,² · Jiapeng Zhu² · Hao Ouyang² · Qiuyu Wang² · Ka Leong Cheng² · Yujun Shen² · Zhipeng Zhang¹,⁵,✉

¹AutoLab, SAI, SJTU ²Ant Group ³Department of Automation, Tsinghua University ⁴Zhejiang University ⁵Anyverse Dynamics

📄 Paper | 🌐 Website | 🤗 Models

We present **AAD-1**, an Asymmetric Adversarial Distillation framework for one-step autoregressive video world model generation. AAD-1 addresses motion collapse and training instability by combining an asymmetric generator-discriminator design with phased training: the generator remains causal for autoregressive sampling, while a bidirectional video-level discriminator scores full spatiotemporal sequences to detect global temporal failures and long-range drift. A distribution-matching warmup first bootstraps a stable one-step generator before adversarial distillation, enabling state-of-the-art one-step autoregressive video generation on VBench.

![AAD-1 training pipeline](assets/training_pipeline.png)

AAD-1 trains a one-step autoregressive generator in three stages. Stage I adapts a pretrained bidirectional video model into a causal generator with ODE initialization. Stage II performs one-step DMD warmup under self-rollout training. Stage III applies asymmetric adversarial refinement: the generator remains causal, while a bidirectional video-level discriminator observes full-video context to penalize temporal drift and motion collapse.

## Progress

- [x] 📝 Technical Report / Paper
- [x] 🌐 Project Homepage
- [x] 💻 Inference Code
- [x] 🤗 Pretrained Checkpoints

## Setup

Clone the repository:

```bash
git clone https://github.com/AutoLab-SAI-SJTU/AAD-1.git
cd AAD-1
```

Install with `uv`:

```bash
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install flash-attn --no-build-isolation
uv pip install -e .
```

Alternatively, use `conda`:

```bash
conda create -n AAD-1 python=3.10 -y
conda activate AAD-1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
python setup.py develop
```

## Checkpoints

The public release path only needs:

1. 🤗 [Official shared Wan model: Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
2. 🤗 [Released AAD-1 sharded generator checkpoint](https://huggingface.co/Watay/AAD-1)

Download the shared Wan components:

```bash
huggingface-cli download \
  Wan-AI/Wan2.1-T2V-14B \
  --local-dir-use-symlinks False \
  --local-dir wan_models/Wan2.1-T2V-14B
```

If you use a custom shared Wan path, pass it explicitly with `--wan_model_dir`.

Download the AAD-1 sharded generator checkpoint:

```bash
huggingface-cli download \
  Watay/AAD-1 \
  --include "14b_i2v_1step_transformer/*" \
  --local-dir-use-symlinks False \
  --local-dir checkpoints
```

Optional 2-step checkpoint:

```bash
huggingface-cli download \
  Watay/AAD-1 \
  --include "14b_i2v_2step_transformer/*" \
  --local-dir-use-symlinks False \
  --local-dir checkpoints
```

## Quick Start

Run from the repository root. This command generates an 81-frame video from an input image with the `1step` checkpoint on a single GPU.

```bash
python aad1/inference.py \
  --prompt "two people scuba diving in the ocean" \
  --image_path assets/examples/scuba_diving_ocean.jpg \
  --output_path outputs/aad1_scuba_1step.mp4 \
  --checkpoint_path checkpoints/14b_i2v_1step_transformer/self_forcing_generator_bf16.index.json \
  --wan_model_dir wan_models/Wan2.1-T2V-14B \
  --num_frames 81 \
  --seed 1000 \
  --denoising_timestep_list 1000
```

Example `2step` command:

```bash
python aad1/inference.py \
  --prompt "two people scuba diving in the ocean" \
  --image_path assets/examples/scuba_diving_ocean.jpg \
  --output_path outputs/aad1_scuba_2step.mp4 \
  --checkpoint_path checkpoints/14b_i2v_2step_transformer/self_forcing_generator_bf16.index.json \
  --wan_model_dir wan_models/Wan2.1-T2V-14B \
  --num_frames 81 \
  --seed 1000 \
  --denoising_timestep_list 1000,500
```

More examples, including 20s generation, are in [docs/inference-examples.md](docs/inference-examples.md).

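
The `1step` and `2step` commands above differ only in the checkpoint directory, the output file name, and the `--denoising_timestep_list` value. The following is a minimal Python sketch that assembles either command from a variant name, assuming only the flags and paths shown in this README. `build_command` and `TIMESTEPS` are hypothetical names used for illustration and are not part of the AAD-1 codebase; the derived output file name is also an assumption.

```python
import shlex
from pathlib import Path

# Timestep schedules copied from the two example commands in this README.
TIMESTEPS = {"1step": "1000", "2step": "1000,500"}


def build_command(variant, prompt, image_path,
                  output_dir="outputs",
                  checkpoint_root="checkpoints",
                  wan_model_dir="wan_models/Wan2.1-T2V-14B",
                  num_frames=81, seed=1000):
    """Return the argument list for aad1/inference.py (hypothetical helper)."""
    if variant not in TIMESTEPS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(TIMESTEPS)}"
        )
    # Checkpoint layout as downloaded by the huggingface-cli commands above.
    checkpoint = (Path(checkpoint_root)
                  / f"14b_i2v_{variant}_transformer"
                  / "self_forcing_generator_bf16.index.json")
    # Assumed naming scheme: aad1_<image stem>_<variant>.mp4
    output = Path(output_dir) / f"aad1_{Path(image_path).stem}_{variant}.mp4"
    return [
        "python", "aad1/inference.py",
        "--prompt", prompt,
        "--image_path", str(image_path),
        "--output_path", output.as_posix(),
        "--checkpoint_path", checkpoint.as_posix(),
        "--wan_model_dir", wan_model_dir,
        "--num_frames", str(num_frames),
        "--seed", str(seed),
        "--denoising_timestep_list", TIMESTEPS[variant],
    ]


if __name__ == "__main__":
    # Print both commands in copy-pasteable shell form.
    for variant in TIMESTEPS:
        cmd = build_command(
            variant,
            "two people scuba diving in the ocean",
            "assets/examples/scuba_diving_ocean.jpg",
        )
        print(shlex.join(cmd))
```

To run a command instead of printing it, the returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.
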
## Acknowledgements

We thank the authors and contributors of [Wan2.1](https://github.com/Wan-Video/Wan2.1), [CausVid](https://github.com/tianweiy/CausVid), [Self Forcing](https://github.com/guandeh17/Self-Forcing), and [FastVideo](https://github.com/hao-ai-lab/FastVideo) for their open research and codebases. AAD-1 builds on these foundations for causal video generation, distillation, and efficient inference.