---
license: mit
tags:
- video-diffusion
- diffusion-forcing
- autoregressive-video
- world-models
---

# Clean-Context Autoregressive Video Diffusion: Checkpoints

These are the trained denoisers from *What Matters in Clean-Context Autoregressive Video Diffusion*. Each checkpoint is a Diffusion-Forcing video denoiser trained on
32-frame windows of 64×64 RGB frames. See the paper for the full setup and the
[code repository](https://github.com/jwei302/cct) for training and evaluation code.

## Main checkpoints (16)

Four training configurations are crossed with two denoiser backbones and two
datasets, giving 4 × 2 × 2 = 16 checkpoints. The configurations are **DF**
(Diffusion Forcing baseline), **Mask-only** (masked prefix loss),
**Clean-only** (clean prefix), and **Full** (clean prefix + masked loss).
Files are named `{backbone}_{dataset}_{config}.ckpt`.

| File | Backbone | Dataset | Config |
|---|---|---|---|
| `unet_dmlab_df.ckpt` | U-Net | DMLab | DF |
| `unet_dmlab_mask_only.ckpt` | U-Net | DMLab | Mask-only |
| `unet_dmlab_clean_only.ckpt` | U-Net | DMLab | Clean-only |
| `unet_dmlab_full.ckpt` | U-Net | DMLab | Full |
| `unet_minecraft_df.ckpt` | U-Net | Minecraft | DF |
| `unet_minecraft_mask_only.ckpt` | U-Net | Minecraft | Mask-only |
| `unet_minecraft_clean_only.ckpt` | U-Net | Minecraft | Clean-only |
| `unet_minecraft_full.ckpt` | U-Net | Minecraft | Full |
| `dit_dmlab_df.ckpt` | DiT | DMLab | DF |
| `dit_dmlab_mask_only.ckpt` | DiT | DMLab | Mask-only |
| `dit_dmlab_clean_only.ckpt` | DiT | DMLab | Clean-only |
| `dit_dmlab_full.ckpt` | DiT | DMLab | Full |
| `dit_minecraft_df.ckpt` | DiT | Minecraft | DF |
| `dit_minecraft_mask_only.ckpt` | DiT | Minecraft | Mask-only |
| `dit_minecraft_clean_only.ckpt` | DiT | Minecraft | Clean-only |
| `dit_minecraft_full.ckpt` | DiT | Minecraft | Full |

All sixteen share the same optimiser, schedule, and diffusion settings
(AdamW, lr 8×10⁻⁵, 100k steps, batch 8, fp16, cosine schedule with K=1000,
v-prediction; DDIM with 100 steps, η=0 at inference); only the backbone,
dataset, and training configuration differ. The U-Net is the 3D-convolutional
Diffusion-Forcing backbone (≈18.65 M params); the DiT is a strictly
frame-causal diffusion transformer (≈18.84 M params).

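The naming scheme `{backbone}_{dataset}_{config}.ckpt` used in this section can be enumerated and parsed programmatically, which is handy when sweeping evaluations over all sixteen files. The following is a minimal sketch using only the Python standard library; the function names `main_checkpoint_names` and `parse_checkpoint_name` are illustrative and do not come from the code repository.

```python
from itertools import product

# Name components as they appear in the checkpoint table of this card.
BACKBONES = ("unet", "dit")
DATASETS = ("dmlab", "minecraft")
CONFIGS = ("df", "mask_only", "clean_only", "full")


def main_checkpoint_names():
    """Return the 16 main checkpoint filenames, in the same order as the table."""
    return [
        f"{backbone}_{dataset}_{config}.ckpt"
        for backbone, dataset, config in product(BACKBONES, DATASETS, CONFIGS)
    ]


def parse_checkpoint_name(filename):
    """Split a main checkpoint filename into (backbone, dataset, config)."""
    stem = filename.rsplit(".", 1)[0]  # drop the ".ckpt" extension
    # maxsplit=2 keeps multi-word configs such as "mask_only" in one piece.
    backbone, dataset, config = stem.split("_", 2)
    return backbone, dataset, config
```

For example, `parse_checkpoint_name("dit_dmlab_mask_only.ckpt")` yields `("dit", "dmlab", "mask_only")`. The parser only covers the main checkpoints; the files under `ablation/` carry an extra GroupNorm field in their names.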
## Ablation checkpoints (`ablation/`)

These checkpoints come from the causal-GroupNorm ablation (paper §4, Table 4).
The U-Net is trained on DMLab for a matched 30k-step budget with either the
standard (leaky) temporal GroupNorm or a frame-causal variant, under both the
DF and Full configurations.

| File | GroupNorm | Config |
|---|---|---|
| `ablation/unet_dmlab_leaky_df.ckpt` | leaky (standard) | DF |
| `ablation/unet_dmlab_leaky_full.ckpt` | leaky (standard) | Full |
| `ablation/unet_dmlab_framecausal_df.ckpt` | frame-causal | DF |
| `ablation/unet_dmlab_framecausal_full.ckpt` | frame-causal | Full |

## Loading

These are PyTorch-Lightning checkpoints. Load them with the matching config
from the [code repository](https://github.com/jwei302/cct); the U-Net and
DiT backbones and all training settings are specified there.

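Before wiring a checkpoint into the repository's config, it can be useful to open it as a plain dictionary and check its contents. The sketch below assumes PyTorch is installed and that the file has already been downloaded to a local path. The helper names `load_lightning_checkpoint` and `count_parameters` are illustrative and are not part of the code repository. Lightning stores the model weights under the `state_dict` key; how those weights map onto the U-Net or DiT classes is defined by the repository, not by this snippet.

```python
from pathlib import Path


def load_lightning_checkpoint(path):
    """Load a PyTorch-Lightning .ckpt file as a plain dict on the CPU."""
    import torch  # imported lazily so count_parameters works without torch

    # weights_only=False unpickles arbitrary Python objects stored next to the
    # weights, so only use it on files from a source you trust.
    return torch.load(Path(path), map_location="cpu", weights_only=False)


def count_parameters(state_dict):
    """Sum the element counts of every tensor-like entry in a state_dict."""
    return sum(v.numel() for v in state_dict.values() if hasattr(v, "numel"))


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the checkpoint was saved.
    ckpt_path = Path("unet_dmlab_full.ckpt")
    if ckpt_path.exists():
        ckpt = load_lightning_checkpoint(ckpt_path)
        print(sorted(ckpt.keys()))
        print(count_parameters(ckpt["state_dict"]))
```

The count includes buffers as well as trainable weights, so it may differ slightly from the parameter figures quoted on this card. If unpickling fails with a missing-module error, the checkpoint probably stores objects from packages used in training, and installing the repository's dependencies should resolve it.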