---
license: mit
tags:
- video-diffusion
- diffusion-forcing
- autoregressive-video
- world-models
---

# Clean-Context Autoregressive Video Diffusion: Checkpoints

Trained denoisers from *What Matters in Clean-Context Autoregressive Video Diffusion*. Each checkpoint is a Diffusion-Forcing video denoiser trained on 32-frame windows of 64×64 RGB; see the paper for the full setup and the [code repository](https://github.com/jwei302/cct) for training and evaluation.

## Main checkpoints (16)

Four training configurations, **DF** (Diffusion Forcing baseline), **Mask-only** (masked prefix loss), **Clean-only** (clean prefix), and **Full** (clean prefix + masked loss), are crossed with two denoiser backbones and two datasets. Naming: `{backbone}_{dataset}_{config}.ckpt`.

| File | Backbone | Dataset | Config |
|---|---|---|---|
| `unet_dmlab_df.ckpt` | U-Net | DMLab | DF |
| `unet_dmlab_mask_only.ckpt` | U-Net | DMLab | Mask-only |
| `unet_dmlab_clean_only.ckpt` | U-Net | DMLab | Clean-only |
| `unet_dmlab_full.ckpt` | U-Net | DMLab | Full |
| `unet_minecraft_df.ckpt` | U-Net | Minecraft | DF |
| `unet_minecraft_mask_only.ckpt` | U-Net | Minecraft | Mask-only |
| `unet_minecraft_clean_only.ckpt` | U-Net | Minecraft | Clean-only |
| `unet_minecraft_full.ckpt` | U-Net | Minecraft | Full |
| `dit_dmlab_df.ckpt` | DiT | DMLab | DF |
| `dit_dmlab_mask_only.ckpt` | DiT | DMLab | Mask-only |
| `dit_dmlab_clean_only.ckpt` | DiT | DMLab | Clean-only |
| `dit_dmlab_full.ckpt` | DiT | DMLab | Full |
| `dit_minecraft_df.ckpt` | DiT | Minecraft | DF |
| `dit_minecraft_mask_only.ckpt` | DiT | Minecraft | Mask-only |
| `dit_minecraft_clean_only.ckpt` | DiT | Minecraft | Clean-only |
| `dit_minecraft_full.ckpt` | DiT | Minecraft | Full |

All sixteen share the same optimiser, schedule, and diffusion settings (AdamW, lr 8×10⁻⁵, 100k steps, batch 8, fp16, cosine schedule with K=1000, v-prediction; DDIM with 100 steps, η=0 at inference); only the backbone, the dataset, and the training configuration
differ. The U-Net is the 3D-convolutional Diffusion-Forcing backbone (≈18.65 M params); the DiT is a strictly frame-causal diffusion transformer (≈18.84 M params).

## Ablation checkpoints (`ablation/`)

The causal-GroupNorm ablation (paper §4 / Table 4): the U-Net trained on DMLab to a matched 30k-step budget, with the standard (leaky) temporal GroupNorm versus a frame-causal variant, under DF and Full.

| File | GroupNorm | Config |
|---|---|---|
| `ablation/unet_dmlab_leaky_df.ckpt` | leaky (standard) | DF |
| `ablation/unet_dmlab_leaky_full.ckpt` | leaky (standard) | Full |
| `ablation/unet_dmlab_framecausal_df.ckpt` | frame-causal | DF |
| `ablation/unet_dmlab_framecausal_full.ckpt` | frame-causal | Full |

## Loading

These are PyTorch-Lightning checkpoints. Load them with the matching config from the [code repository](https://github.com/jwei302/cct); the U-Net and DiT backbones and all training settings are specified there.
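
Before wiring a checkpoint into the repository's model classes, it can help to inspect the file on its own. The sketch below is not the repository's loading code. It rebuilds the file names from the naming scheme described above and reads a checkpoint that has already been downloaded to a local path, which is an assumption. The helper names (`main_checkpoint_names`, `ablation_checkpoint_names`, `summarize_checkpoint`) are hypothetical. It relies only on the usual PyTorch-Lightning checkpoint layout, a dictionary with a `state_dict` entry and typically a `global_step` entry. The element count covers every tensor stored in the file, so it may include buffers or other non-parameter tensors and need not equal the parameter counts quoted above.

```python
from itertools import product

# Name components taken from the checkpoint tables in this README.
BACKBONES = ("unet", "dit")
DATASETS = ("dmlab", "minecraft")
CONFIGS = ("df", "mask_only", "clean_only", "full")


def main_checkpoint_names():
    """Return the 16 main file names following {backbone}_{dataset}_{config}.ckpt."""
    return [f"{b}_{d}_{c}.ckpt" for b, d, c in product(BACKBONES, DATASETS, CONFIGS)]


def ablation_checkpoint_names():
    """Return the 4 causal-GroupNorm ablation file names under ablation/."""
    return [
        f"ablation/unet_dmlab_{norm}_{cfg}.ckpt"
        for norm, cfg in product(("leaky", "framecausal"), ("df", "full"))
    ]


def summarize_checkpoint(path):
    """Summarize a locally downloaded Lightning checkpoint (hypothetical helper).

    Assumes the standard Lightning layout: a dict with a "state_dict" entry.
    """
    import torch  # imported lazily so the name helpers work without PyTorch

    # weights_only=False unpickles arbitrary Python objects, which Lightning
    # checkpoints may contain; only use it on files from a source you trust.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    state_dict = ckpt["state_dict"]
    return {
        "global_step": ckpt.get("global_step"),
        "num_tensors": len(state_dict),
        # Counts every stored tensor element, buffers included.
        "num_elements": sum(t.numel() for t in state_dict.values()),
    }
```

For example, after downloading `unet_dmlab_full.ckpt`, calling `summarize_checkpoint("unet_dmlab_full.ckpt")` reports the stored step count and tensor totals. Building the actual denoiser still requires the matching config and model classes from the code repository.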