---
license: mit
tags:
  - video-diffusion
  - diffusion-forcing
  - autoregressive-video
  - world-models
---

# Clean-Context Autoregressive Video Diffusion — Checkpoints

Trained denoisers from *What Matters in Clean-Context Autoregressive Video Diffusion*. Each checkpoint is a Diffusion-Forcing video denoiser trained on
32-frame windows of 64×64 RGB; see the paper for the full setup and the
[code repository](https://github.com/jwei302/cct) for training and evaluation.

## Main checkpoints (16)

Four training configurations — **DF** (Diffusion Forcing baseline),
**Mask-only** (masked prefix loss), **Clean-only** (clean prefix), and
**Full** (clean prefix + masked loss) — crossed with two denoiser backbones
and two datasets. Naming: `{backbone}_{dataset}_{config}.ckpt`.

| File | Backbone | Dataset | Config |
|---|---|---|---|
| `unet_dmlab_df.ckpt` | U-Net | DMLab | DF |
| `unet_dmlab_mask_only.ckpt` | U-Net | DMLab | Mask-only |
| `unet_dmlab_clean_only.ckpt` | U-Net | DMLab | Clean-only |
| `unet_dmlab_full.ckpt` | U-Net | DMLab | Full |
| `unet_minecraft_df.ckpt` | U-Net | Minecraft | DF |
| `unet_minecraft_mask_only.ckpt` | U-Net | Minecraft | Mask-only |
| `unet_minecraft_clean_only.ckpt` | U-Net | Minecraft | Clean-only |
| `unet_minecraft_full.ckpt` | U-Net | Minecraft | Full |
| `dit_dmlab_df.ckpt` | DiT | DMLab | DF |
| `dit_dmlab_mask_only.ckpt` | DiT | DMLab | Mask-only |
| `dit_dmlab_clean_only.ckpt` | DiT | DMLab | Clean-only |
| `dit_dmlab_full.ckpt` | DiT | DMLab | Full |
| `dit_minecraft_df.ckpt` | DiT | Minecraft | DF |
| `dit_minecraft_mask_only.ckpt` | DiT | Minecraft | Mask-only |
| `dit_minecraft_clean_only.ckpt` | DiT | Minecraft | Clean-only |
| `dit_minecraft_full.ckpt` | DiT | Minecraft | Full |
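
The naming pattern can be used to enumerate or parse the main checkpoint filenames programmatically. The snippet below is a minimal sketch; the helper names (`checkpoint_name`, `parse_checkpoint_name`, `ALL_MAIN`) are illustrative and are not part of the code repository.

```python
from itertools import product

# Lower-case tokens exactly as they appear in the filenames in the table above.
BACKBONES = ("unet", "dit")
DATASETS = ("dmlab", "minecraft")
CONFIGS = ("df", "mask_only", "clean_only", "full")


def checkpoint_name(backbone: str, dataset: str, config: str) -> str:
    """Build a filename following {backbone}_{dataset}_{config}.ckpt."""
    return f"{backbone}_{dataset}_{config}.ckpt"


def parse_checkpoint_name(filename: str) -> tuple[str, str, str]:
    """Split a main-checkpoint filename into (backbone, dataset, config)."""
    stem = filename.removesuffix(".ckpt")
    # Split on the first two underscores only, because configs such as
    # "mask_only" contain an underscore themselves.
    backbone, dataset, config = stem.split("_", 2)
    return backbone, dataset, config


# All sixteen main checkpoints, in the same order as the table.
ALL_MAIN = [checkpoint_name(*combo) for combo in product(BACKBONES, DATASETS, CONFIGS)]
```

This applies only to the main checkpoints; files under `ablation/` carry an extra GroupNorm token and do not follow the same three-part pattern.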

All sixteen share the same optimiser, schedule, and diffusion settings
(AdamW, lr 8×10⁻⁵, 100k steps, batch 8, fp16, cosine schedule with K=1000,
v-prediction; DDIM with 100 steps, η=0 at inference); only the backbone,
the dataset, and the training configuration differ. The U-Net is the 3D-convolutional
Diffusion-Forcing backbone (≈18.65 M params); the DiT is a strictly
frame-causal diffusion transformer (≈18.84 M params).
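
For orientation, the diffusion settings can be written out in a few lines. This is a sketch under assumptions, not the repository's implementation: it reads "cosine schedule with K=1000" as a cosine noise schedule over 1000 diffusion steps, uses the offset `S = 0.008` from the commonly used cosine schedule, and assumes evenly spaced DDIM timesteps. Check the code repository for the exact definitions.

```python
import math

K = 1000   # number of diffusion steps, from the settings above
S = 0.008  # cosine-schedule offset (assumed; not stated in this card)


def alpha_bar(k: int) -> float:
    """Cumulative signal level after k of K noising steps (cosine schedule)."""
    def f(t: int) -> float:
        return math.cos((t / K + S) / (1 + S) * math.pi / 2) ** 2
    return f(k) / f(0)


def v_target(x0: float, eps: float, k: int) -> float:
    """v-prediction target: sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a = alpha_bar(k)
    return math.sqrt(a) * eps - math.sqrt(1.0 - a) * x0


def ddim_timesteps(num_steps: int = 100) -> list[int]:
    """Evenly spaced, descending subset of the K steps (spacing is assumed)."""
    stride = K // num_steps
    return list(range(K - 1, -1, -stride))
```

With η=0 the DDIM update adds no fresh noise, so sampling from a fixed initial noise is deterministic.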

## Ablation checkpoints (`ablation/`)

The causal-GroupNorm ablation (paper §4 / Table 4): the U-Net trained on
DMLab to a matched 30k-step budget, with the standard (leaky) temporal
GroupNorm versus a frame-causal variant, under DF and Full.

| File | GroupNorm | Config |
|---|---|---|
| `ablation/unet_dmlab_leaky_df.ckpt` | leaky (standard) | DF |
| `ablation/unet_dmlab_leaky_full.ckpt` | leaky (standard) | Full |
| `ablation/unet_dmlab_framecausal_df.ckpt` | frame-causal | DF |
| `ablation/unet_dmlab_framecausal_full.ckpt` | frame-causal | Full |
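
To illustrate what the ablation varies, the toy example below contrasts normalisation statistics pooled over the whole temporal window with statistics computed per frame. It is a pure-Python sketch with a single group and no affine parameters, and per-frame statistics are only one possible way to make the layer frame-causal; the variant actually used is defined in the paper and code repository.

```python
import statistics


def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalisation of a flat list of values."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def leaky_groupnorm(frames):
    """Statistics pooled over every frame in the window, time included."""
    flat = [v for frame in frames for v in frame]
    out = normalize(flat)
    n = len(frames[0])
    return [out[i * n:(i + 1) * n] for i in range(len(frames))]


def framewise_groupnorm(frames):
    """Statistics computed separately per frame, so no frame sees another."""
    return [normalize(frame) for frame in frames]


clip = [[0.0, 1.0, 2.0], [1.0, 3.0, 5.0], [2.0, 2.5, 3.0]]
edited = clip[:-1] + [[9.0, -4.0, 7.0]]  # change only the last (future) frame

# Does editing the last frame change the normalised output of the first frame?
leaks = leaky_groupnorm(clip)[0] != leaky_groupnorm(edited)[0]
causal = framewise_groupnorm(clip)[0] == framewise_groupnorm(edited)[0]
print(leaks, causal)
```

In the pooled version, editing a later frame shifts the shared mean and variance and therefore changes the output for the first frame; with per-frame statistics the first frame is unaffected.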

## Loading

These are PyTorch-Lightning checkpoints. Load them with the matching config
from the [code repository](https://github.com/jwei302/cct); the U-Net and
DiT backbones and all training settings are specified there.
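
Before building the model, it can be useful to inspect a checkpoint file directly. The sketch below relies only on the standard PyTorch-Lightning checkpoint layout (a dictionary with a `state_dict` entry plus metadata such as `global_step` and `epoch`); the helper names `load_raw` and `summarize` are illustrative and not part of the code repository.

```python
from typing import Any, Mapping


def load_raw(path: str) -> dict:
    """Read a checkpoint file onto the CPU without constructing the model."""
    import torch  # imported lazily so summarize() works without PyTorch

    # Lightning checkpoints carry non-tensor metadata, hence weights_only=False.
    # This unpickles arbitrary objects, so only use it on files you trust.
    return torch.load(path, map_location="cpu", weights_only=False)


def summarize(ckpt: Mapping[str, Any]) -> dict:
    """Collect basic facts about a Lightning-style checkpoint dictionary."""
    state = ckpt["state_dict"]
    return {
        "global_step": ckpt.get("global_step"),
        "epoch": ckpt.get("epoch"),
        # First component of each key, e.g. the attribute name of a submodule.
        "top_level_modules": sorted({key.split(".", 1)[0] for key in state}),
        "num_tensors": len(state),
        # Counts every stored tensor, buffers included, so this can exceed
        # the trainable-parameter counts quoted above.
        "num_elements": sum(t.numel() for t in state.values()),
    }
```

For example, `summarize(load_raw("unet_dmlab_full.ckpt"))` should report the stored step count and the key prefixes that the matching config in the code repository is expected to reproduce.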