---
license: mit
tags:
- video-diffusion
- diffusion-forcing
- autoregressive-video
- world-models
---
# Clean-Context Autoregressive Video Diffusion — Checkpoints
Trained denoisers from *What Matters in Clean-Context Autoregressive Video Diffusion*. Each checkpoint is a Diffusion-Forcing video denoiser trained on
32-frame windows of 64×64 RGB; see the paper for the full setup and the
[code repository](https://github.com/jwei302/cct) for training and evaluation.
## Main checkpoints (16)
Four training configurations — **DF** (Diffusion Forcing baseline),
**Mask-only** (masked prefix loss), **Clean-only** (clean prefix), and
**Full** (clean prefix + masked loss) — crossed with two denoiser backbones
and two datasets. Naming: `{backbone}_{dataset}_{config}.ckpt`.
| File | Backbone | Dataset | Config |
|---|---|---|---|
| `unet_dmlab_df.ckpt` | U-Net | DMLab | DF |
| `unet_dmlab_mask_only.ckpt` | U-Net | DMLab | Mask-only |
| `unet_dmlab_clean_only.ckpt` | U-Net | DMLab | Clean-only |
| `unet_dmlab_full.ckpt` | U-Net | DMLab | Full |
| `unet_minecraft_df.ckpt` | U-Net | Minecraft | DF |
| `unet_minecraft_mask_only.ckpt` | U-Net | Minecraft | Mask-only |
| `unet_minecraft_clean_only.ckpt` | U-Net | Minecraft | Clean-only |
| `unet_minecraft_full.ckpt` | U-Net | Minecraft | Full |
| `dit_dmlab_df.ckpt` | DiT | DMLab | DF |
| `dit_dmlab_mask_only.ckpt` | DiT | DMLab | Mask-only |
| `dit_dmlab_clean_only.ckpt` | DiT | DMLab | Clean-only |
| `dit_dmlab_full.ckpt` | DiT | DMLab | Full |
| `dit_minecraft_df.ckpt` | DiT | Minecraft | DF |
| `dit_minecraft_mask_only.ckpt` | DiT | Minecraft | Mask-only |
| `dit_minecraft_clean_only.ckpt` | DiT | Minecraft | Clean-only |
| `dit_minecraft_full.ckpt` | DiT | Minecraft | Full |
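
The naming scheme above can be expanded programmatically, for example to check that a local download is complete. This is a minimal sketch: `expected_checkpoints` is a hypothetical helper written for this README, not a function from the code repository.

```python
from itertools import product

# Lowercase tokens as they appear in the filenames of the table above.
BACKBONES = ["unet", "dit"]
DATASETS = ["dmlab", "minecraft"]
CONFIGS = ["df", "mask_only", "clean_only", "full"]


def expected_checkpoints():
    """Return the 16 main checkpoint filenames, in the same order as the table."""
    return [
        f"{backbone}_{dataset}_{config}.ckpt"
        for backbone, dataset, config in product(BACKBONES, DATASETS, CONFIGS)
    ]


if __name__ == "__main__":
    for name in expected_checkpoints():
        print(name)
```

Comparing this list against the files in a local directory is a quick way to spot a missing checkpoint.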

All sixteen share the same optimiser, schedule, and diffusion settings
(AdamW, lr 8×10⁻⁵, 100k steps, batch 8, fp16, cosine schedule with K=1000,
v-prediction; DDIM with 100 steps, η=0 at inference); only the architecture
and the training configuration differ. The U-Net is the 3D-convolutional
Diffusion-Forcing backbone (≈18.65 M params); the DiT is a strictly
frame-causal diffusion transformer (≈18.84 M params).
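
For readers unfamiliar with the diffusion settings listed above, the sketch below shows one common way they fit together. It rests on several assumptions that this README does not confirm: that "cosine schedule with K=1000" refers to the noise schedule, that it takes the widely used squared-cosine form with a small offset `s`, that the v-prediction target is the usual `sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0`, and that the 100 DDIM steps are evenly spaced over the K levels. The function names are illustrative; the code repository is the authority on the actual implementation.

```python
import math

K = 1000          # number of noise levels (from the README)
DDIM_STEPS = 100  # sampling steps at inference (from the README)


def alpha_bar(k, num_levels=K, s=0.008):
    """Cumulative signal fraction at level k in [0, num_levels].

    Assumed squared-cosine form with offset s; the repository may differ.
    """
    def f(t):
        return math.cos((t / num_levels + s) / (1 + s) * math.pi / 2) ** 2

    return f(k) / f(0)


def v_target(x0, eps, k):
    """Assumed v-prediction target for scalar clean value x0 and noise eps."""
    signal = math.sqrt(alpha_bar(k))
    noise = math.sqrt(1.0 - alpha_bar(k))
    return signal * eps - noise * x0


def ddim_timesteps(num_levels=K, steps=DDIM_STEPS):
    """Evenly spaced noise levels in descending order (assumed spacing)."""
    stride = num_levels // steps
    return list(range(num_levels - 1, -1, -stride))
```

With η=0 the DDIM update is deterministic, so for a fixed initial noise the sampler visits these levels in order and produces the same output each time.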
## Ablation checkpoints (`ablation/`)
These checkpoints belong to the causal-GroupNorm ablation (paper §4 / Table 4).
The U-Net is trained on DMLab for a matched 30k-step budget with either the
standard (leaky) temporal GroupNorm or a frame-causal variant, under both the
DF and Full configurations.
| File | GroupNorm | Config |
|---|---|---|
| `ablation/unet_dmlab_leaky_df.ckpt` | leaky (standard) | DF |
| `ablation/unet_dmlab_leaky_full.ckpt` | leaky (standard) | Full |
| `ablation/unet_dmlab_framecausal_df.ckpt` | frame-causal | DF |
| `ablation/unet_dmlab_framecausal_full.ckpt` | frame-causal | Full |
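
Because the ablation files carry an extra GroupNorm token in their names, a script that handles both groups of checkpoints needs to tell the two layouts apart. The sketch below is one way to do that; `parse_checkpoint_name` is a hypothetical helper, and the `ablation/{backbone}_{dataset}_{groupnorm}_{config}.ckpt` pattern is inferred from the table above rather than stated by the authors.

```python
from pathlib import PurePosixPath

# Config tokens used in the filenames; two of them contain an underscore,
# so the config is matched as a suffix instead of splitting blindly on "_".
CONFIGS = ("mask_only", "clean_only", "full", "df")


def parse_checkpoint_name(path):
    """Split a checkpoint path into backbone, dataset, groupnorm, and config."""
    p = PurePosixPath(path)
    stem = p.stem
    config = next((c for c in CONFIGS if stem.endswith("_" + c)), None)
    if config is None:
        raise ValueError(f"unrecognised config in {path!r}")
    head = stem[: -(len(config) + 1)].split("_")
    if p.parent.name == "ablation":
        if len(head) != 3:
            raise ValueError(f"unexpected ablation name {path!r}")
        backbone, dataset, groupnorm = head
    else:
        if len(head) != 2:
            raise ValueError(f"unexpected checkpoint name {path!r}")
        backbone, dataset = head
        groupnorm = None  # main checkpoints have no GroupNorm token
    return {
        "backbone": backbone,
        "dataset": dataset,
        "groupnorm": groupnorm,
        "config": config,
    }
```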
## Loading
These are PyTorch-Lightning checkpoints. Load them with the matching config
from the [code repository](https://github.com/jwei302/cct); the U-Net and
DiT backbones and all training settings are specified there.
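
Before building a model from the repository config, it can help to inspect a checkpoint directly. The sketch below relies only on two general facts: a PyTorch-Lightning checkpoint is a dictionary saved with `torch.save`, and its weights live under the `state_dict` key. Both function names are hypothetical, and the `"model."` prefix in the usage note is an illustrative assumption, not the repository's actual key layout.

```python
def load_lightning_state_dict(ckpt_path):
    """Load a Lightning checkpoint on CPU and return its state dict."""
    import torch  # imported lazily so the helper below works without torch

    # weights_only=False unpickles arbitrary objects, which Lightning
    # checkpoints may contain; only use it on files you trust.
    ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
    return ckpt["state_dict"]


def strip_prefix(state_dict, prefix):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```

A typical use is `state = load_lightning_state_dict("unet_dmlab_full.ckpt")`, followed by printing a few keys to see how the module is nested. If the weights sit under a wrapper attribute, `strip_prefix(state, "model.")` yields keys that can be passed to `load_state_dict` on the bare backbone.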