	---
	license: mit
	tags:
	- video-diffusion
	- diffusion-forcing
	- autoregressive-video
	- world-models
	---

	# Clean-Context Autoregressive Video Diffusion — Checkpoints

	Trained denoisers from the paper *What Matters in Clean-Context Autoregressive
	Video Diffusion*. Each checkpoint is a Diffusion-Forcing video denoiser trained
	on 32-frame windows of 64×64 RGB; see the paper for the full setup and the
	[code repository](https://github.com/jwei302/cct) for training and evaluation.

	## Main checkpoints (16)

	Four training configurations — DF (Diffusion Forcing baseline),
	Mask-only (masked prefix loss), Clean-only (clean prefix), and
	Full (clean prefix + masked loss) — crossed with two denoiser backbones
	and two datasets. Naming: `{backbone}_{dataset}_{config}.ckpt`.

	| File | Backbone | Dataset | Config |
	|---|---|---|---|
	| `unet_dmlab_df.ckpt` | U-Net | DMLab | DF |
	| `unet_dmlab_mask_only.ckpt` | U-Net | DMLab | Mask-only |
	| `unet_dmlab_clean_only.ckpt` | U-Net | DMLab | Clean-only |
	| `unet_dmlab_full.ckpt` | U-Net | DMLab | Full |
	| `unet_minecraft_df.ckpt` | U-Net | Minecraft | DF |
	| `unet_minecraft_mask_only.ckpt` | U-Net | Minecraft | Mask-only |
	| `unet_minecraft_clean_only.ckpt` | U-Net | Minecraft | Clean-only |
	| `unet_minecraft_full.ckpt` | U-Net | Minecraft | Full |
	| `dit_dmlab_df.ckpt` | DiT | DMLab | DF |
	| `dit_dmlab_mask_only.ckpt` | DiT | DMLab | Mask-only |
	| `dit_dmlab_clean_only.ckpt` | DiT | DMLab | Clean-only |
	| `dit_dmlab_full.ckpt` | DiT | DMLab | Full |
	| `dit_minecraft_df.ckpt` | DiT | Minecraft | DF |
	| `dit_minecraft_mask_only.ckpt` | DiT | Minecraft | Mask-only |
	| `dit_minecraft_clean_only.ckpt` | DiT | Minecraft | Clean-only |
	| `dit_minecraft_full.ckpt` | DiT | Minecraft | Full |
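
The file names in the table follow the `{backbone}_{dataset}_{config}.ckpt` pattern, so they can be generated or parsed programmatically. The sketch below is only an illustration: the helper names `main_checkpoints` and `parse_name` are hypothetical and are not part of the code repository.

```python
import os
from itertools import product

# Name components exactly as they appear in the checkpoint table.
BACKBONES = ("unet", "dit")
DATASETS = ("dmlab", "minecraft")
CONFIGS = ("df", "mask_only", "clean_only", "full")


def main_checkpoints():
    """List the 16 main checkpoint file names in table order."""
    return [
        f"{backbone}_{dataset}_{config}.ckpt"
        for backbone, dataset, config in product(BACKBONES, DATASETS, CONFIGS)
    ]


def parse_name(filename):
    """Split a main checkpoint file name into (backbone, dataset, config)."""
    stem, _ext = os.path.splitext(filename)
    # Limit the split to two cuts because configs such as "mask_only"
    # contain an underscore themselves.
    backbone, dataset, config = stem.split("_", 2)
    return backbone, dataset, config
```

For example, `parse_name("unet_dmlab_mask_only.ckpt")` yields the U-Net backbone, the DMLab dataset, and the Mask-only configuration.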

	All sixteen share the same optimiser, schedule, and diffusion settings
	(AdamW, lr 8×10⁻⁵, 100k steps, batch 8, fp16, cosine schedule with K=1000,
	v-prediction; DDIM with 100 steps, η=0 at inference); only the architecture
	and the training configuration differ. The U-Net is the 3D-convolutional
	Diffusion-Forcing backbone (≈18.65 M params); the DiT is a strictly
	frame-causal diffusion transformer (≈18.84 M params).
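
To make the diffusion settings above concrete, here is a minimal scalar sketch of a cosine noise schedule, the v-prediction target, and an evenly strided 100-step sampling grid. It rests on assumptions that the README does not confirm: the cosine schedule is taken in its common offset form with `s = 0.008`, and the DDIM steps are assumed to be evenly spaced. The function names are hypothetical, and the code repository remains the reference for the exact variant used.

```python
import math

K = 1000  # number of training noise levels (from the settings above)


def alpha_bar(k, num_levels=K, s=0.008):
    """Cumulative signal level at noise level k (assumed cosine form)."""
    def f(t):
        return math.cos((t / num_levels + s) / (1 + s) * math.pi / 2) ** 2
    return f(k) / f(0)


def v_target(x0, eps, k):
    """v-prediction target: sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a = math.sqrt(alpha_bar(k))
    sigma = math.sqrt(1.0 - alpha_bar(k))
    return a * eps - sigma * x0


def ddim_timesteps(num_steps=100, num_levels=K):
    """Evenly strided noise levels, highest first (assumed spacing)."""
    stride = num_levels // num_steps
    return list(range(num_levels - 1, -1, -stride))
```

With η=0 the DDIM update adds no fresh noise between these levels, so sampling is deterministic given the initial noise.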

	## Ablation checkpoints (`ablation/`)

	The causal-GroupNorm ablation (paper §4 / Table 4): the U-Net trained on
	DMLab to a matched 30k-step budget, with the standard (leaky) temporal
	GroupNorm versus a frame-causal variant, under DF and Full.

	| File | GroupNorm | Config |
	|---|---|---|
	| `ablation/unet_dmlab_leaky_df.ckpt` | leaky (standard) | DF |
	| `ablation/unet_dmlab_leaky_full.ckpt` | leaky (standard) | Full |
	| `ablation/unet_dmlab_framecausal_df.ckpt` | frame-causal | DF |
	| `ablation/unet_dmlab_framecausal_full.ckpt` | frame-causal | Full |

	## Loading

	These are PyTorch-Lightning checkpoints. Load them with the matching config
	from the [code repository](https://github.com/jwei302/cct); the U-Net and
	DiT backbones and all training settings are specified there.
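
For a quick look at a checkpoint without the full training setup, the weights can be read directly with PyTorch, since Lightning checkpoints are ordinary `torch.save` files that store the weights under the `"state_dict"` key. The sketch below assumes a locally downloaded file. The helper names and the `"model."` prefix in the usage note are hypothetical, so check the actual key names before relying on them.

```python
def load_lightning_state_dict(path):
    """Return the weight dictionary stored in a PyTorch-Lightning checkpoint."""
    import torch  # imported lazily so strip_prefix works without PyTorch

    # weights_only=False lets torch unpickle the non-tensor metadata that
    # Lightning stores next to the weights; only use it on trusted files.
    checkpoint = torch.load(path, map_location="cpu", weights_only=False)
    return checkpoint["state_dict"]


def strip_prefix(state_dict, prefix):
    """Keep the entries whose key starts with `prefix`, with the prefix removed.

    Lightning prefixes each weight with the attribute name of the submodule
    that owns it, so this is the usual step before loading the weights into a
    bare backbone.
    """
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }
```

A typical use would be `weights = strip_prefix(load_lightning_state_dict("unet_dmlab_full.ckpt"), "model.")`, followed by `load_state_dict(weights)` on a backbone built from the matching config in the code repository.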
