Jasmine Diffusion Checkpoint

Architecture: ST-DiT (spatio-temporal diffusion transformer)
Input: 16-frame sequences at 64×64 resolution, plus latent actions
Training Environment: CoinRun (Cobbe et al., 2020)
Objective: Diffusion forcing (x-prediction)

This checkpoint is a pretrained diffusion-based world model from the Jasmine codebase.
It was trained on the CoinRun dataset for action-conditioned video generation using the diffusion-forcing objective (Chen et al., 2024).
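
For readers unfamiliar with the objective, the following is a minimal JAX sketch of a diffusion-forcing training loss with x-prediction. It is not taken from the Jasmine codebase. The function name `diffusion_forcing_loss`, the `denoise_fn` signature, the linear noise schedule, the RGB channel count, and the latent-action dimension of 4 are all assumptions made for illustration. The two points it is meant to show are that each frame draws its own independent noise level, and that the network is trained to predict the clean frame rather than the noise.

```python
import jax
import jax.numpy as jnp


def diffusion_forcing_loss(denoise_fn, params, frames, actions, key):
    """x-prediction loss with an independent noise level per frame.

    frames:  (B, T, H, W, C) clean video, e.g. T=16 and H=W=64
    actions: (B, T, A) latent actions (A is a hypothetical dimension)
    denoise_fn(params, noisy, t, actions) -> predicted clean frames
    """
    batch, num_frames = frames.shape[:2]
    k_level, k_noise = jax.random.split(key)

    # Diffusion forcing: every frame gets its own noise level in [0, 1),
    # instead of one shared level for the whole sequence.
    t = jax.random.uniform(k_level, (batch, num_frames))
    noise = jax.random.normal(k_noise, frames.shape)

    # Assumed linear schedule: t=0 is the clean frame, t=1 is pure noise.
    t_b = t[:, :, None, None, None]
    noisy = (1.0 - t_b) * frames + t_b * noise

    # x-prediction: the regression target is the clean frame itself.
    pred = denoise_fn(params, noisy, t, actions)
    return jnp.mean((pred - frames) ** 2)


# Toy usage with a stand-in "model" that returns its noisy input unchanged.
frames = jnp.zeros((2, 16, 64, 64, 3))   # 3 channels is an assumption
actions = jnp.zeros((2, 16, 4))          # action dimension is an assumption
identity_denoiser = lambda params, noisy, t, actions: noisy
loss = diffusion_forcing_loss(
    identity_denoiser, None, frames, actions, jax.random.PRNGKey(0)
)
```

In practice `denoise_fn` would be the ST-DiT forward pass, and the actual checkpoint may use a different noise schedule or loss weighting than the one assumed here.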