world model
latent action
video generation