# WorldKit / pusht-base

A base world model trained on the Push-T task using WorldKit.

## Model Details

| Property | Value |
|---|---|
| Architecture | JEPA (Joint-Embedding Predictive Architecture) |
| Config | base |
| Parameters | 13M |
| Latent Dim | 192 |
| Image Size | 96x96 |
| Action Dim | 2 (dx, dy) |
| File Size | 50.2 MB |
| Training Time | 2 minutes (Apple M4 Pro, MPS) |
| Best Val Loss | 0.3500 |

## Usage

```bash
pip install worldkit
```

```python
from worldkit import WorldModel

# Load this model
model = WorldModel.from_hub("DilpreetBansi/pusht-base")

# Encode an observation
z = model.encode(observation)  # -> (192,) latent vector

# Predict future states
result = model.predict(current_frame, actions)

# Plan to reach a goal
plan = model.plan(current_frame, goal_frame, max_steps=50)

# Score physical plausibility
score = model.plausibility(video_frames)
```

## Task: Push-T

The Push-T task is a 2D manipulation environment where an agent (blue circle) pushes a T-shaped block (red) toward a target position. Observations are 96x96 RGB images and actions are 2D continuous (dx, dy).
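For orientation, the observation and action shapes can be sketched with dummy arrays (the variable names and the 8-step horizon below are illustrative, not part of the WorldKit API):

```python
import numpy as np

# Dummy Push-T observation: one 96x96 RGB frame, the model's input size.
observation = np.zeros((96, 96, 3), dtype=np.uint8)

# A single 2D continuous action: (dx, dy) displacement of the pusher.
action = np.array([0.03, -0.01], dtype=np.float32)

# A sequence of 8 actions, the kind of horizon a predict() call consumes.
actions = np.zeros((8, 2), dtype=np.float32)
```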

## Training

Trained using WorldKit's built-in training pipeline:

```python
from worldkit import WorldModel

model = WorldModel.train(
    data="pusht_train.h5",
    config="base",
    epochs=50,
    batch_size=32,
    lr=3e-4,
    lambda_reg=0.5,
    action_dim=2,
)
```
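The `lambda_reg` parameter weights a latent regularizer against the prediction loss (`L_pred + lambda * SIGReg(Z)`). The exact SIGReg formulation is defined in the paper; the sketch below substitutes a simplified stand-in that pushes batch latents toward zero mean and unit variance, which conveys the shape of the objective rather than its precise form:

```python
import numpy as np

def pred_loss(z_pred, z_target):
    # Mean-squared error between predicted and target latents.
    return float(np.mean((z_pred - z_target) ** 2))

def isotropic_gaussian_penalty(z):
    # Simplified stand-in for SIGReg: penalize deviation of the
    # batch latent statistics from zero mean and unit variance.
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return float(np.mean(mean ** 2) + np.mean((var - 1.0) ** 2))

rng = np.random.default_rng(0)
z_pred = rng.normal(size=(32, 192))    # batch of predicted latents
z_target = rng.normal(size=(32, 192))  # batch of target latents

lambda_reg = 0.5  # matches the training call above
loss = pred_loss(z_pred, z_target) + lambda_reg * isotropic_gaussian_penalty(z_pred)
```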

## Architecture

Based on the LeWorldModel paper (Maes et al., 2026):

- Encoder: Vision Transformer (ViT) with CLS token pooling
- Predictor: Transformer with AdaLN-Zero conditioning on actions
- Loss: L_pred + lambda * SIGReg(Z)
- Planner: Cross-Entropy Method (CEM) in latent space
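The planner can be sketched as a generic Cross-Entropy Method over action sequences. This standalone example uses a toy quadratic cost in place of the learned latent distance (the real planner scores candidate rollouts with the predictor); all names here are illustrative:

```python
import numpy as np

def cem_plan(cost_fn, horizon, action_dim, iters=5, pop=64, n_elite=8, seed=0):
    # Cross-Entropy Method: repeatedly sample action sequences from a
    # Gaussian, then refit the Gaussian to the lowest-cost "elite" samples.
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, horizon, action_dim))
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:n_elite]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # floor to avoid collapse
    return mu

# Toy cost: distance of the total (dx, dy) displacement from a goal offset.
goal = np.array([0.5, -0.2])
def cost_fn(actions):
    return float(np.sum((actions.sum(axis=0) - goal) ** 2))

plan = cem_plan(cost_fn, horizon=10, action_dim=2)
```

In the real model, the cost would be the distance between predicted future latents and the encoded goal frame.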

## Citation

If you use this model, please cite WorldKit and the LeWorldModel paper:

```bibtex
@software{worldkit,
  title = {WorldKit: The Open-Source World Model Runtime},
  author = {Bansi, Dilpreet},
  year = {2026},
  url = {https://github.com/DilpreetBansi/worldkit}
}
```

## License

MIT License. See WorldKit LICENSE.


Built with WorldKit by Dilpreet Bansi.
