# WorldKit / pusht-base

A base world model trained on the Push-T task using WorldKit.

## Model Details

| Property | Value |
|---|---|
| Architecture | JEPA (Joint-Embedding Predictive Architecture) |
| Config | base |
| Parameters | 13M |
| Latent Dim | 192 |
| Image Size | 96x96 |
| Action Dim | 2 (dx, dy) |
| File Size | 50.2 MB |
| Training Time | 2 minutes (Apple M4 Pro, MPS) |
| Best Val Loss | 0.3500 |

## Usage

```bash
pip install worldkit
```

```python
from worldkit import WorldModel

# Load this model
model = WorldModel.from_hub("DilpreetBansi/pusht-base")

# Encode an observation
z = model.encode(observation)  # -> (192,) latent vector

# Predict future states
result = model.predict(current_frame, actions)

# Plan to reach a goal
plan = model.plan(current_frame, goal_frame, max_steps=50)

# Score physical plausibility
score = model.plausibility(video_frames)
```

## Task: Push-T

The Push-T task is a 2D manipulation environment where an agent (blue circle) pushes a T-shaped block (red) toward a target position. Observations are 96x96 RGB images and actions are 2D continuous (dx, dy).
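For orientation, the observation and action shapes can be sketched with dummy arrays (the variable names and the 8-step horizon below are illustrative, not part of the WorldKit API):

```python
import numpy as np

# Dummy Push-T observation: one 96x96 RGB frame, the model's input size.
observation = np.zeros((96, 96, 3), dtype=np.uint8)

# A single 2D continuous action: (dx, dy) displacement of the pusher.
action = np.array([0.03, -0.01], dtype=np.float32)

# A sequence of 8 actions, the kind of horizon a predict() call consumes.
actions = np.zeros((8, 2), dtype=np.float32)
```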

## Training

Trained using WorldKit's built-in training pipeline:

```python
from worldkit import WorldModel

model = WorldModel.train(
    data="pusht_train.h5",
    config="base",
    epochs=50,
    batch_size=32,
    lr=3e-4,
    lambda_reg=0.5,
    action_dim=2,
)
```
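The `lambda_reg` parameter weights a latent regularizer against the prediction loss (`L_pred + lambda * SIGReg(Z)`). The exact SIGReg formulation is defined in the paper; the sketch below substitutes a simplified stand-in that pushes batch latents toward zero mean and unit variance, which conveys the shape of the objective rather than its precise form:

```python
import numpy as np

def pred_loss(z_pred, z_target):
    # Mean-squared error between predicted and target latents.
    return float(np.mean((z_pred - z_target) ** 2))

def isotropic_gaussian_penalty(z):
    # Simplified stand-in for SIGReg: penalize deviation of the
    # batch latent statistics from zero mean and unit variance.
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return float(np.mean(mean ** 2) + np.mean((var - 1.0) ** 2))

rng = np.random.default_rng(0)
z_pred = rng.normal(size=(32, 192))    # batch of predicted latents
z_target = rng.normal(size=(32, 192))  # batch of target latents

lambda_reg = 0.5  # matches the training call above
loss = pred_loss(z_pred, z_target) + lambda_reg * isotropic_gaussian_penalty(z_pred)
```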

## Architecture

Based on the LeWorldModel paper (Maes et al., 2026):

- Encoder: Vision Transformer (ViT) with CLS token pooling
- Predictor: Transformer with AdaLN-Zero conditioning on actions
- Loss: L_pred + lambda * SIGReg(Z)
- Planner: Cross-Entropy Method (CEM) in latent space
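The planner can be sketched as a generic Cross-Entropy Method over action sequences. This standalone example uses a toy quadratic cost in place of the learned latent distance (the real planner scores candidate rollouts with the predictor); all names here are illustrative:

```python
import numpy as np

def cem_plan(cost_fn, horizon, action_dim, iters=5, pop=64, n_elite=8, seed=0):
    # Cross-Entropy Method: repeatedly sample action sequences from a
    # Gaussian, then refit the Gaussian to the lowest-cost "elite" samples.
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, horizon, action_dim))
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:n_elite]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # floor to avoid collapse
    return mu

# Toy cost: distance of the total (dx, dy) displacement from a goal offset.
goal = np.array([0.5, -0.2])
def cost_fn(actions):
    return float(np.sum((actions.sum(axis=0) - goal) ** 2))

plan = cem_plan(cost_fn, horizon=10, action_dim=2)
```

In the real model, the cost would be the distance between predicted future latents and the encoded goal frame.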

## Citation

If you use this model, please cite WorldKit and the LeWorldModel paper:

```bibtex
@software{worldkit,
  title = {WorldKit: The Open-Source World Model Runtime},
  author = {Bansi, Dilpreet},
  year = {2026},
  url = {https://github.com/DilpreetBansi/worldkit}
}
```

## License

MIT License. See WorldKit LICENSE.


Built with WorldKit by Dilpreet Bansi.
