WorldKit / pusht-nano

A nano world model trained on the Push-T task using WorldKit.

Model Details

| Property      | Value                                        |
|---------------|----------------------------------------------|
| Architecture  | JEPA (Joint-Embedding Predictive Architecture) |
| Config        | nano                                         |
| Parameters    | 3.5M                                         |
| Latent Dim    | 128                                          |
| Image Size    | 96×96                                        |
| Action Dim    | 2 (dx, dy)                                   |
| File Size     | 13.4 MB                                      |
| Training Time | 30 seconds (Apple M4 Pro, MPS)               |
| Best Val Loss | 0.4800                                       |

Usage

pip install worldkit

from worldkit import WorldModel

# Load this model
model = WorldModel.from_hub("DilpreetBansi/pusht-nano")

# Encode an observation
z = model.encode(observation)  # -> (128,) latent vector

# Predict future states
result = model.predict(current_frame, actions)

# Plan to reach a goal
plan = model.plan(current_frame, goal_frame, max_steps=50)

# Score physical plausibility
score = model.plausibility(video_frames)

Task: Push-T

The Push-T task is a 2D manipulation environment where an agent (blue circle) pushes a T-shaped block (red) toward a target position. Observations are 96x96 RGB images and actions are 2D continuous (dx, dy).
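Concretely, an observation is a single RGB frame and an action is a 2-vector of position deltas. A minimal sketch of the shapes involved (variable names here are illustrative, not part of the WorldKit API):

```python
import numpy as np

# Illustrative I/O shapes for Push-T; these arrays are placeholders,
# not values produced by WorldKit itself.
observation = np.zeros((96, 96, 3), dtype=np.uint8)   # one 96x96 RGB frame
action = np.array([0.02, -0.01], dtype=np.float32)    # (dx, dy) push delta
```

An episode is then a sequence of such frames paired with the actions taken between them.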

Training

Trained using WorldKit's built-in training pipeline:

from worldkit import WorldModel

model = WorldModel.train(
    data="pusht_train.h5",
    config="nano",
    epochs=50,
    batch_size=32,
    lr=3e-4,
    lambda_reg=0.5,
    action_dim=2,
)

Architecture

Based on the LeWorldModel paper (Maes et al., 2026):

  • Encoder: Vision Transformer (ViT) with CLS token pooling
  • Predictor: Transformer with AdaLN-Zero conditioning on actions
  • Loss: L_pred + lambda * SIGReg(Z)
  • Planner: Cross-Entropy Method (CEM) in latent space
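The planning step above can be sketched as follows: sample candidate action sequences, roll each through the latent dynamics, score by distance to the goal latent, and refit the sampling distribution to the elites. This is a self-contained illustration with a toy dynamics function, not WorldKit's actual planner:

```python
import numpy as np

def cem_plan(z0, z_goal, dynamics, horizon=10, pop=64,
             elites=8, iters=5, action_dim=2, seed=0):
    """Cross-Entropy Method in latent space: search for an action
    sequence whose predicted final latent lands near the goal latent."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences: (pop, horizon, action_dim)
        cand = mu + sigma * rng.standard_normal((pop, horizon, action_dim))
        costs = np.empty(pop)
        for i in range(pop):
            z = z0
            for a in cand[i]:
                z = dynamics(z, a)           # roll forward in latent space
            costs[i] = np.linalg.norm(z - z_goal)
        # Refit the Gaussian to the lowest-cost (elite) sequences
        best = cand[np.argsort(costs)[:elites]]
        mu, sigma = best.mean(axis=0), best.std(axis=0) + 1e-6
    return mu  # planned action sequence, shape (horizon, action_dim)

# Toy stand-in for the learned predictor: the latent drifts by the
# action, zero-padded to the 128-dim latent.
def toy_dynamics(z, a):
    step = np.zeros_like(z)
    step[: a.shape[0]] = a
    return z + step

z0 = np.zeros(128)
z_goal = np.zeros(128)
z_goal[:2] = [1.0, -0.5]
plan = cem_plan(z0, z_goal, toy_dynamics, horizon=5)
```

In the real model, `toy_dynamics` is replaced by the trained transformer predictor, and the cost can mix latent distance with other terms.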

Citation

If you use this model, please cite WorldKit and the LeWorldModel paper:

@software{worldkit,
  title = {WorldKit: The Open-Source World Model Runtime},
  author = {Bansi, Dilpreet},
  year = {2026},
  url = {https://github.com/DilpreetBansi/worldkit}
}

License

MIT License. See WorldKit LICENSE.


Built with WorldKit by Dilpreet Bansi.
