WorldKit / pusht-nano

A nano world model trained on the Push-T task using WorldKit.

Model Details

| Property      | Value                                        |
|---------------|----------------------------------------------|
| Architecture  | JEPA (Joint-Embedding Predictive Architecture) |
| Config        | nano                                         |
| Parameters    | 3.5M                                         |
| Latent Dim    | 128                                          |
| Image Size    | 96×96                                        |
| Action Dim    | 2 (dx, dy)                                   |
| File Size     | 13.4 MB                                      |
| Training Time | 30 seconds (Apple M4 Pro, MPS)               |
| Best Val Loss | 0.4800                                       |

Usage

pip install worldkit

from worldkit import WorldModel

# Load this model
model = WorldModel.from_hub("DilpreetBansi/pusht-nano")

# Encode an observation
z = model.encode(observation)  # -> (128,) latent vector

# Predict future states
result = model.predict(current_frame, actions)

# Plan to reach a goal
plan = model.plan(current_frame, goal_frame, max_steps=50)

# Score physical plausibility
score = model.plausibility(video_frames)

Task: Push-T

The Push-T task is a 2D manipulation environment where an agent (blue circle) pushes a T-shaped block (red) toward a target position. Observations are 96x96 RGB images and actions are 2D continuous (dx, dy).
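Concretely, an observation is a single RGB frame and an action is a 2-vector of position deltas. A minimal sketch of the shapes involved (variable names here are illustrative, not part of the WorldKit API):

```python
import numpy as np

# Illustrative I/O shapes for Push-T; these arrays are placeholders,
# not values produced by WorldKit itself.
observation = np.zeros((96, 96, 3), dtype=np.uint8)   # one 96x96 RGB frame
action = np.array([0.02, -0.01], dtype=np.float32)    # (dx, dy) push delta
```

An episode is then a sequence of such frames paired with the actions taken between them.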

Training

Trained using WorldKit's built-in training pipeline:

from worldkit import WorldModel

model = WorldModel.train(
    data="pusht_train.h5",
    config="nano",
    epochs=50,
    batch_size=32,
    lr=3e-4,
    lambda_reg=0.5,
    action_dim=2,
)

Architecture

Based on the LeWorldModel paper (Maes et al., 2026):

  • Encoder: Vision Transformer (ViT) with CLS token pooling
  • Predictor: Transformer with AdaLN-Zero conditioning on actions
  • Loss: L_pred + lambda * SIGReg(Z)
  • Planner: Cross-Entropy Method (CEM) in latent space
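The planning step above can be sketched as follows: sample candidate action sequences, roll each through the latent dynamics, score by distance to the goal latent, and refit the sampling distribution to the elites. This is a self-contained illustration with a toy dynamics function, not WorldKit's actual planner:

```python
import numpy as np

def cem_plan(z0, z_goal, dynamics, horizon=10, pop=64,
             elites=8, iters=5, action_dim=2, seed=0):
    """Cross-Entropy Method in latent space: search for an action
    sequence whose predicted final latent lands near the goal latent."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences: (pop, horizon, action_dim)
        cand = mu + sigma * rng.standard_normal((pop, horizon, action_dim))
        costs = np.empty(pop)
        for i in range(pop):
            z = z0
            for a in cand[i]:
                z = dynamics(z, a)           # roll forward in latent space
            costs[i] = np.linalg.norm(z - z_goal)
        # Refit the Gaussian to the lowest-cost (elite) sequences
        best = cand[np.argsort(costs)[:elites]]
        mu, sigma = best.mean(axis=0), best.std(axis=0) + 1e-6
    return mu  # planned action sequence, shape (horizon, action_dim)

# Toy stand-in for the learned predictor: the latent drifts by the
# action, zero-padded to the 128-dim latent.
def toy_dynamics(z, a):
    step = np.zeros_like(z)
    step[: a.shape[0]] = a
    return z + step

z0 = np.zeros(128)
z_goal = np.zeros(128)
z_goal[:2] = [1.0, -0.5]
plan = cem_plan(z0, z_goal, toy_dynamics, horizon=5)
```

In the real model, `toy_dynamics` is replaced by the trained transformer predictor, and the cost can mix latent distance with other terms.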

Citation

If you use this model, please cite WorldKit and the LeWorldModel paper:

@software{worldkit,
  title = {WorldKit: The Open-Source World Model Runtime},
  author = {Bansi, Dilpreet},
  year = {2026},
  url = {https://github.com/DilpreetBansi/worldkit}
}

License

MIT License. See WorldKit LICENSE.


Built with WorldKit by Dilpreet Bansi.
