WorldKit / pusht-base
A base world model trained on the Push-T task using WorldKit.
Model Details
| Property | Value |
|---|---|
| Architecture | JEPA (Joint-Embedding Predictive Architecture) |
| Config | base |
| Parameters | 13M |
| Latent Dim | 192 |
| Image Size | 96x96 |
| Action Dim | 2 (dx, dy) |
| File Size | 50.2 MB |
| Training Time | 2 minutes (Apple M4 Pro, MPS) |
| Best Val Loss | 0.3500 |
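
As a quick consistency check on the table above, the file size and parameter count can be compared. This sketch assumes the weights are stored as 32-bit floats and that "MB" means decimal megabytes; neither is stated in the table, and any checkpoint metadata overhead is ignored.

```python
# Values taken from the Model Details table.
file_size_bytes = 50.2e6   # "50.2 MB", read as decimal megabytes (assumption)
bytes_per_param = 4        # assumption: float32 weights

# Approximate parameter count implied by the file size.
approx_params = file_size_bytes / bytes_per_param
print(f"{approx_params / 1e6:.2f}M parameters")
```

Under these assumptions the file size implies roughly 12.5M parameters, which is in line with the 13M listed.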
Usage
```bash
pip install worldkit
```

```python
from worldkit import WorldModel

# Load this model
model = WorldModel.from_hub("DilpreetBansi/pusht-base")

# Encode an observation
z = model.encode(observation)  # -> (192,) latent vector

# Predict future states
result = model.predict(current_frame, actions)

# Plan to reach a goal
plan = model.plan(current_frame, goal_frame, max_steps=50)

# Score physical plausibility
score = model.plausibility(video_frames)
```
Task: Push-T
The Push-T task is a 2D manipulation environment where an agent (blue circle) pushes a T-shaped block (red) toward a target position. Observations are 96x96 RGB images and actions are 2D continuous (dx, dy).
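
To make the observation and action formats described above concrete, here is a minimal sketch that builds placeholder arrays of the stated sizes with NumPy. The `uint8` height-width-channel layout, the action range of [-1, 1], and the sequence length of 10 are assumptions for illustration; check the WorldKit documentation for the exact input format the model expects.

```python
import numpy as np

rng = np.random.default_rng(0)

# One observation: a 96x96 RGB image.
# Assumption: uint8 pixels in height-width-channel order.
current_frame = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)

# A sequence of continuous 2D actions, one (dx, dy) pair per step.
# Assumption: 10 steps with values in [-1, 1].
actions = rng.uniform(-1.0, 1.0, size=(10, 2)).astype(np.float32)

print(current_frame.shape, current_frame.dtype)
print(actions.shape, actions.dtype)
```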
Training
This model was trained with WorldKit's built-in training pipeline:
```python
from worldkit import WorldModel

model = WorldModel.train(
    data="pusht_train.h5",
    config="base",
    epochs=50,
    batch_size=32,
    lr=3e-4,
    lambda_reg=0.5,
    action_dim=2,
)
```
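
The `lambda_reg` argument presumably sets the weight of the regularization term in the loss listed under Architecture. Assuming that mapping holds (it is inferred from the parameter name, not confirmed here), the objective for this run would read as follows, where $\mathcal{L}_{\text{pred}}$ is the prediction loss and $Z$ denotes the latent embeddings:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{pred}} \;+\; \lambda \,\mathrm{SIGReg}(Z),
\qquad \lambda = 0.5
```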
Architecture
Based on the LeWorldModel paper (Maes et al., 2026):
- Encoder: Vision Transformer (ViT) with CLS token pooling
- Predictor: Transformer with AdaLN-Zero conditioning on actions
- Loss: L_pred + lambda * SIGReg(Z)
- Planner: Cross-Entropy Method (CEM) in latent space
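
The planner item above names the Cross-Entropy Method. As a rough illustration of the idea, and not WorldKit's actual implementation, the sketch below runs CEM over action sequences in a hypothetical 2-D latent space. The additive dynamics in `toy_step`, the function names, and all hyperparameters are illustrative assumptions; in the real model the learned predictor would play the role of `step_fn`.

```python
import numpy as np

def cem_plan(z0, z_goal, step_fn, horizon=5, action_dim=2,
             n_samples=256, n_elites=32, n_iters=10, seed=0):
    """Cross-Entropy Method over action sequences (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Start from a broad Gaussian over action sequences.
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        candidates = mean + std * noise
        # Roll every candidate forward through the latent dynamics.
        z = np.repeat(z0[None, :], n_samples, axis=0)
        for t in range(horizon):
            z = step_fn(z, candidates[:, t])
        # Cost: squared distance between the final latent and the goal latent.
        cost = np.sum((z - z_goal) ** 2, axis=1)
        # Refit the Gaussian to the lowest-cost (elite) candidates.
        elites = candidates[np.argsort(cost)[:n_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean

def toy_step(z, a):
    # Hypothetical stand-in dynamics: each action shifts the latent directly.
    return z + a

z0 = np.zeros(2)
z_goal = np.array([3.0, -2.0])
plan = cem_plan(z0, z_goal, toy_step)

# With additive dynamics, the final latent is the start plus the summed actions.
final = z0 + plan.sum(axis=0)
print(plan.shape, np.round(final, 2))
```

Each iteration keeps the best `n_elites` of `n_samples` candidates, so the sampling distribution narrows around action sequences that end near the goal latent.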
Citation
If you use this model, please cite WorldKit and the LeWorldModel paper:
```bibtex
@software{worldkit,
  title = {WorldKit: The Open-Source World Model Runtime},
  author = {Bansi, Dilpreet},
  year = {2026},
  url = {https://github.com/DilpreetBansi/worldkit}
}
```
License
MIT License. See WorldKit LICENSE.
Built with WorldKit by Dilpreet Bansi.