WorldKit / pusht-nano
A nano world model trained on the Push-T task using WorldKit.
Model Details
| Property | Value |
|---|---|
| Architecture | JEPA (Joint-Embedding Predictive Architecture) |
| Config | nano |
| Parameters | 3.5M |
| Latent Dim | 128 |
| Image Size | 96x96 |
| Action Dim | 2 (dx, dy) |
| File Size | 13.4 MB |
| Training Time | 30 seconds (Apple M4 Pro, MPS) |
| Best Val Loss | 0.4800 |
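
As a quick sanity check on the table, the reported file size is roughly what 3.5M parameters would occupy if stored as 32-bit floats. The dtype is an assumption (the card does not state it), and the parameter count is rounded, so treat this as an approximate sketch:

```python
# Assumption: weights are stored as float32 (4 bytes each); the card does not state the dtype.
n_params = 3.5e6        # "3.5M" from the table (a rounded figure)
bytes_per_param = 4     # float32, assumed

size_bytes = n_params * bytes_per_param
size_mb = size_bytes / 1e6       # decimal megabytes
size_mib = size_bytes / 2**20    # binary mebibytes

print(f"{size_mb:.1f} MB, {size_mib:.1f} MiB")
```

Under that assumption the weights come to about 14.0 MB, or about 13.4 MiB, which is in line with the listed file size if it was measured in binary units.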
Usage
pip install worldkit
from worldkit import WorldModel
# Load this model
model = WorldModel.from_hub("DilpreetBansi/pusht-nano")
# Encode an observation
z = model.encode(observation) # -> (128,) latent vector
# Predict future states
result = model.predict(current_frame, actions)
# Plan to reach a goal
plan = model.plan(current_frame, goal_frame, max_steps=50)
# Score physical plausibility
score = model.plausibility(video_frames)
Task: Push-T
Push-T is a 2D manipulation environment in which an agent (blue circle) pushes a T-shaped block (red) toward a target position. Observations are 96x96 RGB images, and actions are continuous 2D vectors (dx, dy).
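
To give a sense of how much the encoder compresses each observation, the arithmetic below compares the number of raw values in one 96x96 RGB frame with the 128-dimensional latent. It uses only figures stated in this card; the variable names are illustrative.

```python
# Figures taken from this card: 96x96 RGB observations, 128-dim latent.
height, width, channels = 96, 96, 3
latent_dim = 128

obs_values = height * width * channels   # raw values per frame
compression = obs_values / latent_dim    # raw values per latent dimension

print(obs_values, compression)
```

Each frame holds 27,648 values, so the latent is about 216 times smaller than the raw observation.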
Training
Trained using WorldKit's built-in training pipeline:
from worldkit import WorldModel
model = WorldModel.train(
    data="pusht_train.h5",
    config="nano",
    epochs=50,
    batch_size=32,
    lr=3e-4,
    lambda_reg=0.5,
    action_dim=2,
)
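
For a rough idea of how `epochs` and `batch_size` translate into optimizer steps, the helper below counts batches per epoch. The dataset size (`n_samples=10_000`) is purely hypothetical, and whether WorldKit keeps or drops a final partial batch is not stated here, so both cases are covered:

```python
import math

def optimizer_steps(n_samples, batch_size=32, epochs=50, drop_last=False):
    """Count gradient steps for a plain epoch-based loop (one step per batch)."""
    if drop_last:
        per_epoch = n_samples // batch_size            # discard the partial batch
    else:
        per_epoch = math.ceil(n_samples / batch_size)  # keep the partial batch
    return per_epoch * epochs

# n_samples is a hypothetical value, not the size of pusht_train.h5.
steps = optimizer_steps(n_samples=10_000)
steps_drop = optimizer_steps(n_samples=10_000, drop_last=True)
print(steps, steps_drop)
```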
Architecture
Based on the LeWorldModel paper (Maes et al., 2026):
- Encoder: Vision Transformer (ViT) with CLS token pooling
- Predictor: Transformer with AdaLN-Zero conditioning on actions
- Loss: L_pred + lambda * SIGReg(Z)
- Planner: Cross-Entropy Method (CEM) in latent space
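
The planner item above can be illustrated with a minimal, standard-library sketch of the Cross-Entropy Method. This is not WorldKit's implementation: the additive `rollout` dynamics stand in for the learned predictor, the latent is 2D for readability, and all function names and hyperparameters are assumptions chosen for the example.

```python
import random
import statistics

def rollout(z, actions):
    """Toy latent dynamics (an assumption): each (dx, dy) action shifts the state."""
    x, y = z
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return (x, y)

def cost(z, z_goal):
    """Squared Euclidean distance between a latent state and the goal latent."""
    return sum((a - b) ** 2 for a, b in zip(z, z_goal))

def cem_plan(z0, z_goal, horizon=5, pop=200, n_elite=20, iters=30, seed=0):
    rng = random.Random(seed)
    dim = horizon * 2                 # flattened (dx, dy) pairs
    mean, std = [0.0] * dim, [1.0] * dim

    def score(flat):
        actions = list(zip(flat[0::2], flat[1::2]))
        return cost(rollout(z0, actions), z_goal)

    for _ in range(iters):
        # Sample candidate action sequences from the current diagonal Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        # Keep the lowest-cost candidates and refit the Gaussian to them.
        elites = sorted(samples, key=score)[:n_elite]
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]

    return list(zip(mean[0::2], mean[1::2]))

z_start, z_goal = (0.0, 0.0), (3.0, -2.0)
plan = cem_plan(z_start, z_goal)
final_cost = cost(rollout(z_start, plan), z_goal)
print(len(plan), final_cost)
```

In a real world model the rollout would call the learned predictor on encoded frames, and the cost would compare the predicted latent with the encoded goal frame. The sample, select, refit loop stays the same.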
Citation
If you use this model, please cite WorldKit and the LeWorldModel paper:
@software{worldkit,
title = {WorldKit: The Open-Source World Model Runtime},
author = {Bansi, Dilpreet},
year = {2026},
url = {https://github.com/DilpreetBansi/worldkit}
}
License
MIT License. See WorldKit LICENSE.
Built with WorldKit by Dilpreet Bansi.