	---
	license: mit
	tags:
	- worldkit
	- world-model
	- jepa
	- robotics
	- push-t
	- planning
	library_name: worldkit
	pipeline_tag: reinforcement-learning
	---

	# WorldKit / pusht-base

	A base world model trained on the Push-T task using [WorldKit](https://github.com/DilpreetBansi/worldkit).

	## Model Details

	| Property | Value |
	|----------|-------|
	| Architecture | JEPA (Joint-Embedding Predictive Architecture) |
	| Config | `base` |
	| Parameters | 13M |
	| Latent Dim | 192 |
	| Image Size | 96x96 |
	| Action Dim | 2 (dx, dy) |
	| File Size | 50.2 MB |
	| Training Time | 2 minutes (Apple M4 Pro, MPS) |
	| Best Val Loss | 0.3500 |
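
As a rough sanity check on the table, the parameter count and file size can be compared directly. The sketch below assumes the weights are stored as 32-bit floats, which this card does not state, and treats "13M" as a rounded figure.

```python
# Rough size estimate from the parameter count in the table.
params = 13_000_000      # "13M" from the table (rounded)
bytes_per_param = 4      # assumption: float32 weights

size_mb = params * bytes_per_param / 1e6
print(f"{size_mb:.1f} MB")  # 52.0 MB
```

The estimate is close to the listed 50.2 MB, which would be consistent with float32 storage and a parameter count slightly under 13M.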

	## Usage

	```bash
	pip install worldkit
	```

	```python
	from worldkit import WorldModel

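	# Load this model from the Hugging Face Hub
	model = WorldModel.from_hub("DilpreetBansi/pusht-base")

	# The inputs below (observation, current_frame, goal_frame, actions,
	# video_frames) are supplied by the caller: frames are 96x96 RGB images
	# and each action is a (dx, dy) pair.

	# Encode an observation
	z = model.encode(observation)  # -> (192,) latent vector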

	# Predict future states
	result = model.predict(current_frame, actions)

	# Plan to reach a goal
	plan = model.plan(current_frame, goal_frame, max_steps=50)

	# Score physical plausibility
	score = model.plausibility(video_frames)
	```

	## Task: Push-T

	The Push-T task is a 2D manipulation environment where an agent (blue circle) pushes a T-shaped block (red) toward a target position. Observations are 96x96 RGB images and actions are 2D continuous (dx, dy).
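
To make the observation and action formats concrete, here is a minimal NumPy sketch of placeholder inputs with the shapes described above. The height-width-channel layout, the `uint8` and `float32` dtypes, and the 10-step sequence length are assumptions for illustration; this card specifies only 96x96 RGB images and 2D continuous actions.

```python
import numpy as np

# Assumed layout: height x width x channels, 8-bit RGB.
# Channel order and dtype are assumptions, not stated in this card.
observation = np.zeros((96, 96, 3), dtype=np.uint8)

# A single action is a 2D continuous displacement (dx, dy).
action = np.array([0.5, -0.25], dtype=np.float32)

# A hypothetical 10-step action sequence, one (dx, dy) pair per step.
actions = np.zeros((10, 2), dtype=np.float32)

print(observation.shape, action.shape, actions.shape)
```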

	## Training

	Trained using WorldKit's built-in training pipeline:

	```python
	from worldkit import WorldModel

	model = WorldModel.train(
	    data="pusht_train.h5",
	    config="base",
	    epochs=50,
	    batch_size=32,
	    lr=3e-4,
	    lambda_reg=0.5,
	    action_dim=2,
	)
	```

	## Architecture

	Based on the LeWorldModel paper (Maes et al., 2026):
	- Encoder: Vision Transformer (ViT) with CLS token pooling
	- Predictor: Transformer with AdaLN-Zero conditioning on actions
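	- Loss: `L_pred + lambda * SIGReg(Z)`, where `L_pred` is the prediction loss in latent space and `lambda` weights the SIGReg regularizer on the latents `Z` (presumably the `lambda_reg` argument of `WorldModel.train`)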
	- Planner: Cross-Entropy Method (CEM) in latent space
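
The planner item above can be illustrated with a generic Cross-Entropy Method loop. This is a minimal NumPy sketch, not WorldKit's implementation: `cem_plan`, `toy_step`, and all hyperparameters are hypothetical, and the toy dynamics stand in for the learned predictor. Only the 192-dimensional latent and the 2D action come from this card.

```python
import numpy as np

def cem_plan(z0, z_goal, step_fn, horizon=5, action_dim=2,
             n_samples=256, n_elites=32, n_iters=10, seed=0):
    """Generic CEM over action sequences, scored by distance in latent space."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        candidates = mean + std * noise
        # Roll every candidate forward through the stand-in latent dynamics.
        z = np.repeat(z0[None, :], n_samples, axis=0)
        for t in range(horizon):
            z = step_fn(z, candidates[:, t])
        # Cost: squared distance between the final latent and the goal latent.
        cost = np.sum((z - z_goal) ** 2, axis=1)
        # Refit the Gaussian to the lowest-cost (elite) candidates.
        elites = candidates[np.argsort(cost)[:n_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean

def toy_step(z, a):
    """Hypothetical dynamics: the first two latent dims shift by the action."""
    z = z.copy()
    z[:, :2] += a
    return z

z0 = np.zeros(192)
z_goal = np.zeros(192)
z_goal[:2] = [3.0, -2.0]

plan = cem_plan(z0, z_goal, toy_step)
print("planned total displacement:", plan.sum(axis=0))
```

In this toy setup the cost depends only on the summed displacement, so the planned actions should add up to roughly the offset between the start and goal latents. With a real world model, `step_fn` would be replaced by the action-conditioned predictor, and `z0` and `z_goal` by the encodings of the current and goal frames.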

	## Citation

	If you use this model, please cite WorldKit and the LeWorldModel paper (Maes et al., 2026). The WorldKit entry is:

	```bibtex
	@software{worldkit,
	  title = {WorldKit: The Open-Source World Model Runtime},
	  author = {Bansi, Dilpreet},
	  year = {2026},
	  url = {https://github.com/DilpreetBansi/worldkit}
	}
	```

	## License

	MIT License. See [WorldKit LICENSE](https://github.com/DilpreetBansi/worldkit/blob/main/LICENSE).

	---

	Built with [WorldKit](https://github.com/DilpreetBansi/worldkit) by [Dilpreet Bansi](https://github.com/DilpreetBansi).