sotoalt/lepong

Tags: Reinforcement Learning, computer-vision

lepong

A 13M-parameter JEPA (Joint-Embedding Predictive Architecture) world model that plays Pong by watching pixels.

Architecture

Encoder: 4-layer CNN on 128x128 RGB frames -> 192-dim embedding
Predictor: 6-layer causal Transformer (16 heads) -> predicted next embedding
State head: Linear(192, 10) -> ball position, velocity, paddle positions
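The card gives the predictor's width (192) and head count (16) but not its context length or attention internals. A minimal sketch, assuming standard multi-head attention (the 192-dim embedding is split evenly across the 16 heads) and a causal mask over a sequence of per-frame embeddings, so the prediction for frame t can only use frames up to t. `head_dim` and `causal_mask` are hypothetical helpers, not code from the repo.

```python
EMBED_DIM = 192  # encoder/predictor embedding size (from the card)
NUM_HEADS = 16   # predictor attention heads (from the card)

def head_dim(embed_dim: int, num_heads: int) -> int:
    """Per-head width under standard multi-head attention (assumption)."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads

def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[t][s] is True when step t may attend to step s, i.e. only s <= t."""
    return [[s <= t for s in range(seq_len)] for t in range(seq_len)]

print(head_dim(EMBED_DIM, NUM_HEADS))  # 12
for row in causal_mask(4):              # 4 is an arbitrary toy context length
    print("".join("x" if ok else "." for ok in row))
```

Under these assumptions each head works on 12 dimensions, and the printed mask is lower-triangular (`x...`, `xx..`, `xxx.`, `xxxx`): no frame attends to a later one.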

The encoder and predictor (13M params) are frozen. Only the state head is trained (1,930 params: 192 × 10 weights + 10 biases).
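The 1,930 figure follows directly from the Linear(192, 10) head. A small check, where `linear_params` is a hypothetical helper and "13M" is taken as an approximate total:

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Trainable parameters of a fully connected layer y = W x + b."""
    weights = in_features * out_features  # W has shape (out_features, in_features)
    return weights + (out_features if bias else 0)

head = linear_params(192, 10)  # state head from the card
print(head)  # 1930
print(f"{head / 13e6:.3%} of the ~13M frozen parameters")
```

So the trainable part is roughly 0.015% of the model; everything the state head knows about the ball and paddles has to already be linearly readable from the frozen 192-dim embedding.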

Files

File	Description
lepong_statehead_occ_aug.pt	Shipping checkpoint - trained with occlusion augmentation
lepong_statehead_frozen.pt	Baseline - trained on unoccluded frames only
lepong_v1.pt	Init checkpoint - encoder + predictor only, no state head
pong_train_30k.npz	Training data - 30K frames (128x128 RGB) + states + actions
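The card does not list the array names inside `pong_train_30k.npz`. An `.npz` file is a zip archive of `.npy` files, so a stdlib-only sketch can list each array's name, dtype and shape by reading just the `.npy` headers instead of guessing keys. `npz_summary` is a hypothetical helper; with NumPy installed, `numpy.load(path).files` gives the names directly.

```python
import ast
import struct
import zipfile

def npz_summary(path: str) -> dict[str, tuple]:
    """Return {array_name: (dtype_descr, shape)} for every array in an .npz,
    reading only the .npy headers (no NumPy needed)."""
    summary = {}
    with zipfile.ZipFile(path) as zf:
        for member in zf.namelist():
            if not member.endswith(".npy"):
                continue
            with zf.open(member) as f:
                if f.read(6) != b"\x93NUMPY":
                    raise ValueError(f"{member} is not a .npy file")
                major, _minor = f.read(2)  # format version bytes
                # v1.x stores the header length in 2 bytes, v2.x/v3.x in 4 bytes
                if major == 1:
                    (header_len,) = struct.unpack("<H", f.read(2))
                else:
                    (header_len,) = struct.unpack("<I", f.read(4))
                # the header is a Python dict literal: descr, fortran_order, shape
                header = ast.literal_eval(f.read(header_len).decode("latin1"))
            summary[member[: -len(".npy")]] = (header["descr"], header["shape"])
    return summary
```

Usage: `npz_summary("pong_train_30k.npz")` returns one entry per stored array (frames, states, actions), which tells you the real key names and shapes before loading anything.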

Results

Metric	Value
ball_y median error (in-distribution)	2.8%
Controller success (in-distribution)	99.3%
Controller success (out-of-distribution)	88.7%
ball_x error change at 40% occlusion (occlusion-augmented checkpoint)	-58%
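The card does not say how frames are occluded during augmentation or evaluation. A minimal sketch, assuming "40% occlusion" means one random square blanked to black that covers about 40% of the 128×128 frame; `occlusion_box` and `occlude` are hypothetical helpers, and the real augmentation may use a different shape, count or fill.

```python
import math
import random

def occlusion_box(size: int, frac: float, rng: random.Random) -> tuple[int, int, int]:
    """Random square (top, left, side) covering about `frac` of a size x size frame."""
    side = round(size * math.sqrt(frac))
    top = rng.randint(0, size - side)
    left = rng.randint(0, size - side)
    return top, left, side

def occlude(frame: list[list], frac: float, rng: random.Random, fill=0) -> list[list]:
    """Copy of `frame` (rows of pixels) with one random square set to `fill`."""
    top, left, side = occlusion_box(len(frame), frac, rng)
    out = [row[:] for row in frame]  # copy rows so the input frame is untouched
    for y in range(top, top + side):
        for x in range(left, left + side):
            out[y][x] = fill
    return out

rng = random.Random(0)
frame = [[(255, 255, 255)] * 128 for _ in range(128)]  # toy all-white RGB frame
masked = occlude(frame, 0.40, rng, fill=(0, 0, 0))
covered = sum(px == (0, 0, 0) for row in masked for px in row)
print(f"{covered / 128**2:.1%} of pixels occluded")  # 40.0%
```

At 40% the square side is 81 px (6,561 of 16,384 pixels), large enough that the ball is often hidden, which is the situation the occlusion-augmented state head is meant to handle.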

Demo

Live demo: sotoalt.dev/experiments/lepong.html

Code: github.com/SotoAlt/lepong

License

MIT
