lepong

A 13M-parameter JEPA world model that plays Pong by watching pixels.

Architecture

  • Encoder: 4-layer CNN on 128x128 RGB frames -> 192-dim embedding
  • Predictor: 6-layer causal Transformer (16 heads) -> predicted next embedding
  • State head: Linear(192, 10) -> ball position, velocity, paddle positions
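The dimensions listed above can be collected into a small configuration object as a sanity check. This is only a sketch: the class and field names are hypothetical, and it assumes the predictor runs at the same 192-dim width as the encoder output, which the list implies but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LePongConfig:
    """Hypothetical container; values are copied from the architecture list."""

    frame_size: int = 128       # frames are 128x128
    frame_channels: int = 3     # RGB
    encoder_layers: int = 4     # CNN depth
    embed_dim: int = 192        # encoder output width
    predictor_layers: int = 6   # causal Transformer depth
    predictor_heads: int = 16   # attention heads
    state_dim: int = 10         # state head outputs

    @property
    def head_dim(self) -> int:
        # Assumption: the predictor width equals embed_dim, so each
        # attention head gets embed_dim / predictor_heads channels.
        if self.embed_dim % self.predictor_heads != 0:
            raise ValueError("embed_dim must be divisible by predictor_heads")
        return self.embed_dim // self.predictor_heads

    @property
    def values_per_frame(self) -> int:
        # Raw input size of one frame before encoding.
        return self.frame_size * self.frame_size * self.frame_channels


cfg = LePongConfig()
print(cfg.head_dim)          # 12
print(cfg.values_per_frame)  # 49152
```

Under that assumption each of the 16 heads attends over 12 channels, and the encoder compresses 49,152 input values per frame into 192.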

The encoder and predictor are frozen (13M parameters); only the state head is trained (1,930 parameters).
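The 1,930 figure follows directly from the shape of the state head. A minimal sketch of the arithmetic, assuming a standard linear layer with a bias term (the helper name is illustrative, not part of the repository):

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of trainable parameters in a dense (linear) layer."""
    weights = in_features * out_features       # one weight per input/output pair
    biases = out_features if bias else 0       # one bias per output
    return weights + biases


# State head from the architecture list: Linear(192, 10)
print(linear_param_count(192, 10))  # 1930
```

The count only matches the stated 1,930 when the bias is included (192 × 10 + 10); without it the layer would have 1,920 parameters.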

Files

File                         Description
lepong_statehead_occ_aug.pt  Shipping checkpoint - trained with occlusion augmentation
lepong_statehead_frozen.pt   Baseline - trained on unoccluded frames only
lepong_v1.pt                 Init checkpoint - encoder + predictor only, no state head
pong_train_30k.npz           Training data - 30K frames (128x128 RGB) + states + actions
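The array names stored in pong_train_30k.npz are not documented here, so it is worth listing them before writing a loader. An .npz file is a zip archive with one .npy member per array, which means the names can be read with the standard library alone. The helper below is a sketch with a hypothetical name; with NumPy installed, `numpy.load(path).files` returns the same list.

```python
import zipfile
from typing import List


def list_npz_arrays(path: str) -> List[str]:
    """Return the array names stored in an .npz archive without loading data."""
    with zipfile.ZipFile(path) as archive:
        names = []
        for member in archive.namelist():
            # Each array is stored as "<name>.npy" inside the zip.
            if member.endswith(".npy"):
                names.append(member[: -len(".npy")])
        return names


# Usage (assumes the file has been downloaded to the working directory):
#   print(list_npz_arrays("pong_train_30k.npz"))
```

Only the archive index is read, so this stays fast even though the file holds 30K frames.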

Results

Metric                                           Value
ball_y median error (in-distribution)            2.8%
Controller success (in-distribution)             99.3%
Controller success (out-of-distribution)         88.7%
ball_x improvement at 40% occlusion (augmented)  -58%

Demo

Live demo: sotoalt.dev/experiments/lepong.html

Code: github.com/SotoAlt/lepong

License

MIT
