lepong
A 13M-parameter JEPA world model that plays Pong by watching pixels.
Architecture
- Encoder: 4-layer CNN on 128x128 RGB frames -> 192-dim embedding
- Predictor: 6-layer causal Transformer (16 heads) -> predicted next embedding
- State head: Linear(192, 10) -> ball position, velocity, paddle positions
The encoder and predictor are frozen (13M parameters combined). Only the state head is trained (1,930 parameters).
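The 1,930 figure can be checked by hand. The sketch below assumes the state head is a single dense layer with a bias term, which is what `Linear(192, 10)` usually means. The helper name `linear_param_count` is hypothetical and is not part of the lepong code.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Count trainable scalars in a dense layer: one weight per input/output pair, plus optional biases."""
    weights = in_features * out_features
    biases = out_features if bias else 0
    return weights + biases


EMBED_DIM = 192  # embedding width, from the architecture list above
STATE_DIM = 10   # state head outputs, from the architecture list above

# Assumption: the bias is enabled (192 * 10 weights + 10 biases).
state_head_params = linear_param_count(EMBED_DIM, STATE_DIM)
print(state_head_params)  # 1930
```

With the bias disabled the count would be 1,920, so the stated 1,930 is consistent with a biased linear layer.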
Files
| File | Description |
|---|---|
| lepong_statehead_occ_aug.pt | Shipping checkpoint - trained with occlusion augmentation |
| lepong_statehead_frozen.pt | Baseline - trained on unoccluded frames only |
| lepong_v1.pt | Init checkpoint - encoder + predictor only, no state head |
| pong_train_30k.npz | Training data - 30K frames (128x128 RGB) + states + actions |
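As a rough guide to the memory needed to hold the training frames, the sketch below estimates their uncompressed size. It assumes exactly 30,000 frames stored as 8-bit values (one byte per channel). Neither detail is confirmed here, and the on-disk `.npz` may be smaller if it was saved compressed. The helper name `raw_frame_bytes` is hypothetical.

```python
def raw_frame_bytes(num_frames: int, height: int, width: int,
                    channels: int, bytes_per_value: int = 1) -> int:
    """Uncompressed size in bytes of a stack of image frames."""
    return num_frames * height * width * channels * bytes_per_value


# Assumptions: 30,000 frames, uint8 storage (1 byte per channel value).
total_bytes = raw_frame_bytes(30_000, 128, 128, 3)
print(f"{total_bytes / 1e9:.2f} GB")  # 1.47 GB
```

If the frames were stored as 32-bit floats instead, the same calculation with `bytes_per_value=4` gives about four times that size.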
Results
| Metric | Value |
|---|---|
| ball_y median error (in-distribution) | 2.8% |
| Controller success (in-distribution) | 99.3% |
| Controller success (out-of-distribution, OOD) | 88.7% |
| ball_x error change at 40% occlusion (augmented) | -58% |
Demo
Live demo: sotoalt.dev/experiments/lepong.html
Code: github.com/SotoAlt/lepong
License
MIT