---
license: mit
tags:
- jepa
- world-model
- reinforcement-learning
- audio
- pytorch
library_name: pytorch
---

# silent — JEPA world model that plays predator by listening

A 13M-parameter Joint Embedding Predictive Architecture (JEPA) trained to
predict next-step audio embeddings on a custom predator-prey environment.
The predator senses the world through four cardioid microphones (N/E/S/W)
on its body and chooses thrust + sonar ping actions to hunt the player.

- **Live demo**: https://sotoalt.dev/experiments/silent.html
- **Code**: https://github.com/SotoAlt/silent
- **Research journal**: https://github.com/SotoAlt/silent/blob/main/docs/JOURNAL.md

## Architecture

- ViT-Tiny encoder (4-channel input, trained from scratch, ~6M params)
- Linear action encoder (frameskip x 3 -> 192)
- 6-layer AR causal transformer predictor with AdaLN-zero conditioning
- 192 -> 2048 -> 192 projector MLP with BatchNorm
- SIGReg regularizer on projected embeddings
- Jointly trained state head MLP (192 -> 256 -> 256 -> 8) at lambda=10

Total: ~13M params. Runs at ~10 Hz on a single shared vCPU.
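
To make the layer sizes above concrete, here is a minimal PyTorch sketch of the three small MLP components (action encoder, projector, state head). It is not the repository's code: the class names, the `FRAMESKIP` value, the GELU activations, and the placement of BatchNorm on the hidden layer are assumptions made for illustration. Only the dimensions come from the list above.

```python
import torch
import torch.nn as nn

EMBED_DIM = 192   # embedding width from the architecture list
ACTION_DIM = 3    # per-step action size, from "frameskip x 3"
FRAMESKIP = 4     # hypothetical value; the card does not state it
STATE_DIM = 8     # state head output size


class Projector(nn.Module):
    """192 -> 2048 -> 192 MLP; BatchNorm placement is an assumption."""

    def __init__(self, dim: int = EMBED_DIM, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(hidden, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class StateHead(nn.Module):
    """192 -> 256 -> 256 -> 8 MLP that regresses a state vector."""

    def __init__(self, dim: int = EMBED_DIM, hidden: int = 256, out: int = STATE_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, out),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


# Linear action encoder: a flattened block of frameskip actions -> one embedding.
action_encoder = nn.Linear(FRAMESKIP * ACTION_DIM, EMBED_DIM)

projector = Projector().eval()   # eval() so BatchNorm uses running statistics
state_head = StateHead().eval()

with torch.no_grad():
    z = torch.randn(5, EMBED_DIM)                      # stand-in for encoder output
    a = torch.randn(5, FRAMESKIP * ACTION_DIM)         # stand-in for an action block
    print(projector(z).shape, state_head(z).shape, action_encoder(a).shape)

for name, module in [("projector", projector), ("state_head", state_head)]:
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params:,} parameters")
```

Under these assumptions the projector and state head together account for under 1M of the ~13M parameters; the encoder and predictor make up the rest.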

## Files

| File                                | Purpose                                             |
|-------------------------------------|-----------------------------------------------------|
| `silent_v1_3e_ep030.pt`             | Shipping checkpoint: joint DexWM, lambda=10         |
| `3e_ep030_head_uniform.pt`          | Post-hoc state head for planner CEM cost            |
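
The table describes the second checkpoint as supplying the cost for a CEM planner. For readers unfamiliar with the cross-entropy method, the following is a generic sketch on a toy quadratic cost. It is not the repository's planner: `cem_plan`, `toy_cost`, the population size, and the elite count are all hypothetical, and in the real system the cost would presumably be computed from state-head predictions on imagined rollouts rather than from a fixed target.

```python
import torch


def cem_plan(cost_fn, horizon, action_dim, iters=10, pop=128, n_elite=16, seed=0):
    """Generic cross-entropy method over action sequences (illustrative only)."""
    gen = torch.Generator().manual_seed(seed)
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        # Sample a population of candidate action sequences around the current mean.
        noise = torch.randn(pop, horizon, action_dim, generator=gen)
        candidates = mean + std * noise
        costs = cost_fn(candidates)  # shape: (pop,)
        # Keep the lowest-cost candidates and refit the sampling distribution.
        elite_idx = costs.topk(n_elite, largest=False).indices
        elites = candidates[elite_idx]
        mean = elites.mean(dim=0)
        std = elites.std(dim=0) + 1e-6  # small floor avoids a collapsed distribution
    return mean


# Toy stand-in for a learned cost: squared distance to a fixed target action.
target = torch.tensor([0.5, -0.5, 0.0])


def toy_cost(actions: torch.Tensor) -> torch.Tensor:
    # actions: (pop, horizon, action_dim) -> one scalar cost per candidate
    return ((actions - target) ** 2).sum(dim=(1, 2))


plan = cem_plan(toy_cost, horizon=4, action_dim=3)
print("planned actions:", plan)
print("final cost:", toy_cost(plan.unsqueeze(0)).item())
```

In a receding-horizon setup, only the first action of `plan` would typically be executed before replanning from the next observation.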

## Quick start

```bash
pip install torch torchvision timm einops fastapi uvicorn websockets \
    librosa pymunk h5py pygame scipy

# Clone the code
git clone https://github.com/SotoAlt/silent.git
cd silent

# Download checkpoints into the repo so the paths below resolve
huggingface-cli download sotoalt/silent --local-dir checkpoints/

# Run the inference server
python -m world_model.infer_silent_env \
    --jepa-ckpt checkpoints/silent_v1_3e_ep030.pt \
    --jepa-head checkpoints/3e_ep030_head_uniform.pt \
    --host 0.0.0.0 --port 8801

# Open http://localhost:8801/ in a browser. WASD to move, space to voice.
```

## Training

The full pipeline (data generation, pure-LeWM smoke test, preflight v2
probe, joint DexWM validation gate, full 100-epoch run, post-hoc head,
ship audit) is documented in the
[research journal](https://github.com/SotoAlt/silent/blob/main/docs/JOURNAL.md)
and the [README](https://github.com/SotoAlt/silent#training-from-scratch).

## Related work

- LeWM (Maes, Le Lidec, Scieur, LeCun, Balestriero, 2026) - `arXiv:2603.19312`
- DexWM - `arXiv:2512.13644` (the joint state-head technique)
- V-JEPA 2-AC (FAIR, 2025)

## License

MIT