# WorldGuard

**JEPA-Inspired Video World Model for Unsupervised CCTV Anomaly Detection**

Inspired by Yann LeCun's AMI Labs world model thesis, implemented locally on a single RTX A4000.
## Results

### Benchmark Performance (Frame-level AUROC)
| Dataset | Run 1 (ShanghaiTech only) | Run 2 (Balanced) | Change (AUROC pts) |
|---|---|---|---|
| UCSD Ped2 | 0.545 | 0.788 | +24.3 |
| ShanghaiTech | 0.639 | 0.614 | −2.5 |
Run 1 trained on ShanghaiTech data only. Run 2 retrained on a balanced mix of UCSD Ped2 and ShanghaiTech clips: UCSD Ped2 jumped dramatically once the model had seen that scene during training, while ShanghaiTech dropped slightly, a normal trade-off when splitting training capacity between two datasets.
### Context vs Published Work
| Method | UCSD Ped2 AUROC |
|---|---|
| Supervised SOTA | ~0.96–0.99 |
| Unsupervised / reconstruction-based | ~0.82–0.92 |
| WorldGuard (this model, 50 epochs) | 0.788 |
WorldGuard achieves competitive unsupervised performance with zero anomaly labels during training, on a single consumer GPU in 50 epochs.
### To Push Further

- Train for 100+ epochs: the loss was still decreasing at epoch 50
- Add more ShanghaiTech training data (currently 2077 clips; the full set is 330 videos)
## Why This Exists: The World Model Thesis
In March 2026, Yann LeCun's AMI Labs raised $1.03 billion to build AI that goes beyond LLMs. The core thesis: real intelligence predicts abstract representations of future states, not pixels or tokens.
WorldGuard is a direct local implementation of that thesis applied to CCTV surveillance:
> Train a model to predict what should happen next. When reality deviates from prediction, that's an anomaly.
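In code, that principle reduces to comparing predicted and observed latent features. A minimal NumPy sketch of the idea (function names and the threshold are illustrative, not the repo's API):

```python
import numpy as np

def anomaly_score(z_pred: np.ndarray, z_actual: np.ndarray) -> float:
    """Mean squared error between predicted and observed latents ([T, D])."""
    return float(np.mean((z_pred - z_actual) ** 2))

def is_anomalous(score: float, threshold: float) -> bool:
    # The threshold is calibrated on normal footage only.
    return score > threshold

# Toy illustration: a clip that matches the prediction closely scores low;
# a clip that deviates from it scores high.
rng = np.random.default_rng(0)
z_pred = rng.normal(size=(8, 384))                   # predicted latents
normal = z_pred + 0.01 * rng.normal(size=(8, 384))   # small deviation
abnormal = z_pred + 1.0 * rng.normal(size=(8, 384))  # large deviation
assert anomaly_score(z_pred, normal) < anomaly_score(z_pred, abnormal)
```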
### Why Not Just Train a Classifier?
| | Supervised Classifier | WorldGuard (World Model) |
|---|---|---|
| Labels needed | Hundreds per class | Zero |
| Detects novel anomalies | No (known classes only) | Yes (any deviation) |
| Generalizes to new cameras | Poor | Better (learns scene structure) |
| Failure mode | Unknown unknowns invisible | All deviations flagged |
## How It Works
```
CCTV Video Stream
        │
        ▼
┌─────────────────────┐
│   Frame Extractor   │  16 frames @ 224×224, stride 2 (~1 s @ 30 fps)
└─────────────────────┘
        │
        ▼
┌────────────────────────────────────────────┐
│              JEPA WORLD MODEL              │
│                                            │
│  Context Encoder (ViT-S/16)                │
│      → z_ctx [B, T_ctx, D]                 │
│                                            │
│  Predictor (Transformer, 4 layers)         │
│      → z_pred [B, T_future, D]             │
│                                            │
│  Target Encoder (EMA, no gradients)        │
│      → z_tgt [B, T_future, D]              │
│                                            │
│  Loss = L2(z_pred, z_tgt)                  │
│      → latent space only, never pixels     │
└────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────┐
│  Anomaly Scorer     │  mean prediction error per clip
│  + Spatial Heatmap  │  per-patch error → 224×224 overlay
└─────────────────────┘
```
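The training loop in the diagram can be sketched in a few lines of PyTorch. This is a simplified stand-in, not the repo's actual code: small linear/MLP layers replace the ViT-S/16 encoders and the 4-layer Transformer predictor, the `train_step` helper and the momentum value 0.996 are assumptions, and temporal pooling is reduced to a mean:

```python
import copy
import torch
import torch.nn as nn

D = 384  # latent dim, matching the predictor width in the README

context_encoder = nn.Linear(768, D)  # stand-in for the ViT-S/16 encoder
predictor = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False          # EMA copy: receives no gradients

opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def train_step(ctx_frames, future_frames, momentum=0.996):
    z_ctx = context_encoder(ctx_frames)                   # [B, T_ctx, D]
    z_pred = predictor(z_ctx.mean(dim=1, keepdim=True))   # predict future latents
    with torch.no_grad():
        z_tgt = target_encoder(future_frames).mean(dim=1, keepdim=True)
    loss = nn.functional.mse_loss(z_pred, z_tgt)          # L2 in latent space only
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update of the target encoder from the context encoder
    with torch.no_grad():
        for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
            pt.mul_(momentum).add_(pc, alpha=1 - momentum)
    return loss.item()

loss = train_step(torch.randn(2, 8, 768), torch.randn(2, 8, 768))
```

Note that gradients flow only through the context encoder and predictor; the target encoder drifts slowly behind via the EMA update, which is what prevents the latent space from collapsing to a trivial constant.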
The model never trains on anomalies. It learns the statistical structure of normality. At inference, anything that breaks that structure produces a spike in latent prediction error.
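The spatial heatmap in the final stage falls out of the same latent errors: a ViT-S/16 over 224×224 input yields a 14×14 grid of 16×16 patches, whose per-patch errors can be upsampled back to frame resolution. A nearest-neighbour sketch in NumPy (the repo may use smoother interpolation):

```python
import numpy as np

# Stand-in for per-patch latent prediction errors: 224 / 16 = 14 patches per side.
rng = np.random.default_rng(0)
patch_err = rng.random((14, 14))

# Nearest-neighbour upsample: repeat each patch's error over its 16x16 pixels.
heatmap = np.kron(patch_err, np.ones((16, 16)))  # -> (224, 224)

# Normalize to [0, 1] so the heatmap can be alpha-blended over the frame.
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
assert heatmap.shape == (224, 224)
```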
## Architecture
| Component | Design | VRAM |
|---|---|---|
| Context Encoder | VideoMAE-pretrained ViT-S/16 (21M params) | ~1.8 GB |
| Target Encoder | EMA copy (frozen, no gradients) | ~1.8 GB |
| Predictor | 4-layer Transformer (D=384, heads=6) | ~0.4 GB |
| Activations | Batch=16, 16 frames @ 224×224 | ~8–10 GB |
| **Total** | | ~13–14 GB |
## Quickstart
**1. Install dependencies**

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
**2. Download checkpoint**

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="irfanalii/worldguard",
    filename="checkpoints/train_default_epoch050_val0.0191.pt",
)
```
**3. Score a video**

```bash
python inference/score_video.py \
    --video /path/to/footage.mp4 \
    --checkpoint checkpoints/train_default_epoch050_val0.0191.pt \
    --camera-id cam01
```
**4. Evaluate on UCSD Ped2**

```bash
python eval/eval_roc.py \
    --checkpoint checkpoints/train_default_epoch050_val0.0191.pt \
    --test-dir data/ucsd_ped2/ \
    --output outputs/eval/
```
**5. Retrain on your own footage**

```bash
# Extract clips from your normal CCTV recordings
python data/extract_clips.py \
    --video /path/to/cam01.mp4 \
    --output-dir data/train \
    --camera-id cam01

# Train
python training/train.py --config configs/train_default.yaml

# Calibrate the anomaly threshold
python training/calibrate.py \
    --checkpoint checkpoints/best.pt \
    --val-dir data/val \
    --camera-id cam01
```
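One common way to calibrate such a threshold is to take a high percentile of the anomaly scores on known-normal validation clips, so that only a small fraction of normal footage would ever be flagged. A sketch of that idea (the actual `training/calibrate.py` logic may differ):

```python
import numpy as np

def calibrate_threshold(normal_scores, percentile=99.0):
    """Threshold such that roughly 1% of known-normal clips would be flagged."""
    return float(np.percentile(normal_scores, percentile))

# Stand-in for per-clip anomaly scores on a normal-only validation set.
scores = np.random.default_rng(1).gamma(2.0, 0.01, size=500)
threshold = calibrate_threshold(scores)
assert (scores > threshold).mean() <= 0.02
```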
## References
- Assran et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- Bardes et al. (2024). V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence
## License
MIT
Built by @Irfanalee · Source Code · Inspired by LeCun's world model thesis · Runs entirely on local hardware