πŸ›‘οΈ WorldGuard

JEPA-Inspired Video World Model for Unsupervised CCTV Anomaly Detection

Inspired by Yann LeCun's AMI Labs world model thesis β€” implemented locally on a single RTX A4000




πŸ“Š Results

Benchmark Performance (Frame-level AUROC)

| Dataset | Run 1 (ShanghaiTech only) | Run 2 (Balanced) | Change (AUROC) |
|---|---|---|---|
| UCSD Ped2 | 0.545 | 0.788 | +0.243 |
| ShanghaiTech | 0.639 | 0.614 | -0.025 |

Run 1 trained on ShanghaiTech data only. Run 2 retrained on balanced UCSD Ped2 + ShanghaiTech clips β€” UCSD Ped2 jumped dramatically once the model saw that scene during training. ShanghaiTech dropped slightly, a normal trade-off when splitting training capacity between two datasets.
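Frame-level AUROC measures how well per-clip anomaly scores rank anomalous frames above normal ones. A minimal NumPy-only sketch of the metric; the scores and labels below are made-up illustrations, not dataset values:

```python
import numpy as np

def frame_auroc(scores, labels):
    """AUROC via the Mann-Whitney U formulation: the probability that a
    randomly chosen anomalous frame scores higher than a normal one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]   # anomalous frames
    neg = scores[labels == 0]   # normal frames
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as 0.5
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

scores = [0.02, 0.03, 0.19, 0.21, 0.04, 0.18]  # higher = more anomalous
labels = [0, 0, 1, 1, 0, 1]                    # 1 = ground-truth anomaly
print(frame_auroc(scores, labels))  # -> 1.0 (every anomaly outscores every normal)
```

A score of 0.5 means the scores carry no ranking information; 1.0 means perfect separation.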

Context vs Published Work

| Method | UCSD Ped2 AUROC |
|---|---|
| Supervised SOTA | ~0.96–0.99 |
| Unsupervised / reconstruction-based | ~0.82–0.92 |
| WorldGuard (this model, 50 epochs) | 0.788 |

WorldGuard achieves competitive unsupervised performance with zero anomaly labels during training, after only 50 epochs on a single RTX A4000.

To Push Further

  • Train 100+ epochs β€” loss was still decreasing at epoch 50
  • Add more ShanghaiTech training data (currently 2077 clips; full set = 330 videos)

🧠 Why This Exists β€” The World Model Thesis

In March 2026, Yann LeCun's AMI Labs raised $1.03 billion to build AI that goes beyond LLMs. The core thesis: real intelligence predicts abstract representations of future states, not pixels or tokens.

WorldGuard is a direct local implementation of that thesis applied to CCTV surveillance:

Train a model to predict what should happen next. When reality deviates from prediction β€” that's an anomaly.

Why Not Just Train a Classifier?

| | Supervised Classifier | WorldGuard (World Model) |
|---|---|---|
| Labels needed | Hundreds per class | Zero |
| Detects novel anomalies | No β€” known classes only | Yes β€” any deviation |
| Generalizes to new cameras | Poor | Better β€” learns scene structure |
| Failure mode | Unknown unknowns invisible | All deviations flagged |

βš™οΈ How It Works

```
CCTV Video Stream
      β”‚
      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Frame Extractor   β”‚  16 frames @ 224Γ—224, stride 2 (~1s @ 30fps)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β”‚
      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           JEPA WORLD MODEL               β”‚
β”‚                                          β”‚
β”‚  Context Encoder (ViT-S/16)              β”‚
β”‚       β†’ z_ctx  [B, T_ctx, D]             β”‚
β”‚                                          β”‚
β”‚  Predictor (Transformer, 4 layers)       β”‚
β”‚       β†’ z_pred [B, T_future, D]          β”‚
β”‚                                          β”‚
β”‚  Target Encoder (EMA β€” no gradients)     β”‚
β”‚       β†’ z_tgt  [B, T_future, D]          β”‚
β”‚                                          β”‚
β”‚  Loss = L2(z_pred, z_tgt)                β”‚
β”‚  ↑ latent space only β€” never pixels      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β”‚
      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Anomaly Scorer    β”‚  mean prediction error per clip
β”‚   + Spatial Heatmap β”‚  per-patch error β†’ 224Γ—224 overlay
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

The model never trains on anomalies. It learns the statistical structure of normality. At inference, anything that breaks that structure produces a spike in latent prediction error.
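The pipeline above can be sketched with toy stand-ins for the three modules. In the real model the context encoder is a ViT-S/16, the target encoder is its EMA copy, and the predictor is a 4-layer Transformer; the `nn.Linear` stand-ins and the 12/4 context-future split below are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn as nn

D = 384                 # latent dim (ViT-S width)
T_CTX, T_FUT = 12, 4    # context/future split of a 16-frame clip (assumed)

# Toy stand-ins for the real encoder and predictor modules.
context_encoder = nn.Linear(D, D)
predictor = nn.Linear(D, D)
target_encoder = nn.Linear(D, D)
for p in target_encoder.parameters():
    p.requires_grad = False  # EMA copy: never receives gradients

def anomaly_score(clip_latents: torch.Tensor) -> float:
    """clip_latents: [16, D] per-frame features for one clip.
    Returns the mean L2 error between predicted and target latents."""
    z_ctx = context_encoder(clip_latents[:T_CTX])
    # Predict future latents from pooled context -- latent space only,
    # pixels are never reconstructed.
    z_pred = predictor(z_ctx.mean(dim=0, keepdim=True).expand(T_FUT, D))
    with torch.no_grad():
        z_tgt = target_encoder(clip_latents[T_CTX:])
    return (z_pred - z_tgt).pow(2).mean().item()

score = anomaly_score(torch.randn(16, D))  # higher = more surprising clip
```

Averaging the squared error per patch instead of globally, then reshaping to the ViT patch grid and upsampling to 224Γ—224, yields the spatial heatmap.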


πŸ—οΈ Architecture

| Component | Design | VRAM |
|---|---|---|
| Context Encoder | VideoMAE-pretrained ViT-S/16 (21M params) | ~1.8 GB |
| Target Encoder | EMA copy β€” frozen, no gradients | ~1.8 GB |
| Predictor | 4-layer Transformer (D=384, heads=6) | ~0.4 GB |
| Activations | batch=16, 16 frames @ 224Γ—224 | ~8–10 GB |
| **Total** | | **~13–14 GB** |
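The target encoder is kept as an exponential moving average of the context encoder's weights rather than being trained by backprop. A minimal sketch of that update; the momentum value 0.996 is an assumed placeholder, not taken from the training config:

```python
import copy
import torch
import torch.nn as nn

context_encoder = nn.Linear(8, 8)
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False  # frozen branch: updated only via EMA

@torch.no_grad()
def ema_update(target: nn.Module, context: nn.Module, momentum: float = 0.996):
    # target <- m * target + (1 - m) * context, parameter by parameter
    for p_t, p_c in zip(target.parameters(), context.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1 - momentum)

ema_update(target_encoder, context_encoder)  # no-op here: weights still identical
```

Keeping the target branch slow-moving and gradient-free is the standard JEPA-style trick to stop the latent objective from collapsing to a trivial constant.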

πŸš€ Quickstart

1. Install dependencies

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

2. Download checkpoint

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="irfanalii/worldguard",
    filename="checkpoints/train_default_epoch050_val0.0191.pt"
)
```

3. Score a video

```bash
python inference/score_video.py \
  --video /path/to/footage.mp4 \
  --checkpoint checkpoints/train_default_epoch050_val0.0191.pt \
  --camera-id cam01
```

4. Evaluate on UCSD Ped2

```bash
python eval/eval_roc.py \
  --checkpoint checkpoints/train_default_epoch050_val0.0191.pt \
  --test-dir data/ucsd_ped2/ \
  --output outputs/eval/
```

5. Retrain on your own footage

```bash
# Extract clips from your normal CCTV recordings
python data/extract_clips.py \
  --video /path/to/cam01.mp4 \
  --output-dir data/train \
  --camera-id cam01

# Train
python training/train.py --config configs/train_default.yaml

# Calibrate threshold
python training/calibrate.py \
  --checkpoint checkpoints/best.pt \
  --val-dir data/val \
  --camera-id cam01
```
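The exact method used by `training/calibrate.py` isn't documented here; one common approach is to set the alert threshold at a high percentile of prediction errors on normal-only validation clips, sketched below (the 99th percentile and the error values are assumptions):

```python
import numpy as np

def calibrate_threshold(val_errors, percentile: float = 99.0) -> float:
    """Pick an alert threshold from prediction errors measured on
    normal-only validation clips: flag anything scoring above it."""
    return float(np.percentile(val_errors, percentile))

# Hypothetical per-clip errors from one camera's normal footage.
val_errors = np.array([0.012, 0.015, 0.018, 0.014, 0.016, 0.020])
threshold = calibrate_threshold(val_errors)
print(threshold)  # clips scoring above this would raise an alert
```

Calibrating per camera (as the `--camera-id` flag suggests) makes sense because the baseline prediction error differs from scene to scene.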

πŸ“„ License

MIT


Built by @Irfanalee Β· Source Code Β· Inspired by LeCun's world model thesis Β· Runs entirely on local hardware
