DeepFake Eye-Blink Detection: Cursor AI Build Plan

Project Overview

An Enhanced Eye-Blinking LRCN (Long-term Recurrent Convolutional Network) for DeepFake detection using Attentive Adversarial Training (AAT) with a Vision Transformer (ViT) backbone. The research is by Alina Chikwado Godsaves, under the supervision of Mr. Akanji.

The goal is to detect deepfake videos by analyzing unnatural eye-blinking patterns and fine-grained ocular artifacts (eyelid dynamics, pupil reflections) using a hybrid CNN/LSTM + ViT model hardened with adversarial training.

Python version: 3.11 (the pinned torch==2.2.2 has no wheels for Python 3.13)


What Is Already Scaffolded (Do NOT Recreate)

All files below exist but need to be verified, completed, and wired together:

| Area | Files | Status |
|---|---|---|
| Config system | configs/base.yaml, configs/model/lrcn_vit.yaml, configs/train/aat_pgd.yaml | ✅ Exists |
| Data pipeline | src/data/build_metadata.py, src/data/extract_frames.py, src/data/extract_eye_sequences.py, src/data/dataset.py | ✅ Exists, verify |
| Model | src/models/backbones.py, src/models/lrcn_vit.py | ✅ Exists, verify |
| Training | src/train/train.py, src/train/adversarial.py | ✅ Exists, verify |
| Evaluation | src/eval/evaluate.py, src/eval/ablation.py, src/eval/plots.py | ✅ Exists, verify |
| Explainability | src/viz/attention_maps.py | ✅ Exists |
| Scripts | scripts/run_local.sh, scripts/run_cloud.sh | ✅ Exists |
| Docs | docs/reproducibility_checklist.md, docs/results_template.md | ✅ Exists |

Phase 0: Environment & Dependency Fix (FIRST PRIORITY)

Goal: Get a working Python 3.11 venv with all ML/CV deps installed.

Tasks

  • 0.1 Confirm python3.11 is available, or install via pyenv / system package manager
  • 0.2 Create venv: python3.11 -m venv .venv311 && source .venv311/bin/activate
  • 0.3 Pin exact working versions in requirements.txt:
    torch==2.2.2
    torchvision==0.17.2
    timm==0.9.16
    opencv-python-headless==4.9.0.80
    mediapipe==0.10.11
    pandas==2.2.2
    numpy==1.26.4
    scikit-learn==1.4.2
    matplotlib==3.8.4
    seaborn==0.13.2
    tqdm==4.66.4
    grad-cam==1.5.0   # PyPI name of the pytorch_grad_cam package
    Pillow==10.3.0
    pyyaml==6.0.1
    albumentations==1.4.3
    einops==0.7.0
    wandb==0.17.0
    datasets==2.19.0
    huggingface_hub==0.23.0
    av==12.0.0
    
  • 0.4 Run pip install -r requirements.txt inside venv and confirm zero errors
  • 0.5 Smoke-test: python -c "import torch; import timm; import mediapipe; import datasets; print('OK')"
  • 0.6 Update scripts/run_local.sh to activate .venv311 before any python calls
  • 0.7 One-time HuggingFace login (only needed once per machine):
    huggingface-cli login
    # Paste your token from https://huggingface.co/settings/tokens
    # Token needs Read access only
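
The checks in 0.1 and 0.5 can be combined into one preflight script. The sketch below is a minimal example rather than part of the scaffold: the helper names `missing_modules` and `is_supported_python` are hypothetical, and the module list covers only a few of the pinned packages. It uses only the standard library, so it still runs when an install has failed.

```python
import importlib.util
import sys

# Import names for a few of the pinned packages
# (opencv-python-headless is imported as cv2).
REQUIRED = ["torch", "timm", "mediapipe", "datasets", "cv2"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def is_supported_python(version_info=sys.version_info):
    """True when the interpreter is Python 3.11, the version this plan targets."""
    return tuple(version_info[:2]) == (3, 11)


if __name__ == "__main__":
    print("Python 3.11:", is_supported_python())
    print("Missing modules:", missing_modules(REQUIRED) or "none")
```

Running it inside .venv311 before Phase 1 gives a quick signal on whether the interpreter and the core imports are in place.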
    

Phase 1: Dataset via HuggingFace Streaming (NO DOWNLOAD NEEDED)

Goal: Stream FaceForensics++ c23 videos directly from HuggingFace one at a time, extract eye sequences into small .npz files, and discard each video. A video touches disk only as a short-lived temp file that is deleted right after processing; no raw videos are kept.

How Streaming Works

HuggingFace server
    → sends video #1 (temp, ~5MB)
    → MediaPipe extracts eye crops + EAR signal
    → saves one small .npz file per 16-frame sequence (estimated ~50KB each) to data/processed/
    → video is discarded
    → repeat for video #2, #3 ... #400

At the end: several hundred .npz files (one per sequence, several per video) totalling an estimated ~100–300MB; confirm the real size with du in step 1.3. Zero raw videos on disk.
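
The size estimates above can be sanity-checked with a little arithmetic. The sketch below is illustrative and not part of the scaffold: the function names are hypothetical, the 15-second clip length is an assumed example, and it assumes a face is detected in every sampled frame. It computes the uncompressed size of one frame stack and the number of non-overlapping sequences a clip yields at ~10 fps.

```python
def sequence_nbytes(seq_len=16, height=224, width=224, channels=3):
    """Uncompressed size in bytes of one uint8 frame stack of shape (T, H, W, C)."""
    return seq_len * height * width * channels


def sequences_per_video(duration_s, sample_fps=10, seq_len=16):
    """Number of non-overlapping sequences from a clip sampled at sample_fps."""
    sampled_frames = int(duration_s * sample_fps)
    return sampled_frames // seq_len


print(sequence_nbytes())          # 2408448 bytes, about 2.3 MiB before compression
print(sequences_per_video(15.0))  # 9 sequences for a hypothetical 15 s clip
```

The on-disk size depends on how well np.savez_compressed compresses the eye crops, so the du check in step 1.3 remains the authoritative number.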

Dataset

Source: bitmind/FaceForensicsC23 on HuggingFace
URL: https://huggingface.co/datasets/bitmind/FaceForensicsC23
Contents: 7,000 MP4 videos: 1,000 real + 6,000 deepfakes (Deepfakes, Face2Face, FaceShifter, FaceSwap, NeuralTextures, DeepFakeDetection), c23 compression
We use: 200 real (/Real/) + 200 fake (/Deepfakes/) = 400 videos total
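
The subset selection comes down to substring matching on the video path. The sketch below shows that rule in isolation; `label_for_path` is a hypothetical helper and the example paths are made up for illustration, so the real layout should be checked against the dataset before relying on the markers.

```python
REAL_PATH_MARKER = "/Real/"
FAKE_PATH_MARKER = "/Deepfakes/"


def label_for_path(path):
    """Return 0 for real, 1 for the Deepfakes subset, or None for any other subset."""
    if REAL_PATH_MARKER in path:
        return 0
    if FAKE_PATH_MARKER in path:
        return 1
    return None


# Hypothetical paths, shown only to illustrate the matching
print(label_for_path("FaceForensicsC23/Real/000.mp4"))           # 0
print(label_for_path("FaceForensicsC23/Deepfakes/000_003.mp4"))  # 1
print(label_for_path("FaceForensicsC23/Face2Face/000_003.mp4"))  # None
```

Because the markers include the surrounding slashes and are case-sensitive, a folder such as DeepFakeDetection does not match the Deepfakes marker.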

Tasks

  • 1.1 Create src/data/stream_ff_dataset.py, a NEW script that replaces the old download-based build_metadata.py + extract_frames.py flow:
"""
Stream FaceForensics++ c23 from HuggingFace.
Fetches one video at a time, writes it to a short-lived temp file for OpenCV,
extracts eye sequences, saves .npz files, then deletes the temp file.
No raw videos are kept on disk.

Usage:
    python -m src.data.stream_ff_dataset \
        --out-root data/processed \
        --num-real 200 \
        --num-fake 200
"""
import io, tempfile, os, csv
import numpy as np
import cv2
import mediapipe as mp
from datasets import load_dataset
from tqdm import tqdm
from pathlib import Path

HF_DATASET = "bitmind/FaceForensicsC23"
REAL_PATH_MARKER = "/Real/"
FAKE_PATH_MARKER = "/Deepfakes/"   # use only Deepfakes subfolder, not all 6

def compute_ear(landmarks, eye_indices, width=1.0, height=1.0):
    """Compute Eye Aspect Ratio (EAR) from MediaPipe landmarks.

    MediaPipe landmark coordinates are normalized by frame width and height,
    so pass the frame size to measure distances in pixels. Without it, EAR is
    skewed on non-square frames.
    """
    # eye_indices: [p1, p2, p3, p4, p5, p6]
    p = [np.array([landmarks[i].x * width, landmarks[i].y * height]) for i in eye_indices]
    A = np.linalg.norm(p[1] - p[5])  # vertical: p2 to p6
    B = np.linalg.norm(p[2] - p[4])  # vertical: p3 to p5
    C = np.linalg.norm(p[0] - p[3])  # horizontal: p1 to p4
    return (A + B) / (2.0 * C + 1e-6)

def extract_sequences_from_video_bytes(video_bytes, label, video_id, seq_len=16):
    """
    Given raw video bytes, extract non-overlapping eye-region sequences.
    Returns list of dicts: {'frames': (T,H,W,3), 'ear': (T,), 'label': int, 'video_id': str}
    """
    # Write to a temp file so OpenCV can read it
    with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as f:
        f.write(video_bytes)
        tmp_path = f.name

    sequences = []
    face_mesh = mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False, max_num_faces=1, refine_landmarks=True
    )

    # MediaPipe Face Mesh indices ordered [p1..p6] for the EAR formula:
    # eye corner, two upper-lid points, opposite corner, two lower-lid points
    LEFT_EYE = [33, 160, 158, 133, 153, 144]
    RIGHT_EYE = [362, 385, 387, 263, 373, 380]

    cap = cv2.VideoCapture(tmp_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    frame_interval = max(1, int(fps / 10))  # sample at ~10fps

    all_frames, all_ears = [], []
    frame_idx = 0

    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx % frame_interval == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                result = face_mesh.process(rgb)
                if result.multi_face_landmarks:
                    lm = result.multi_face_landmarks[0].landmark
                    h, w = frame.shape[:2]

                    # Compute EAR in pixel space (average of both eyes)
                    left_ear = compute_ear(lm, LEFT_EYE, w, h)
                    right_ear = compute_ear(lm, RIGHT_EYE, w, h)
                    ear = (left_ear + right_ear) / 2.0

                    # Crop eye region: bounding box around both eyes
                    eye_pts = [lm[i] for i in LEFT_EYE + RIGHT_EYE]
                    xs = [int(p.x * w) for p in eye_pts]
                    ys = [int(p.y * h) for p in eye_pts]
                    x1, x2 = max(0, min(xs) - 20), min(w, max(xs) + 20)
                    y1, y2 = max(0, min(ys) - 20), min(h, max(ys) + 20)
                    crop = rgb[y1:y2, x1:x2]

                    if crop.size > 0:
                        crop = cv2.resize(crop, (224, 224))  # ViT input size
                        all_frames.append(crop)
                        all_ears.append(ear)

            frame_idx += 1
    finally:
        # Release resources and delete the temp file even if decoding fails
        cap.release()
        face_mesh.close()
        os.unlink(tmp_path)

    # Slice into non-overlapping sequences of length seq_len
    for i in range(0, len(all_frames) - seq_len + 1, seq_len):
        frames = np.stack(all_frames[i:i+seq_len]).astype(np.uint8)
        ears = np.array(all_ears[i:i+seq_len], dtype=np.float32)
        sequences.append({
            'frames': frames,
            'ear': ears,
            'label': label,
            'video_id': f"{video_id}_seq{i}"
        })

    return sequences


def stream_and_extract(out_root, num_real=200, num_fake=200, seq_len=16):
    out_root = Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)

    # Stream the dataset; the full archive is never downloaded
    ds = load_dataset(HF_DATASET, streaming=True, split="train")

    real_count, fake_count = 0, 0
    metadata_rows = []

    pbar = tqdm(total=num_real + num_fake, desc="Streaming videos")

    for item in ds:
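        # Assumes each streamed record stores the file path under item['video']['path'];
        # confirm the field names against the dataset schema before a full run.
        video_field = item.get('video')
        if isinstance(video_field, dict):
            video_path_str = str(video_field.get('path') or '')
        else:
            video_path_str = str(video_field or '')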

        is_real = REAL_PATH_MARKER in video_path_str and real_count < num_real
        is_fake = FAKE_PATH_MARKER in video_path_str and fake_count < num_fake

        if not is_real and not is_fake:
            continue

        label = 0 if is_real else 1
        video_id = Path(video_path_str).stem

        # item['video'] is a dict with 'bytes' key when streaming
        video_bytes = item['video']['bytes'] if isinstance(item['video'], dict) else None
        if video_bytes is None:
            continue

        sequences = extract_sequences_from_video_bytes(
            video_bytes, label, video_id, seq_len
        )

        for seq in sequences:
            npz_name = f"{seq['video_id']}.npz"
            npz_path = out_root / npz_name
            np.savez_compressed(
                npz_path,
                frames=seq['frames'],
                ear=seq['ear'],
                label=np.array(seq['label']),
                video_id=np.array(seq['video_id'])
            )
            metadata_rows.append({
                'npz_path': str(npz_path),
                'label': label,
                'video_id': video_id,
                'split': 'train'  # will be reassigned below
            })

        if is_real:
            real_count += 1
        else:
            fake_count += 1

        pbar.update(1)

        if real_count >= num_real and fake_count >= num_fake:
            break

    pbar.close()

    # Assign splits: 70% train, 15% val, 15% test (by video_id, not sequence)
    # Sort, then shuffle with a fixed seed so the split is reproducible across runs
    unique_ids = sorted({r['video_id'] for r in metadata_rows})
    np.random.default_rng(42).shuffle(unique_ids)
    n = len(unique_ids)
    train_ids = set(unique_ids[:int(0.7 * n)])
    val_ids = set(unique_ids[int(0.7 * n):int(0.85 * n)])

    for row in metadata_rows:
        if row['video_id'] in train_ids:
            row['split'] = 'train'
        elif row['video_id'] in val_ids:
            row['split'] = 'val'
        else:
            row['split'] = 'test'

    # Write metadata CSV
    csv_path = Path('data/metadata.csv')
    csv_path.parent.mkdir(exist_ok=True)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['npz_path', 'label', 'video_id', 'split'])
        writer.writeheader()
        writer.writerows(metadata_rows)

    print(f"\nDone! {real_count} real + {fake_count} fake videos processed.")
    print(f"Total sequences: {len(metadata_rows)}")
    print(f"Metadata written to: {csv_path}")
    print(f"Sequences saved to: {out_root}")


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--out-root', default='data/processed')
    parser.add_argument('--num-real', type=int, default=200)
    parser.add_argument('--num-fake', type=int, default=200)
    parser.add_argument('--seq-len', type=int, default=16)
    args = parser.parse_args()
    stream_and_extract(args.out_root, args.num_real, args.num_fake, args.seq_len)
  • 1.2 Run the streaming script:

    source .venv311/bin/activate
    python -m src.data.stream_ff_dataset \
      --out-root data/processed \
      --num-real 200 \
      --num-fake 200
    

    This will run for ~20–60 minutes depending on internet speed. It streams each video, processes it, saves a tiny .npz, and moves on. Your terminal will show a progress bar.

  • 1.3 When done, verify output:

    ls data/processed/ | wc -l      # should be several hundred .npz files
    du -sh data/processed/           # should be ~100-300MB total
    python -c "
    import numpy as np
    d = np.load('data/processed/' + __import__('os').listdir('data/processed')[0], allow_pickle=True)
    print('frames:', d['frames'].shape)   # expect (16, 224, 224, 3)
    print('ear:', d['ear'].shape)         # expect (16,)
    print('label:', d['label'])           # expect 0 or 1
    "
    
  • 1.4 Verify data/metadata.csv has rows with npz_path, label, video_id, split columns and a healthy mix of train/val/test rows

  • 1.5 Update src/data/dataset.py to read from data/metadata.csv (pointing to .npz files) instead of from raw video paths. The __getitem__ contract remains unchanged:

    {'frames': Tensor[T,3,224,224], 'ear': Tensor[T], 'label': int}
    
  • 1.6 Update configs/base.yaml:

    data:
      metadata_csv: data/metadata.csv
      processed_root: data/processed
      seq_len: 16
      img_size: 224
    
  • 1.7 Add to .gitignore:

    data/processed/
    data/raw/
    data/metadata.csv
    outputs/
    *.npz
    *.pt
    .venv311/
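
One detail worth checking in the extracted data is the scale of the EAR values written in step 1.1. MediaPipe landmarks are normalized by frame width and height, so EAR computed directly on them is stretched on non-square frames. The sketch below uses made-up landmark coordinates and hypothetical helper names, with only the standard library, to show the difference for a 1280x720 frame.

```python
import math


def ear(points):
    """Eye Aspect Ratio from six (x, y) points ordered p1..p6."""
    p1, p2, p3, p4, p5, p6 = points
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def to_pixels(points, width, height):
    """Scale normalized [0, 1] coordinates to pixel coordinates."""
    return [(x * width, y * height) for x, y in points]


# Hypothetical normalized landmarks for one open eye
norm = [(0.40, 0.50), (0.42, 0.48), (0.44, 0.48),
        (0.46, 0.50), (0.44, 0.52), (0.42, 0.52)]

print(round(ear(norm), 3))                        # 0.667 in normalized space
print(round(ear(to_pixels(norm, 1280, 720)), 3))  # 0.375 in pixel space
```

The two values differ by the factor width/height, which matters because the EAR < 0.2 threshold used later in the plan presumes pixel-space geometry.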
    

Phase 2: Dataset Loader Verification

Goal: Confirm src/data/dataset.py correctly reads the .npz files produced by streaming.

Tasks

  • 2.1 Open src/data/dataset.py and update it to read from metadata.csv instead of raw video paths. Each row's npz_path points directly to a processed sequence file.
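  • 2.2 Add albumentations augmentations for the training split only. Sample the random parameters once per sequence (for example with A.ReplayCompose) so that all 16 frames of a sequence receive the same flip and colour jitter; independent per-frame augmentation would add artificial temporal flicker: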
    import albumentations as A
    from albumentations.pytorch import ToTensorV2
    
    train_transform = A.Compose([
        A.HorizontalFlip(p=0.5),
        A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, p=0.5),
        A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
        A.ImageCompression(quality_lower=70, quality_upper=100, p=0.3),
        A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ToTensorV2(),
    ])
    
  • 2.3 Smoke-test the DataLoader:
    from src.data.dataset import EyeBlinkDataset
    ds = EyeBlinkDataset('data/metadata.csv', split='train')
    sample = ds[0]
    assert sample['frames'].shape == (16, 3, 224, 224)
    assert sample['ear'].shape == (16,)
    assert sample['label'] in [0, 1]
    print("DataLoader OK")
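
Alongside the loader smoke test, it is worth confirming that the by-video split from Phase 1 holds, meaning no video_id appears in more than one split. The sketch below is a minimal check using only the standard library; `find_split_leaks` is a hypothetical helper and the CSV rows are invented stand-ins for data/metadata.csv.

```python
import csv
import io
from collections import defaultdict


def find_split_leaks(rows):
    """Return the video_ids that appear in more than one split."""
    splits_by_video = defaultdict(set)
    for row in rows:
        splits_by_video[row["video_id"]].add(row["split"])
    return sorted(v for v, splits in splits_by_video.items() if len(splits) > 1)


# Tiny in-memory stand-in for data/metadata.csv (hypothetical rows)
sample_csv = io.StringIO(
    "npz_path,label,video_id,split\n"
    "data/processed/000_seq0.npz,0,000,train\n"
    "data/processed/000_seq16.npz,0,000,train\n"
    "data/processed/001_seq0.npz,0,001,val\n"
    "data/processed/001_seq16.npz,0,001,test\n"
)
rows = list(csv.DictReader(sample_csv))
print(find_split_leaks(rows))  # ['001']
```

For the real file, replace the StringIO object with open('data/metadata.csv', newline=''); an empty list means sequences from the same video never straddle train, val, and test.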
    

Phase 3: Model Architecture Verification & Fix

Goal: Ensure the LRCN + ViT hybrid model is correctly implemented and matches the research proposal.

Architecture Spec (from proposal)

Input: eye-region sequence (T=16 frames, each 224×224 RGB) + EAR signal (T floats)
  ↓
ViT Backbone (timm: vit_small_patch16_224, pretrained=True)
  → Per-frame [CLS] token → shape (T, 384)
  ↓
LSTM Temporal Encoder
  → Hidden size: 256, Num layers: 2, Dropout: 0.3
  ↓
Blink Dynamics Head
  → Concatenate LSTM output + EAR
  → FC(257, 128) → ReLU
  → Blink timing constraint (0.1–0.4s window)
  ↓
Classifier Head
  → FC(256, 128) → ReLU → Dropout(0.5) → FC(128, 2)
  → Output: [real_logit, fake_logit]

Note: the Blink Dynamics Head emits 128 features, while the Classifier Head is specified as FC(256, 128). Confirm against the proposal which tensor feeds the classifier (the 256-d LSTM state or the 128-d blink features) and size its first layer to match.
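
To make the blink timing constraint concrete: at the ~10 fps sampling rate used in Phase 1, the 0.1 to 0.4 s window corresponds to runs of 1 to 4 consecutive closed-eye frames. The sketch below is an illustration, not the proposal's method: the helper names and the EAR trace are hypothetical, and the 0.2 closed-eye threshold is taken from task 3.4.

```python
def blink_durations(ear, fps=10.0, threshold=0.2):
    """Duration in seconds of each run of consecutive frames with EAR below threshold."""
    durations, run = [], 0
    for value in ear:
        if value < threshold:
            run += 1
        elif run:
            durations.append(run / fps)
            run = 0
    if run:  # a run that lasts until the end of the sequence
        durations.append(run / fps)
    return durations


def in_blink_window(duration_s, low=0.1, high=0.4):
    """True when a closed-eye run falls inside the 0.1 to 0.4 s window."""
    return low <= duration_s <= high


# Hypothetical 16-frame EAR trace: a 2-frame blink, then a 6-frame closure
ear = [0.31, 0.30, 0.15, 0.12, 0.29, 0.30, 0.1, 0.1,
       0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.3]
durations = blink_durations(ear)
print(durations)                                # [0.2, 0.6]
print([in_blink_window(d) for d in durations])  # [True, False]
```

A regularizer built on this idea would treat the second run as implausible for a natural blink; how that is turned into a loss term is left to the proposal.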

Tasks

  • 3.1 Open src/models/backbones.py and verify build_backbone(config) returns a timm ViT. For vit_small_patch16_224 embed dim = 384.
  • 3.2 Open src/models/lrcn_vit.py and verify the forward pass. Frames arrive as (B, T, 3, 224, 224). Reshape to (B*T, 3, 224, 224) before ViT, then reshape back to (B, T, embed_dim) before LSTM.
  • 3.3 Add attention consistency loss: KL-divergence between adjacent frame ViT attention maps, weighted by lambda_attn.
  • 3.4 Add blink timing regularizer: penalize uncertain predictions when EAR < 0.2 but blink duration is outside 0.1–0.4s. Weight: lambda_blink.
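  • 3.5 Add unit test in tests/test_model.py (the import path is assumed from the file layout; config is the model config, loaded the same way train.py loads it):
    import torch
    from src.models.lrcn_vit import LRCNViT
    model = LRCNViT(config)
    dummy = {'frames': torch.randn(2, 16, 3, 224, 224), 'ear': torch.randn(2, 16)}
    out = model(dummy)
    assert out['logits'].shape == (2, 2)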
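
The attention consistency loss in 3.3 can be prototyped outside the model first. The sketch below is one possible reading, not the proposal's definition: it treats each frame's attention map as a flattened list of non-negative weights, normalizes it to a distribution, and averages KL(frame t || frame t+1) over adjacent pairs. The direction of the KL term, the epsilon smoothing, and the helper names are all assumptions.

```python
import math


def normalize(weights, eps=1e-8):
    """Turn non-negative attention weights into a probability distribution."""
    total = sum(weights) + eps * len(weights)
    return [(w + eps) / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def attention_consistency(attn_maps):
    """Mean KL divergence between each frame's attention map and the next frame's."""
    dists = [normalize(a) for a in attn_maps]
    pairs = list(zip(dists[:-1], dists[1:]))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Hypothetical flattened attention maps over 4 patches for 3 frames
stable = [[0.1, 0.2, 0.3, 0.4]] * 3
jumpy = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
print(attention_consistency(stable))       # 0.0
print(attention_consistency(jumpy) > 0.0)  # True
```

In the training code the same quantity would be computed on tensors, for example with torch.nn.functional.kl_div, which expects log-probabilities as its first argument, and then scaled by lambda_attn.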
    

Phase 4: Training Loop Fix & Wire-Up

Goal: Get the full training loop running end-to-end with adversarial training and all loss components.

Tasks

  • 4.1 Open src/train/train.py and verify it loads config, DataLoader, model, AdamW, LR scheduler, and saves outputs/best.pt on val AUC improvement.
  • 4.2 Wire in wandb: if config.wandb.enabled: true, call wandb.init() and log metrics each epoch.
  • 4.3 Total loss formula:
    L_total = L_ce(clean)
            + alpha     * L_ce(adversarial)
            + lambda_attn  * L_attn_consistency
            + lambda_blink * L_blink_regularizer
    
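  • 4.4 Open src/train/adversarial.py and verify PGD: eps=8/255 (about 0.031, matching pgd_eps in the config), steps=10, applied only to eye-region frames. eps is expressed in [0, 1] pixel units, so if the attack runs on normalized tensors, divide eps by the per-channel std or perturb before normalization.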
  • 4.5 Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  • 4.6 Update configs/train/aat_pgd.yaml:
    epochs: 30
    batch_size: 16
    lr: 3e-4
    weight_decay: 1e-4
    alpha: 0.5
    lambda_attn: 0.1
    lambda_blink: 0.05
    pgd_eps: 0.031
    pgd_steps: 10
    wandb:
      enabled: false
      project: "deepfake-eye-blink"
    
  • 4.7 Smoke-train: 2 epochs on 50 samples and confirm zero errors.
  • 4.8 Full training: python -m src.train.train --config configs/train/aat_pgd.yaml
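
For reviewing src/train/adversarial.py, it helps to have the PGD update in its simplest form. The sketch below runs L-infinity PGD against a toy linear classifier in pure Python so every step is visible. It is not the project's implementation: the model, the weights, and the 2/255 step size are assumptions made for the example, while eps=8/255 and 10 steps come from task 4.4.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad_wrt_input(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. x for p = sigmoid(w . x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(x, y, w, b, eps=8 / 255, step=2 / 255, steps=10):
    """L-infinity PGD: ascend the loss, then project into the eps-ball and [0, 1]."""
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_grad_wrt_input(x_adv, y, w, b)
        x_adv = [xi + step * sign(g) for xi, g in zip(x_adv, grad)]
        # Project onto the eps-ball around the clean input
        x_adv = [min(max(xa, xc - eps), xc + eps) for xa, xc in zip(x_adv, x)]
        # Keep pixel values in the valid range
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


# Toy 3-pixel input with label 1 (fake); the weights are hypothetical
x, y, w, b = [0.5, 0.2, 0.9], 1, [2.0, -1.0, 0.5], 0.0
x_adv = pgd_linf(x, y, w, b)
print(max(abs(a - c) for a, c in zip(x_adv, x)) <= 8 / 255 + 1e-12)  # True
```

The two properties to look for in the real code are the same as here: the perturbation never leaves the eps-ball around the clean frame, and the perturbed frame stays a valid image.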

Phase 5: Evaluation & Ablation

Goal: Produce evaluation numbers and ablation table for the thesis.

Tasks

  • 5.1 Open src/eval/evaluate.py and verify it outputs Accuracy, Precision, Recall, F1, AUC.
  • 5.2 Run: python -m src.eval.evaluate --checkpoint outputs/best.pt --config configs/train/aat_pgd.yaml
  • 5.3 Open src/eval/ablation.py and confirm 4 configs: Full / No AAT / No ViT / No blink regularizer.
  • 5.4 Run ablation: python -m src.eval.ablation --config configs/train/aat_pgd.yaml
  • 5.5 Open src/eval/plots.py and confirm it generates confusion_matrix.png and roc_curve.png.
  • 5.6 Fill in docs/results_template.md with actual numbers.
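
When verifying evaluate.py, a small reference implementation makes it easier to spot a swapped positive class or an off-by-one in the confusion counts. The sketch below computes the five metrics from 5.1 in pure Python, treating label 1 (fake) as the positive class. The function names and the example scores are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 (fake) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the chance that a random fake outscores a random real (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and fake-probabilities for six sequences
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]
y_pred = [int(s >= 0.5) for s in scores]
print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

scikit-learn, already pinned in requirements.txt, provides equivalents in sklearn.metrics (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score), so this sketch is best used as a cross-check on small inputs.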

Phase 6: Inference API

Goal: FastAPI server that accepts an uploaded video and returns a prediction.

New files

api/
  main.py
  inference.py    # reuses the same eye extraction logic from stream_ff_dataset.py
  schemas.py
  requirements.txt

Tasks

  • 6.1 api/inference.py: reuse extract_sequences_from_video_bytes() from stream_ff_dataset.py. Load model once, run forward pass on all sequences, average predictions across sequences.
  • 6.2 api/main.py: /predict endpoint (POST, multipart file upload) + /health endpoint.
  • 6.3 Load model at startup via FastAPI lifespan, not per-request.
  • 6.4 Add CORS for http://localhost:5173.
  • 6.5 api/requirements.txt: fastapi>=0.111.0, uvicorn[standard], python-multipart>=0.0.9
  • 6.6 Test: curl -X POST http://localhost:8000/predict -F "file=@test_video.mp4"
  • 6.7 scripts/start_api.sh:
    source .venv311/bin/activate
    uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
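
The averaging step in 6.1 can be sketched independently of FastAPI and the model. The example below assumes each sequence produces a [real_logit, fake_logit] pair, as in the architecture spec, and averages the per-sequence fake probabilities into one verdict. The function name, the response field names, and the 0.5 threshold are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_video_prediction(seq_logits, threshold=0.5):
    """Average per-sequence fake probabilities into one video-level verdict."""
    if not seq_logits:
        raise ValueError("no sequences extracted; the video may contain no detectable face")
    fake_probs = [softmax(pair)[1] for pair in seq_logits]  # index 1 = fake_logit
    mean_fake = sum(fake_probs) / len(fake_probs)
    return {
        "label": "FAKE" if mean_fake >= threshold else "REAL",
        "fake_probability": mean_fake,
        "num_sequences": len(fake_probs),
    }


# Hypothetical [real_logit, fake_logit] pairs for three sequences of one video
result = aggregate_video_prediction([[2.0, -1.0], [0.5, 0.3], [1.2, 0.1]])
print(result["label"], result["num_sequences"])  # REAL 3
```

Raising on an empty list gives the /predict handler a clear case to translate into an error response when no face is found in the upload.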
    

Phase 7: Demo Frontend

Goal: React web UI for the defence demonstration.

Stack: React + Vite + Tailwind + Recharts

frontend/
  src/
    App.jsx
    components/
      VideoUploader.jsx
      ResultCard.jsx
      FrameChart.jsx
      AttentionViewer.jsx
  index.html
  package.json
  vite.config.js

Tasks

  • 7.1 cd frontend && npm create vite@latest . -- --template react && npm install
  • 7.2 npm install tailwindcss recharts axios
  • 7.3 VideoUploader.jsx: drag-and-drop or file picker for .mp4/.avi/.mov, video preview, "Analyse Video" button, loading spinner.
  • 7.4 ResultCard.jsx: REAL (green) / FAKE (red) verdict badge, confidence %, blink rate stat.
  • 7.5 FrameChart.jsx: Recharts line chart of per-frame fake probability, frames above 0.5 highlighted red.
  • 7.6 AttentionViewer.jsx: Grad-CAM attention overlay image from API response.
  • 7.7 Proxy in vite.config.js: /predict → http://localhost:8000/predict
  • 7.8 frontend/.env: VITE_API_URL=http://localhost:8000
  • 7.9 scripts/start_frontend.sh:
    cd frontend && npm run dev
    

Phase 8: Integration & Final QA

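  • 8.1 Run API + frontend together. Upload one of the source videos behind the .npz sequences as a test; raw videos are discarded during streaming, so fetch a single sample separately.
  • 8.2 Test with a real webcam recording; it should return REAL.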
  • 8.3 Fix any CORS issues.
  • 8.4 Create docs/README_DEMO.md:
    1. source .venv311/bin/activate
    2. ./scripts/start_api.sh         (Terminal 1)
    3. ./scripts/start_frontend.sh    (Terminal 2)
    4. Open http://localhost:5173
    
  • 8.5 Document exact setup commands for a fresh machine.

Project Directory Structure (Final)

deepfake-detector/
├── configs/
│   ├── base.yaml
│   ├── model/lrcn_vit.yaml
│   └── train/aat_pgd.yaml
├── src/
│   ├── data/
│   │   ├── stream_ff_dataset.py   ← NEW (replaces download-based flow)
│   │   ├── extract_eye_sequences.py
│   │   └── dataset.py
│   ├── models/
│   │   ├── backbones.py
│   │   └── lrcn_vit.py
│   ├── train/
│   │   ├── train.py
│   │   └── adversarial.py
│   ├── eval/
│   │   ├── evaluate.py
│   │   ├── ablation.py
│   │   └── plots.py
│   ├── viz/
│   │   └── attention_maps.py
│   └── utils.py
├── api/
│   ├── main.py
│   ├── inference.py
│   ├── schemas.py
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.jsx
│   │   └── components/
│   ├── index.html
│   ├── package.json
│   └── vite.config.js
├── data/
│   ├── processed/     ← .npz files only (~200MB), gitignored
│   └── metadata.csv   ← generated, gitignored
├── outputs/
│   ├── best.pt
│   ├── confusion_matrix.png
│   └── roc_curve.png
├── scripts/
│   ├── run_local.sh
│   ├── run_cloud.sh
│   ├── start_api.sh
│   └── start_frontend.sh
├── tests/
│   └── test_model.py
├── docs/
│   ├── reproducibility_checklist.md
│   ├── results_template.md
│   └── README_DEMO.md
├── .gitignore
├── requirements.txt
└── README.md

Suggestions & Overrides

⚠️ Old files to DEPRECATE (keep but do not use)

src/data/build_metadata.py and src/data/extract_frames.py were written for a local download workflow. They are superseded by stream_ff_dataset.py. Keep them in the repo for reference but do not run them.

⚠️ ViT Input Resolution

Frames are extracted at 224×224 directly in the streaming script. No resizing needed elsewhere.

⚠️ Internet Required for Phase 1

The streaming script needs internet during the ~20–60 min preprocessing run. After that, everything runs offline from the .npz files.

⚠️ Pre-trained Checkpoint Option

Use timm's pretrained ViT weights (ImageNet). Fine-tuning for 5–10 epochs on 400 videos is sufficient for a compelling defence demo.

✅ Frontend: Keep it Simple

Single-page upload β†’ result. No auth, no database needed.