# DeepFake Eye-Blink Detection: Cursor AI Build Plan
## Project Overview
An Enhanced Eye-Blinking LRCN (Long-term Recurrent Convolutional Network) for DeepFake detection using Attentive Adversarial Training (AAT) with a Vision Transformer (ViT) backbone. The research is by **Alina Chikwado Godsaves** under the supervision of **Mr. Akanji**.
The goal is to detect deepfake videos by analyzing unnatural eye-blinking patterns and fine-grained ocular artifacts (eyelid dynamics, pupil reflections) using a hybrid CNN/LSTM + ViT model hardened with adversarial training.
**Python version: 3.11** (the pinned `torch==2.2.2` has no wheels for Python 3.13)
---
## What Is Already Scaffolded (Do NOT Recreate)
All files below exist but need to be **verified, completed, and wired together**:
| Area | Files | Status |
|------|-------|--------|
| Config system | `configs/base.yaml`, `configs/model/lrcn_vit.yaml`, `configs/train/aat_pgd.yaml` | ✅ Exists |
| Data pipeline | `src/data/build_metadata.py`, `src/data/extract_frames.py`, `src/data/extract_eye_sequences.py`, `src/data/dataset.py` | ✅ Exists, verify |
| Model | `src/models/backbones.py`, `src/models/lrcn_vit.py` | ✅ Exists, verify |
| Training | `src/train/train.py`, `src/train/adversarial.py` | ✅ Exists, verify |
| Evaluation | `src/eval/evaluate.py`, `src/eval/ablation.py`, `src/eval/plots.py` | ✅ Exists, verify |
| Explainability | `src/viz/attention_maps.py` | ✅ Exists |
| Scripts | `scripts/run_local.sh`, `scripts/run_cloud.sh` | ✅ Exists |
| Docs | `docs/reproducibility_checklist.md`, `docs/results_template.md` | ✅ Exists |
---
## Phase 0: Environment & Dependency Fix (FIRST PRIORITY)
**Goal:** Get a working Python 3.11 venv with all ML/CV deps installed.
### Tasks
- [ ] **0.1** Confirm `python3.11` is available, or install via `pyenv` / system package manager
- [ ] **0.2** Create venv: `python3.11 -m venv .venv311 && source .venv311/bin/activate`
- [ ] **0.3** Pin exact working versions in `requirements.txt`:
```
torch==2.2.2
torchvision==0.17.2
timm==0.9.16
opencv-python-headless==4.9.0.80
mediapipe==0.10.11
pandas==2.2.2
numpy==1.26.4
scikit-learn==1.4.2
matplotlib==3.8.4
seaborn==0.13.2
tqdm==4.66.4
grad-cam==1.5.0  # PyPI name of the pytorch-grad-cam project
Pillow==10.3.0
pyyaml==6.0.1
albumentations==1.4.3
einops==0.7.0
wandb==0.17.0
datasets==2.19.0
huggingface_hub==0.23.0
av==12.0.0
```
- [ ] **0.4** Run `pip install -r requirements.txt` inside venv and confirm zero errors
- [ ] **0.5** Smoke-test: `python -c "import torch; import timm; import mediapipe; import datasets; print('OK')"`
- [ ] **0.6** Update `scripts/run_local.sh` to activate `.venv311` before any python calls
- [ ] **0.7** One-time HuggingFace login (only needed once per machine):
```bash
huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens
# Token needs Read access only
```
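The smoke test in 0.5 only proves that the imports succeed; it does not show that the installed versions match the pins from 0.3. A minimal sketch of such a check, using only the standard library, is given here. The helper names `parse_pins` and `check_pins` are hypothetical and not part of the scaffold.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines, skipping blank lines and '#' comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if '==' in line:
            name, version = line.split('==', 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (pinned, installed_or_None)} for every package that differs."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems
```

Inside the activated venv, `check_pins(parse_pins(open('requirements.txt').read()))` should return an empty dict once 0.4 has succeeded.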
---
## Phase 1: Dataset via HuggingFace Streaming (NO DOWNLOAD NEEDED)
**Goal:** Stream FaceForensics++ c23 videos directly from HuggingFace one at a time, extract eye sequences into tiny `.npz` files, and discard each video. No raw videos are ever stored on disk.
### How Streaming Works
```
HuggingFace server
    → sends video #1 to RAM (temp, ~5MB)
    → MediaPipe extracts eye crops + EAR signal
    → saves compressed .npz files (one per 16-frame sequence) to data/processed/
    → video is gone from memory
    → repeat for video #2, #3 ... #400
```
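At the end: one `.npz` per 16-frame sequence, so each video contributes several files. Each sequence holds 16 × 224 × 224 × 3 uint8 values (about 2.4 MB before compression), so confirm the real total with `du -sh data/processed/` rather than relying on the ~100–300 MB estimate. Zero raw videos are kept on disk.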
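The number of files a single video produces follows from the sampling and windowing used by the streaming script in task 1.1: keep roughly 10 frames per second, then cut non-overlapping windows of 16 frames. The sketch below mirrors that arithmetic; `count_sequences` is a hypothetical helper, and it gives an upper bound because frames without a detected face are dropped.

```python
import math


def count_sequences(n_frames, fps, target_fps=10, seq_len=16):
    """Upper bound on the number of .npz files one video yields.

    Keeps every `frame_interval`-th frame, then cuts non-overlapping
    windows of `seq_len` frames; a trailing partial window is discarded.
    """
    frame_interval = max(1, int(fps / target_fps))
    sampled = math.ceil(n_frames / frame_interval)
    return sampled // seq_len


# Raw pixel payload of one sequence before compression: 16 x 224 x 224 x 3 bytes
BYTES_PER_SEQUENCE = 16 * 224 * 224 * 3
```

For example, a 10-second clip at 30 fps has 300 frames, of which 100 are sampled, giving at most `count_sequences(300, 30)` = 6 sequences.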
### Dataset
**Source:** `bitmind/FaceForensicsC23` on HuggingFace
**URL:** https://huggingface.co/datasets/bitmind/FaceForensicsC23
**Contents:** 7,000 MP4 videos: 1,000 real + 6,000 deepfakes (Deepfakes, Face2Face, FaceShifter, FaceSwap, NeuralTextures, DeepFakeDetection), c23 compression
**We use:** 200 real (`/Real/`) + 200 fake (`/Deepfakes/`) = 400 videos total
### Tasks
- [ ] **1.1** Create `src/data/stream_ff_dataset.py`, a NEW script that replaces the old download-based `build_metadata.py` + `extract_frames.py` flow:
```python
r"""
Stream FaceForensics++ c23 from HuggingFace.
Downloads one video at a time into RAM, extracts eye sequences,
saves .npz files, discards the video. No raw videos are kept on disk
(each video only touches a temp file that is deleted right after decoding).

Usage:
    python -m src.data.stream_ff_dataset \
        --out-root data/processed \
        --num-real 200 \
        --num-fake 200
"""
import argparse
import csv
import os
import tempfile
from pathlib import Path

import cv2
import mediapipe as mp
import numpy as np
from datasets import load_dataset
from tqdm import tqdm

HF_DATASET = "bitmind/FaceForensicsC23"
REAL_PATH_MARKER = "/Real/"
FAKE_PATH_MARKER = "/Deepfakes/"  # use only Deepfakes subfolder, not all 6

# MediaPipe eye landmark indices, ordered [corner, upper, upper, corner, lower, lower]
LEFT_EYE = [33, 160, 158, 133, 153, 144]
RIGHT_EYE = [362, 385, 387, 263, 373, 380]


def compute_ear(landmarks, eye_indices, width=1.0, height=1.0):
    """Compute Eye Aspect Ratio from MediaPipe landmarks.

    Landmarks are normalized to [0, 1] per axis, so pass the frame width and
    height to measure distances in pixels; otherwise EAR is skewed on
    non-square frames.
    """
    # eye_indices: [p1, p2, p3, p4, p5, p6]
    p = [np.array([landmarks[i].x * width, landmarks[i].y * height]) for i in eye_indices]
    A = np.linalg.norm(p[1] - p[5])  # vertical distance, first lid pair
    B = np.linalg.norm(p[2] - p[4])  # vertical distance, second lid pair
    C = np.linalg.norm(p[0] - p[3])  # horizontal distance between eye corners
    return (A + B) / (2.0 * C + 1e-6)


def extract_sequences_from_video_bytes(video_bytes, label, video_id, seq_len=16):
    """
    Given raw video bytes, extract non-overlapping eye-region sequences.
    Returns list of dicts: {'frames': (T,H,W,3), 'ear': (T,), 'label': int, 'video_id': str}
    """
    # Write to a temp file so OpenCV can read it
    with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as f:
        f.write(video_bytes)
        tmp_path = f.name

    all_frames, all_ears = [], []
    face_mesh = mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False, max_num_faces=1, refine_landmarks=True
    )
    cap = cv2.VideoCapture(tmp_path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS) or 30
        frame_interval = max(1, int(fps / 10))  # sample at ~10fps
        frame_idx = 0
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx % frame_interval == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                result = face_mesh.process(rgb)
                if result.multi_face_landmarks:
                    lm = result.multi_face_landmarks[0].landmark
                    h, w = frame.shape[:2]
                    # Compute EAR (average of both eyes), measured in pixels
                    left_ear = compute_ear(lm, LEFT_EYE, w, h)
                    right_ear = compute_ear(lm, RIGHT_EYE, w, h)
                    ear = (left_ear + right_ear) / 2.0
                    # Crop eye region: bounding box around both eyes, 20 px margin
                    eye_pts = [lm[i] for i in LEFT_EYE + RIGHT_EYE]
                    xs = [int(p.x * w) for p in eye_pts]
                    ys = [int(p.y * h) for p in eye_pts]
                    x1, x2 = max(0, min(xs) - 20), min(w, max(xs) + 20)
                    y1, y2 = max(0, min(ys) - 20), min(h, max(ys) + 20)
                    crop = rgb[y1:y2, x1:x2]
                    if crop.size > 0:
                        crop = cv2.resize(crop, (224, 224))  # ViT input size
                        all_frames.append(crop)
                        all_ears.append(ear)
            frame_idx += 1
    finally:
        cap.release()
        face_mesh.close()
        os.unlink(tmp_path)  # delete temp file immediately, even on error

    # Slice into non-overlapping sequences of length seq_len
    sequences = []
    for i in range(0, len(all_frames) - seq_len + 1, seq_len):
        frames = np.stack(all_frames[i:i + seq_len]).astype(np.uint8)
        ears = np.array(all_ears[i:i + seq_len], dtype=np.float32)
        sequences.append({
            'frames': frames,
            'ear': ears,
            'label': label,
            'video_id': f"{video_id}_seq{i}"
        })
    return sequences


def stream_and_extract(out_root, num_real=200, num_fake=200, seq_len=16, seed=42):
    out_root = Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)
    # Stream dataset: never downloads the full zip
    ds = load_dataset(HF_DATASET, streaming=True, split="train")
    real_count, fake_count = 0, 0
    metadata_rows = []
    pbar = tqdm(total=num_real + num_fake, desc="Streaming videos")
    for item in ds:
        # Assumption: the column is named 'video' and, when streaming, holds a
        # dict with 'path' and 'bytes' keys. Check the dataset viewer if not.
        video = item.get('video')
        if isinstance(video, dict):
            video_path_str = str(video.get('path') or '')
            video_bytes = video.get('bytes')
        else:
            video_path_str, video_bytes = str(video or ''), None
        is_real = REAL_PATH_MARKER in video_path_str and real_count < num_real
        is_fake = FAKE_PATH_MARKER in video_path_str and fake_count < num_fake
        if (not is_real and not is_fake) or video_bytes is None:
            continue
        label = 0 if is_real else 1
        video_id = Path(video_path_str).stem
        sequences = extract_sequences_from_video_bytes(
            video_bytes, label, video_id, seq_len
        )
        for seq in sequences:
            npz_path = out_root / f"{seq['video_id']}.npz"
            np.savez_compressed(
                npz_path,
                frames=seq['frames'],
                ear=seq['ear'],
                label=np.array(seq['label']),
                video_id=np.array(seq['video_id'])
            )
            metadata_rows.append({
                'npz_path': str(npz_path),
                'label': label,
                'video_id': video_id,
                'split': 'train'  # will be reassigned below
            })
        if is_real:
            real_count += 1
        else:
            fake_count += 1
        pbar.update(1)
        if real_count >= num_real and fake_count >= num_fake:
            break
    pbar.close()

    # Assign splits: 70% train, 15% val, 15% test (by video_id, not sequence)
    unique_ids = sorted({r['video_id'] for r in metadata_rows})
    rng = np.random.default_rng(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    n = len(unique_ids)
    train_ids = set(unique_ids[:int(0.7 * n)])
    val_ids = set(unique_ids[int(0.7 * n):int(0.85 * n)])
    for row in metadata_rows:
        if row['video_id'] in train_ids:
            row['split'] = 'train'
        elif row['video_id'] in val_ids:
            row['split'] = 'val'
        else:
            row['split'] = 'test'

    # Write metadata CSV
    csv_path = Path('data/metadata.csv')
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['npz_path', 'label', 'video_id', 'split'])
        writer.writeheader()
        writer.writerows(metadata_rows)
    print(f"\nDone! {real_count} real + {fake_count} fake videos processed.")
    print(f"Total sequences: {len(metadata_rows)}")
    print(f"Metadata written to: {csv_path}")
    print(f"Sequences saved to: {out_root}")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--out-root', default='data/processed')
    parser.add_argument('--num-real', type=int, default=200)
    parser.add_argument('--num-fake', type=int, default=200)
    parser.add_argument('--seq-len', type=int, default=16)
    parser.add_argument('--seed', type=int, default=42)
    args = parser.parse_args()
    stream_and_extract(args.out_root, args.num_real, args.num_fake, args.seq_len, args.seed)
```
- [ ] **1.2** Run the streaming script:
```bash
source .venv311/bin/activate
python -m src.data.stream_ff_dataset \
--out-root data/processed \
--num-real 200 \
--num-fake 200
```
This will run for ~20–60 minutes depending on internet speed. It streams each video, processes it, saves the resulting `.npz` sequences, and moves on. Your terminal will show a progress bar.
- [ ] **1.3** When done, verify output:
```bash
ls data/processed/ | wc -l # should be several hundred .npz files
du -sh data/processed/ # should be ~100-300MB total
python -c "
import numpy as np
d = np.load('data/processed/' + __import__('os').listdir('data/processed')[0], allow_pickle=True)
print('frames:', d['frames'].shape) # expect (16, 224, 224, 3)
print('ear:', d['ear'].shape) # expect (16,)
print('label:', d['label']) # expect 0 or 1
"
```
- [ ] **1.4** Verify `data/metadata.csv` has rows with `npz_path`, `label`, `video_id`, `split` columns and a healthy mix of train/val/test rows
- [ ] **1.5** Update `src/data/dataset.py` to read from `data/metadata.csv` (pointing to `.npz` files) instead of from raw video paths. The `__getitem__` contract remains unchanged:
```python
{'frames': Tensor[T,3,224,224], 'ear': Tensor[T], 'label': int}
```
- [ ] **1.6** Update `configs/base.yaml`:
```yaml
data:
  metadata_csv: data/metadata.csv
  processed_root: data/processed
  seq_len: 16
  img_size: 224
```
- [ ] **1.7** Add to `.gitignore`:
```
data/processed/
data/raw/
data/metadata.csv
outputs/
*.npz
*.pt
.venv311/
```
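Task 1.4 asks for a "healthy mix" of splits. One way to make that check concrete is to count sequences per split and label and to confirm that no `video_id` ends up in more than one split. The sketch below assumes the four CSV columns written by the streaming script; `summarize_splits` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def summarize_splits(rows):
    """Summarize metadata rows (dicts with 'video_id', 'label', 'split').

    Returns (counts, leaked): `counts` maps (split, label) to the number of
    sequences; `leaked` lists video_ids that appear in more than one split.
    """
    counts = Counter()
    splits_per_video = defaultdict(set)
    for row in rows:
        counts[(row['split'], str(row['label']))] += 1
        splits_per_video[row['video_id']].add(row['split'])
    leaked = sorted(v for v, s in splits_per_video.items() if len(s) > 1)
    return counts, leaked
```

With the real file, pass `csv.DictReader(open('data/metadata.csv'))` as `rows`; `leaked` should come back empty, and every `(split, label)` pair should have a non-zero count.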
---
## Phase 2: Dataset Loader Verification
**Goal:** Confirm `src/data/dataset.py` correctly reads the `.npz` files produced by streaming.
### Tasks
- [ ] **2.1** Open `src/data/dataset.py` and update it to read from `metadata.csv` instead of raw video paths. Each row's `npz_path` points directly to a processed sequence file.
- [ ] **2.2** Add `albumentations` augmentations for training split only:
```python
import albumentations as A
from albumentations.pytorch import ToTensorV2
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, p=0.5),
    A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
    A.ImageCompression(quality_lower=70, quality_upper=100, p=0.3),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
```
- [ ] **2.3** Smoke-test the DataLoader:
```python
from src.data.dataset import EyeBlinkDataset
ds = EyeBlinkDataset('data/metadata.csv', split='train')
sample = ds[0]
assert sample['frames'].shape == (16, 3, 224, 224)
assert sample['ear'].shape == (16,)
assert sample['label'] in [0, 1]
print("DataLoader OK")
```
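The `.npz` files store frames as `(T, H, W, 3)` uint8, while the `__getitem__` contract expects `(T, 3, 224, 224)` floats. A minimal NumPy sketch of that conversion follows; it is an illustration, not the scaffolded `dataset.py`. It also shows one clip-level flip: if an augmentation pipeline is called once per frame, each call samples its own random parameters, so a clip could be flipped on some frames and not others. The helper name `clip_to_model_layout` is hypothetical.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def clip_to_model_layout(frames, flip=False):
    """(T, H, W, 3) uint8 -> (T, 3, H, W) float32, ImageNet-normalized.

    `flip` mirrors every frame of the clip together, so the flip direction
    never changes in the middle of a sequence.
    """
    if flip:
        frames = frames[:, :, ::-1, :]           # reverse the width axis
    x = frames.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD       # broadcasts over the channel axis
    return np.ascontiguousarray(x.transpose(0, 3, 1, 2))
```

`torch.from_numpy(clip_to_model_layout(d['frames']))` then yields a tensor with the shape asserted in the smoke test above.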
---
## Phase 3: Model Architecture Verification & Fix
**Goal:** Ensure the LRCN + ViT hybrid model is correctly implemented and matches the research proposal.
### Architecture Spec (from proposal)
```
Input: eye-region sequence (T=16 frames, each 224×224 RGB) + EAR signal (T floats)
    ↓
ViT Backbone (timm: vit_small_patch16_224, pretrained=True)
    → Per-frame [CLS] token → shape (T, 384)
    ↓
LSTM Temporal Encoder
    → Hidden size: 256, Num layers: 2, Dropout: 0.3
    ↓
Blink Dynamics Head
    → Concatenate LSTM output + EAR
    → FC(257, 128) → ReLU
    → Blink timing constraint (0.1–0.4s window)
    ↓
Classifier Head
    → FC(256, 128) → ReLU → Dropout(0.5) → FC(128, 2)
    → Output: [real_logit, fake_logit]
```
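The tensor shapes in the spec can be traced without loading any weights. The sketch below is a NumPy shape walk-through only: the ViT and the LSTM are replaced by random linear stand-ins (an assumption made purely to keep the example self-contained), so it checks dimensions, not behaviour. It stops at the 257-wide input of `FC(257, 128)`; how that layer's 128-wide output reaches the classifier's `FC(256, 128)` is not stated in the spec, so the sketch does not guess.

```python
import numpy as np

B, T, EMBED, HIDDEN = 2, 16, 384, 256
rng = np.random.default_rng(0)

frames = rng.standard_normal((B, T, 3, 224, 224), dtype=np.float32)
ear = rng.standard_normal((B, T), dtype=np.float32)

# 1. Fold time into the batch axis so the backbone sees ordinary images.
flat = frames.reshape(B * T, 3, 224, 224)

# 2. Stand-in for the ViT: any map (N, 3, 224, 224) -> (N, 384).
#    Here: global average pool to (N, 3), then a random linear projection.
proj = rng.standard_normal((3, EMBED), dtype=np.float32)
cls_tokens = flat.mean(axis=(2, 3)) @ proj                      # (B*T, 384)

# 3. Unfold back into a sequence for the temporal encoder.
seq = cls_tokens.reshape(B, T, EMBED)                           # (B, T, 384)

# 4. Stand-in for the LSTM: a per-step linear map to the hidden size.
w_h = rng.standard_normal((EMBED, HIDDEN), dtype=np.float32)
lstm_out = np.tanh(seq @ w_h)                                   # (B, T, 256)

# 5. Blink dynamics head input: hidden state plus one EAR value per step.
blink_in = np.concatenate([lstm_out, ear[..., None]], axis=-1)  # (B, T, 257)
```

The key invariant is that `flat[b * T + t]` is frame `t` of sample `b`, so the reshape in step 3 restores the original ordering.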
### Tasks
- [ ] **3.1** Open `src/models/backbones.py` and verify `build_backbone(config)` returns a timm ViT. For `vit_small_patch16_224`, embed dim = 384.
- [ ] **3.2** Open `src/models/lrcn_vit.py` and verify the forward pass. Frames arrive as `(B, T, 3, 224, 224)`. Reshape to `(B*T, 3, 224, 224)` before ViT, then reshape back to `(B, T, embed_dim)` before LSTM.
- [ ] **3.3** Add **attention consistency loss**: KL-divergence between adjacent frame ViT attention maps, weighted by `lambda_attn`.
- [ ] **3.4** Add **blink timing regularizer**: penalize uncertain predictions when EAR < 0.2 but blink duration is outside 0.1–0.4s. Weight: `lambda_blink`.
- [ ] **3.5** Add unit test in `tests/test_model.py`:
```python
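import torch
from src.models.lrcn_vit import LRCNViT  # assumed module path for the class

def test_forward_shape(config):  # `config`: a pytest fixture you define to load the model config
    model = LRCNViT(config).eval()
    dummy = {'frames': torch.randn(2, 16, 3, 224, 224), 'ear': torch.randn(2, 16)}
    with torch.no_grad():
        out = model(dummy)
    assert out['logits'].shape == (2, 2)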
```
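The two regularizers in 3.3 and 3.4 rest on simple quantities that can be sketched outside the model. The code below is one possible reading, not the proposal's definition: the KL term compares each frame's attention map with the next frame's, and the blink helper measures how long EAR stays below 0.2 at the ~10 fps sampling rate of the streaming script. All function names are hypothetical, and a trainable version would need the same logic expressed with differentiable PyTorch ops.

```python
import numpy as np


def attention_consistency_kl(attn, eps=1e-8):
    """Mean KL(attn[t] || attn[t+1]) over adjacent frames.

    attn: (T, N) array, each row a non-negative attention map over N patches.
    Rows are renormalized to sum to 1 before the divergence is taken.
    """
    p = attn / (attn.sum(axis=1, keepdims=True) + eps)
    a, b = p[:-1], p[1:]
    kl = (a * (np.log(a + eps) - np.log(b + eps))).sum(axis=1)
    return float(kl.mean())


def blink_durations(ear, fps=10.0, threshold=0.2):
    """Durations in seconds of each run of consecutive frames with EAR < threshold."""
    durations, run = [], 0
    for value in ear:
        if value < threshold:
            run += 1
        elif run:
            durations.append(run / fps)
            run = 0
    if run:                       # sequence ended while the eye was still closed
        durations.append(run / fps)
    return durations


def implausible_blinks(ear, fps=10.0, lo=0.1, hi=0.4):
    """Blinks whose duration falls outside the [lo, hi] second window."""
    return [d for d in blink_durations(ear, fps) if d < lo or d > hi]
```

At 10 fps one frame spans 0.1 s, so durations are resolved only in 0.1 s steps; a higher sampling rate would be needed for a finer timing constraint.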
---
## Phase 4: Training Loop Fix & Wire-Up
**Goal:** Get the full training loop running end-to-end with adversarial training and all loss components.
### Tasks
- [ ] **4.1** Open `src/train/train.py` and verify it loads config, DataLoader, model, AdamW, LR scheduler, and saves `outputs/best.pt` on val AUC improvement.
- [ ] **4.2** **Wire in `wandb`**: if `config.wandb.enabled: true`, call `wandb.init()` and log metrics each epoch.
- [ ] **4.3** Total loss formula:
```
L_total = L_ce(clean)
        + alpha * L_ce(adversarial)
        + lambda_attn * L_attn_consistency
        + lambda_blink * L_blink_regularizer
```
- [ ] **4.4** Open `src/train/adversarial.py` and verify PGD: `eps=8/255`, `steps=10`, applied only to eye-region frames.
- [ ] **4.5** Add gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`
- [ ] **4.6** Update `configs/train/aat_pgd.yaml`:
```yaml
epochs: 30
batch_size: 16
lr: 3e-4
weight_decay: 1e-4
alpha: 0.5
lambda_attn: 0.1
lambda_blink: 0.05
pgd_eps: 0.03137  # 8/255, as in task 4.4
pgd_steps: 10
wandb:
  enabled: false
  project: "deepfake-eye-blink"
```
- [ ] **4.7** Smoke-train: 2 epochs on 50 samples; confirm zero errors.
- [ ] **4.8** Full training: `python -m src.train.train --config configs/train/aat_pgd.yaml`
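To make the PGD step in 4.4 and the loss in 4.3 concrete, here is a small NumPy sketch. It attacks a logistic-regression stand-in whose input gradient is known in closed form, which is an assumption made to keep the example self-contained; in `adversarial.py` the gradient would come from autograd through the real model. The step size `2.5 * eps / steps` is a common heuristic, not something the plan specifies, and all function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(x @ w + b)
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, (p - y) * w


def pgd_linf(x, y, w, b, eps=8 / 255, steps=10, step_size=None):
    """L-infinity PGD: ascend the loss, then project back into the eps-ball and [0, 1]."""
    step_size = step_size if step_size is not None else 2.5 * eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        _, grad = bce_loss_and_grad(x_adv, y, w, b)
        x_adv = x_adv + step_size * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay within eps of the clean input
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay a valid image
    return x_adv


def total_loss(l_clean, l_adv, l_attn, l_blink,
               alpha=0.5, lambda_attn=0.1, lambda_blink=0.05):
    """L_total as written in task 4.3, with the default weights from aat_pgd.yaml."""
    return l_clean + alpha * l_adv + lambda_attn * l_attn + lambda_blink * l_blink
```

Note that `eps = 8/255` is defined for pixels in `[0, 1]`. If the attack is applied after ImageNet normalization, the bound and the clipping range have to be rescaled by the per-channel std, or the attack should be run before normalization.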
---
## Phase 5: Evaluation & Ablation
**Goal:** Produce evaluation numbers and ablation table for the thesis.
### Tasks
- [ ] **5.1** Open `src/eval/evaluate.py` and verify it outputs Accuracy, Precision, Recall, F1, AUC.
- [ ] **5.2** Run: `python -m src.eval.evaluate --checkpoint outputs/best.pt --config configs/train/aat_pgd.yaml`
- [ ] **5.3** Open `src/eval/ablation.py` and confirm 4 configs: Full / No AAT / No ViT / No blink regularizer.
- [ ] **5.4** Run ablation: `python -m src.eval.ablation --config configs/train/aat_pgd.yaml`
- [ ] **5.5** Open `src/eval/plots.py` and confirm it generates `confusion_matrix.png` and `roc_curve.png`.
- [ ] **5.6** Fill in `docs/results_template.md` with actual numbers.
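When verifying `evaluate.py`, it can help to cross-check its numbers against an independent computation. The dependency-free sketch below computes the five metrics from task 5.1 for binary labels (0 = real, 1 = fake); `binary_metrics` is a hypothetical helper, and `sklearn.metrics` from the pinned `scikit-learn` offers equivalent functions.

```python
def binary_metrics(labels, probs, threshold=0.5):
    """Accuracy, precision, recall, F1 and ROC AUC for labels in {0, 1}.

    probs: predicted probability of the positive (fake) class per sample.
    """
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, q in zip(labels, preds) if y == 1 and q == 1)
    tn = sum(1 for y, q in zip(labels, preds) if y == 0 and q == 0)
    fp = sum(1 for y, q in zip(labels, preds) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(labels, preds) if y == 1 and q == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # AUC as the probability that a random fake scores above a random real
    # (ties count as half). Quadratic in sample count, fine for small test sets.
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float('nan')
    return {
        'accuracy': (tp + tn) / len(labels),
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc,
    }
```

Because each video contributes several sequences, sequence-level and video-level metrics can differ; `docs/results_template.md` should state which of the two is reported.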
---
## Phase 6: Inference API
**Goal:** FastAPI server that accepts an uploaded video and returns a prediction.
### New files
```
api/
  main.py
  inference.py   # reuses the same eye extraction logic from stream_ff_dataset.py
  schemas.py
  requirements.txt
```
### Tasks
- [ ] **6.1** `api/inference.py`: reuse `extract_sequences_from_video_bytes()` from `stream_ff_dataset.py`. Load model once, run forward pass on all sequences, average predictions across sequences.
- [ ] **6.2** `api/main.py`: `/predict` endpoint (POST, multipart file upload) + `/health` endpoint.
- [ ] **6.3** Load model at startup via FastAPI `lifespan`, not per-request.
- [ ] **6.4** Add CORS for `http://localhost:5173`.
- [ ] **6.5** `api/requirements.txt`: `fastapi>=0.111.0`, `uvicorn[standard]`, `python-multipart>=0.0.9`
- [ ] **6.6** Test: `curl -X POST http://localhost:8000/predict -F "file=@test_video.mp4"`
- [ ] **6.7** `scripts/start_api.sh`:
```bash
source .venv311/bin/activate
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```
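The "average predictions across sequences" step from 6.1 can be sketched independently of FastAPI and the model. The function below takes one `[real_logit, fake_logit]` pair per sequence and returns a video-level verdict. The response field names are hypothetical, and the empty case covers uploads where no face is found or fewer than 16 frames are sampled, since the extraction function then returns no sequences.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_video_prediction(seq_logits, threshold=0.5):
    """Average per-sequence fake probabilities into one video-level verdict.

    seq_logits: list of [real_logit, fake_logit] pairs, one per 16-frame sequence.
    """
    if not seq_logits:
        return {'verdict': 'UNKNOWN', 'fake_probability': None, 'num_sequences': 0}
    fake_probs = [softmax(pair)[1] for pair in seq_logits]
    mean_fake = sum(fake_probs) / len(fake_probs)
    return {
        'verdict': 'FAKE' if mean_fake >= threshold else 'REAL',
        'fake_probability': mean_fake,
        'num_sequences': len(fake_probs),
    }
```

Averaging probabilities rather than hard votes keeps a confidence value available for the result card in the frontend.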
---
## Phase 7: Demo Frontend
**Goal:** React web UI for the defence demonstration.
### Stack: React + Vite + Tailwind + Recharts
```
frontend/
  src/
    App.jsx
    components/
      VideoUploader.jsx
      ResultCard.jsx
      FrameChart.jsx
      AttentionViewer.jsx
  index.html
  package.json
  vite.config.js
```
### Tasks
- [ ] **7.1** `cd frontend && npm create vite@latest . -- --template react && npm install`
- [ ] **7.2** `npm install tailwindcss recharts axios`
- [ ] **7.3** `VideoUploader.jsx`: drag-and-drop or file picker for `.mp4/.avi/.mov`, video preview, "Analyse Video" button, loading spinner.
- [ ] **7.4** `ResultCard.jsx`: REAL (green) / FAKE (red) verdict badge, confidence %, blink rate stat.
- [ ] **7.5** `FrameChart.jsx`: Recharts line chart of per-frame fake probability, frames above 0.5 highlighted red.
- [ ] **7.6** `AttentionViewer.jsx`: Grad-CAM attention overlay image from API response.
- [ ] **7.7** Proxy in `vite.config.js`: `/predict` → `http://localhost:8000/predict`
- [ ] **7.8** `frontend/.env`: `VITE_API_URL=http://localhost:8000`
- [ ] **7.9** `scripts/start_frontend.sh`:
```bash
cd frontend && npm run dev
```
---
## Phase 8: Integration & Final QA
- [ ] **8.1** Run API + frontend together. Upload a sample `.mp4` from the same FaceForensics++ source as a test (raw videos are not kept on disk, so fetch one separately).
- [ ] **8.2** Test with a real webcam recording: it should return REAL.
- [ ] **8.3** Fix any CORS issues.
- [ ] **8.4** Create `docs/README_DEMO.md`:
```
1. source .venv311/bin/activate
2. ./scripts/start_api.sh (Terminal 1)
3. ./scripts/start_frontend.sh (Terminal 2)
4. Open http://localhost:5173
```
- [ ] **8.5** Document exact setup commands for a fresh machine.
---
## Project Directory Structure (Final)
```
deepfake-detector/
├── configs/
│   ├── base.yaml
│   ├── model/lrcn_vit.yaml
│   └── train/aat_pgd.yaml
├── src/
│   ├── data/
│   │   ├── stream_ff_dataset.py   ← NEW (replaces download-based flow)
│   │   ├── extract_eye_sequences.py
│   │   └── dataset.py
│   ├── models/
│   │   ├── backbones.py
│   │   └── lrcn_vit.py
│   ├── train/
│   │   ├── train.py
│   │   └── adversarial.py
│   ├── eval/
│   │   ├── evaluate.py
│   │   ├── ablation.py
│   │   └── plots.py
│   ├── viz/
│   │   └── attention_maps.py
│   └── utils.py
├── api/
│   ├── main.py
│   ├── inference.py
│   ├── schemas.py
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.jsx
│   │   └── components/
│   ├── index.html
│   ├── package.json
│   └── vite.config.js
├── data/
│   ├── processed/          ← .npz files only (~200MB), gitignored
│   └── metadata.csv        ← generated, gitignored
├── outputs/
│   ├── best.pt
│   ├── confusion_matrix.png
│   └── roc_curve.png
├── scripts/
│   ├── run_local.sh
│   ├── run_cloud.sh
│   ├── start_api.sh
│   └── start_frontend.sh
├── tests/
│   └── test_model.py
├── docs/
│   ├── reproducibility_checklist.md
│   ├── results_template.md
│   └── README_DEMO.md
├── .gitignore
├── requirements.txt
└── README.md
```
---
## Suggestions & Overrides
### ⚠️ Old files to DEPRECATE (keep but do not use)
`src/data/build_metadata.py` and `src/data/extract_frames.py` were written for a local download workflow. They are superseded by `stream_ff_dataset.py`. Keep them in the repo for reference but do not run them.
### ⚠️ ViT Input Resolution
Frames are extracted at 224×224 directly in the streaming script. No resizing needed elsewhere.
### ⚠️ Internet Required for Phase 1
The streaming script needs internet during the ~20–60 min preprocessing run. After that, everything runs offline from the `.npz` files.
### ⚠️ Pre-trained Checkpoint Option
Use `timm`'s pretrained ViT weights (ImageNet). Fine-tuning for 5–10 epochs on 400 videos is sufficient for a compelling defence demo.
### ✅ Frontend: Keep it Simple
Single-page upload → result. No auth, no database needed.