	# DeepFake Eye-Blink Detection — Cursor AI Build Plan

	## Project Overview

An Enhanced Eye-Blinking LRCN (Long-term Recurrent Convolutional Network) for DeepFake detection using Attentive Adversarial Training (AAT) with a Vision Transformer (ViT) backbone. The research is by Alina Chikwado Godsaves, under the supervision of Mr. Akanji.

	The goal is to detect deepfake videos by analyzing unnatural eye-blinking patterns and fine-grained ocular artifacts (eyelid dynamics, pupil reflections) using a hybrid CNN/LSTM + ViT model hardened with adversarial training.

Python version: 3.11 (the pinned `torch==2.2.2` has no wheels for Python 3.13)

	---

	## What Is Already Scaffolded (Do NOT Recreate)

	All files below exist but need to be verified, completed, and wired together:

| Area | Files | Status |
|------|-------|--------|
| Config system | `configs/base.yaml`, `configs/model/lrcn_vit.yaml`, `configs/train/aat_pgd.yaml` | ✅ Exists |
| Data pipeline | `src/data/build_metadata.py`, `src/data/extract_frames.py`, `src/data/extract_eye_sequences.py`, `src/data/dataset.py` | ✅ Exists, verify |
| Model | `src/models/backbones.py`, `src/models/lrcn_vit.py` | ✅ Exists, verify |
| Training | `src/train/train.py`, `src/train/adversarial.py` | ✅ Exists, verify |
| Evaluation | `src/eval/evaluate.py`, `src/eval/ablation.py`, `src/eval/plots.py` | ✅ Exists, verify |
| Explainability | `src/viz/attention_maps.py` | ✅ Exists |
| Scripts | `scripts/run_local.sh`, `scripts/run_cloud.sh` | ✅ Exists |
| Docs | `docs/reproducibility_checklist.md`, `docs/results_template.md` | ✅ Exists |

	---

	## Phase 0 — Environment & Dependency Fix (FIRST PRIORITY)

	Goal: Get a working Python 3.11 venv with all ML/CV deps installed.

	### Tasks

	- [ ] 0.1 Confirm `python3.11` is available, or install via `pyenv` / system package manager
	- [ ] 0.2 Create venv: `python3.11 -m venv .venv311 && source .venv311/bin/activate`
	- [ ] 0.3 Pin exact working versions in `requirements.txt`:
	```
	torch==2.2.2
	torchvision==0.17.2
	timm==0.9.16
	opencv-python-headless==4.9.0.80
	mediapipe==0.10.11
	pandas==2.2.2
	numpy==1.26.4
	scikit-learn==1.4.2
	matplotlib==3.8.4
	seaborn==0.13.2
	tqdm==4.66.4
grad-cam==1.5.0  # PyPI name of the pytorch-grad-cam project (imported as pytorch_grad_cam)
	Pillow==10.3.0
	pyyaml==6.0.1
	albumentations==1.4.3
	einops==0.7.0
	wandb==0.17.0
	datasets==2.19.0
	huggingface_hub==0.23.0
	av==12.0.0
	```
	- [ ] 0.4 Run `pip install -r requirements.txt` inside venv and confirm zero errors
	- [ ] 0.5 Smoke-test: `python -c "import torch; import timm; import mediapipe; import datasets; print('OK')"`
	- [ ] 0.6 Update `scripts/run_local.sh` to activate `.venv311` before any python calls
	- [ ] 0.7 One-time HuggingFace login (only needed once per machine):
	```bash
	huggingface-cli login
	# Paste your token from https://huggingface.co/settings/tokens
	# Token needs Read access only
	```

	---

	## Phase 1 — Dataset via HuggingFace Streaming (NO DOWNLOAD NEEDED)

Goal: Stream FaceForensics++ c23 videos directly from HuggingFace one at a time, extract eye sequences into small `.npz` files, and discard each video. No raw videos are kept on disk; each one exists only as a short-lived temp file while it is being processed.

	### How Streaming Works

	```
	HuggingFace server
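→ sends video #1 (~5MB, held in RAM plus a short-lived temp file)
→ MediaPipe extracts eye crops + EAR signal
→ saves one .npz file per 16-frame sequence to data/processed/
→ temp file is deleted, video is gone from memory
→ repeat for video #2, #3 ... #400
```

At the end: several hundred `.npz` files, one per 16-frame sequence, so more files than the 400 videos. Zero raw videos on disk. The plan budgets ~100–300MB in total; each sequence holds 16 × 224 × 224 × 3 bytes (about 2.4MB) before compression, so confirm the real footprint with `du -sh data/processed/` early in the run.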
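
The EAR (Eye Aspect Ratio) signal mentioned above is the ratio of the two vertical eyelid distances to the horizontal eye width, computed from six landmarks per eye. The following is a minimal, dependency-free sketch that shows why the value drops during a blink. The `open_eye` and `closed_eye` coordinates are made up for illustration and are not MediaPipe output.

```python
import math

def eye_aspect_ratio(points):
    """EAR from six (x, y) points ordered p1..p6; p1 and p4 are the eye corners."""
    p1, p2, p3, p4, p5, p6 = points
    vertical_a = math.dist(p2, p6)   # first upper-lid / lower-lid pair
    vertical_b = math.dist(p3, p5)   # second upper-lid / lower-lid pair
    horizontal = math.dist(p1, p4)   # corner-to-corner eye width
    return (vertical_a + vertical_b) / (2.0 * horizontal + 1e-6)

# Illustrative pixel coordinates (hypothetical, not real landmarks).
open_eye = [(0, 0), (10, 6), (20, 6), (30, 0), (20, -6), (10, -6)]
closed_eye = [(0, 0), (10, 1), (20, 1), (30, 0), (20, -1), (10, -1)]

ear_open = eye_aspect_ratio(open_eye)
ear_closed = eye_aspect_ratio(closed_eye)
print(f"open: {ear_open:.3f}, closed: {ear_closed:.3f}")
```

With these points the open eye lands well above the 0.2 threshold used later in the plan and the closed eye well below it.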

	### Dataset

	Source: `bitmind/FaceForensicsC23` on HuggingFace
	URL: https://huggingface.co/datasets/bitmind/FaceForensicsC23
	Contents: 7,000 MP4 videos — 1,000 real + 6,000 deepfakes (Deepfakes, Face2Face, FaceShifter, FaceSwap, NeuralTextures, DeepFakeDetection), c23 compression
	We use: 200 real (`/Real/`) + 200 fake (`/Deepfakes/`) = 400 videos total

	### Tasks

	- [ ] 1.1 Create `src/data/stream_ff_dataset.py` — a NEW script that replaces the old download-based `build_metadata.py` + `extract_frames.py` flow:

	```python
"""
Stream FaceForensics++ c23 from HuggingFace.
Fetches one video at a time, extracts eye sequences,
saves .npz files, discards the video. No raw videos are kept on disk
(each video is written to a short-lived temp file so OpenCV can read it).

Usage:
    python -m src.data.stream_ff_dataset \
        --out-root data/processed \
        --num-real 200 \
        --num-fake 200
"""
import csv
import os
import tempfile
from pathlib import Path

import cv2
import mediapipe as mp
import numpy as np
from datasets import load_dataset
from tqdm import tqdm

HF_DATASET = "bitmind/FaceForensicsC23"
REAL_PATH_MARKER = "/Real/"
FAKE_PATH_MARKER = "/Deepfakes/"  # use only the Deepfakes subfolder, not all 6


def compute_ear(landmarks, eye_indices):
    """Compute Eye Aspect Ratio from MediaPipe landmarks.

    Landmark x/y are normalized by frame width and height separately, so on
    non-square frames this EAR is scaled by the frame's aspect ratio.
    """
    # eye_indices: [p1, p2, p3, p4, p5, p6]; p1 and p4 are the eye corners
    p = [landmarks[i] for i in eye_indices]
    A = np.linalg.norm(np.array([p[1].x, p[1].y]) - np.array([p[5].x, p[5].y]))
    B = np.linalg.norm(np.array([p[2].x, p[2].y]) - np.array([p[4].x, p[4].y]))
    C = np.linalg.norm(np.array([p[0].x, p[0].y]) - np.array([p[3].x, p[3].y]))
    return (A + B) / (2.0 * C + 1e-6)

def extract_sequences_from_video_bytes(video_bytes, label, video_id, seq_len=16):
    """
    Given raw video bytes, extract non-overlapping eye-region sequences.
    Returns list of dicts: {'frames': (T,H,W,3), 'ear': (T,), 'label': int, 'video_id': str}
    """
    # Write to a temp file so OpenCV can read it
    with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as f:
        f.write(video_bytes)
        tmp_path = f.name

    sequences = []
    face_mesh = mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False, max_num_faces=1, refine_landmarks=True
    )

    # MediaPipe eye landmark indices (left eye outer→inner, right eye similar)
    LEFT_EYE = [33, 160, 158, 133, 153, 144]
    RIGHT_EYE = [362, 385, 387, 263, 373, 380]

    cap = cv2.VideoCapture(tmp_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    frame_interval = max(1, int(fps / 10))  # sample at ~10fps

    all_frames, all_ears = [], []
    frame_idx = 0

    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx % frame_interval == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                result = face_mesh.process(rgb)
                if result.multi_face_landmarks:
                    lm = result.multi_face_landmarks[0].landmark
                    h, w = frame.shape[:2]

                    # Compute EAR (average of both eyes)
                    left_ear = compute_ear(lm, LEFT_EYE)
                    right_ear = compute_ear(lm, RIGHT_EYE)
                    ear = (left_ear + right_ear) / 2.0

                    # Crop eye region: bounding box around both eyes, 20px margin
                    eye_pts = [lm[i] for i in LEFT_EYE + RIGHT_EYE]
                    xs = [int(p.x * w) for p in eye_pts]
                    ys = [int(p.y * h) for p in eye_pts]
                    x1, x2 = max(0, min(xs) - 20), min(w, max(xs) + 20)
                    y1, y2 = max(0, min(ys) - 20), min(h, max(ys) + 20)
                    crop = rgb[y1:y2, x1:x2]

                    if crop.size > 0:
                        crop = cv2.resize(crop, (224, 224))  # ViT input size
                        all_frames.append(crop)
                        all_ears.append(ear)

            frame_idx += 1
    finally:
        # Always release resources and delete the temp file, even on errors
        cap.release()
        face_mesh.close()
        os.unlink(tmp_path)

    # Slice into non-overlapping sequences of length seq_len;
    # a trailing remainder shorter than seq_len is dropped
    for i in range(0, len(all_frames) - seq_len + 1, seq_len):
        frames = np.stack(all_frames[i:i + seq_len]).astype(np.uint8)
        ears = np.array(all_ears[i:i + seq_len], dtype=np.float32)
        sequences.append({
            'frames': frames,
            'ear': ears,
            'label': label,
            'video_id': f"{video_id}_seq{i}"
        })

    return sequences


def stream_and_extract(out_root, num_real=200, num_fake=200, seq_len=16, seed=42):
    out_root = Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)

    # Stream the dataset; this never downloads the full archive
    ds = load_dataset(HF_DATASET, streaming=True, split="train")

    real_count, fake_count = 0, 0
    metadata_rows = []

    pbar = tqdm(total=num_real + num_fake, desc="Streaming videos")

    for item in ds:
        # Assumption: when streaming, item['video'] is a dict with 'bytes' and
        # 'path' keys. Check the dataset card if the schema differs.
        video_field = item.get('video')
        if not isinstance(video_field, dict) or video_field.get('bytes') is None:
            continue
        video_path_str = str(video_field.get('path') or '')

        is_real = REAL_PATH_MARKER in video_path_str and real_count < num_real
        is_fake = FAKE_PATH_MARKER in video_path_str and fake_count < num_fake

        if not is_real and not is_fake:
            continue

        label = 0 if is_real else 1
        video_id = Path(video_path_str).stem

        sequences = extract_sequences_from_video_bytes(
            video_field['bytes'], label, video_id, seq_len
        )

        for seq in sequences:
            npz_name = f"{seq['video_id']}.npz"
            npz_path = out_root / npz_name
            np.savez_compressed(
                npz_path,
                frames=seq['frames'],
                ear=seq['ear'],
                label=np.array(seq['label']),
                video_id=np.array(seq['video_id'])
            )
            metadata_rows.append({
                'npz_path': str(npz_path),
                'label': label,
                'video_id': video_id,
                'split': 'train'  # will be reassigned below
            })

        if is_real:
            real_count += 1
        else:
            fake_count += 1

        pbar.update(1)

        if real_count >= num_real and fake_count >= num_fake:
            break

    pbar.close()

    # Assign splits: 70% train, 15% val, 15% test (by video_id, not sequence).
    # IDs are sorted and shuffled with a fixed seed so the split is reproducible.
    unique_ids = sorted({r['video_id'] for r in metadata_rows})
    np.random.default_rng(seed).shuffle(unique_ids)
    n = len(unique_ids)
    train_ids = set(unique_ids[:int(0.7 * n)])
    val_ids = set(unique_ids[int(0.7 * n):int(0.85 * n)])

    for row in metadata_rows:
        if row['video_id'] in train_ids:
            row['split'] = 'train'
        elif row['video_id'] in val_ids:
            row['split'] = 'val'
        else:
            row['split'] = 'test'

    # Write metadata CSV
    csv_path = Path('data/metadata.csv')
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['npz_path', 'label', 'video_id', 'split'])
        writer.writeheader()
        writer.writerows(metadata_rows)

    print(f"\nDone! {real_count} real + {fake_count} fake videos processed.")
    print(f"Total sequences: {len(metadata_rows)}")
    print(f"Metadata written to: {csv_path}")
    print(f"Sequences saved to: {out_root}")


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--out-root', default='data/processed')
    parser.add_argument('--num-real', type=int, default=200)
    parser.add_argument('--num-fake', type=int, default=200)
    parser.add_argument('--seq-len', type=int, default=16)
    args = parser.parse_args()
    stream_and_extract(args.out_root, args.num_real, args.num_fake, args.seq_len)
	```

	- [ ] 1.2 Run the streaming script:
	```bash
	source .venv311/bin/activate
	python -m src.data.stream_ff_dataset \
	--out-root data/processed \
	--num-real 200 \
	--num-fake 200
	```
	This will run for ~20–60 minutes depending on internet speed. It streams each video, processes it, saves a tiny `.npz`, and moves on. Your terminal will show a progress bar.

	- [ ] 1.3 When done, verify output:
	```bash
ls data/processed/ | wc -l # should be several hundred .npz files
	du -sh data/processed/ # should be ~100-300MB total
	python -c "
	import numpy as np
	d = np.load('data/processed/' + __import__('os').listdir('data/processed')[0], allow_pickle=True)
	print('frames:', d['frames'].shape) # expect (16, 224, 224, 3)
	print('ear:', d['ear'].shape) # expect (16,)
	print('label:', d['label']) # expect 0 or 1
	"
	```

	- [ ] 1.4 Verify `data/metadata.csv` has rows with `npz_path`, `label`, `video_id`, `split` columns and a healthy mix of train/val/test rows

	- [ ] 1.5 Update `src/data/dataset.py` to read from `data/metadata.csv` (pointing to `.npz` files) instead of from raw video paths. The `__getitem__` contract remains unchanged:
	```python
	{'frames': Tensor[T,3,224,224], 'ear': Tensor[T], 'label': int}
	```

	- [ ] 1.6 Update `configs/base.yaml`:
	```yaml
data:
  metadata_csv: data/metadata.csv
  processed_root: data/processed
  seq_len: 16
  img_size: 224
	```

	- [ ] 1.7 Add to `.gitignore`:
	```
	data/processed/
	data/raw/
	data/metadata.csv
	outputs/
	*.npz
	*.pt
	.venv311/
	```
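
For task 1.4, a short standard-library script can report the split mix and confirm that no video leaks across splits. This is only a sketch: `summarize_splits` is a hypothetical helper name, and the demo runs on a few in-memory rows with the same columns as `data/metadata.csv` rather than on the real file.

```python
import csv
import io
from collections import Counter, defaultdict

def summarize_splits(rows):
    """Return (row count per split, video_ids that appear in more than one split)."""
    per_split = Counter(row["split"] for row in rows)
    splits_by_video = defaultdict(set)
    for row in rows:
        splits_by_video[row["video_id"]].add(row["split"])
    leaked = sorted(v for v, s in splits_by_video.items() if len(s) > 1)
    return per_split, leaked

# Demo rows with the columns the streaming script writes (values are made up).
demo_csv = io.StringIO(
    "npz_path,label,video_id,split\n"
    "data/processed/000_seq0.npz,0,000,train\n"
    "data/processed/000_seq16.npz,0,000,train\n"
    "data/processed/001_seq0.npz,1,001,val\n"
    "data/processed/002_seq0.npz,1,002,test\n"
)
rows = list(csv.DictReader(demo_csv))
per_split, leaked = summarize_splits(rows)
print(dict(per_split), "leaked:", leaked)
```

To check the real output, replace `demo_csv` with `open('data/metadata.csv', newline='')`. An empty `leaked` list means every video's sequences sit in exactly one split.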

	---

	## Phase 2 — Dataset Loader Verification

	Goal: Confirm `src/data/dataset.py` correctly reads the `.npz` files produced by streaming.

	### Tasks

	- [ ] 2.1 Open `src/data/dataset.py` — update it to read from `metadata.csv` instead of raw video paths. Each row's `npz_path` points directly to a processed sequence file.
	- [ ] 2.2 Add `albumentations` augmentations for training split only:
	```python
	import albumentations as A
	from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, p=0.5),
    A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
    A.ImageCompression(quality_lower=70, quality_upper=100, p=0.3),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
	```
	- [ ] 2.3 Smoke-test the DataLoader:
	```python
	from src.data.dataset import EyeBlinkDataset
	ds = EyeBlinkDataset('data/metadata.csv', split='train')
	sample = ds[0]
	assert sample['frames'].shape == (16, 3, 224, 224)
	assert sample['ear'].shape == (16,)
	assert sample['label'] in [0, 1]
	print("DataLoader OK")
	```
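
The `.npz` files store frames as `(T, H, W, 3)` uint8, while the `__getitem__` contract expects `(T, 3, 224, 224)` floats. A minimal NumPy sketch of that conversion, using the same ImageNet mean and std as the `A.Normalize` step above, might look like the following. `to_model_layout` is a hypothetical helper; the real `dataset.py` may do this through albumentations and `ToTensorV2` instead.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_layout(frames_uint8):
    """(T, H, W, 3) uint8 in [0, 255] -> (T, 3, H, W) float32, ImageNet-normalized."""
    x = frames_uint8.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD        # broadcasts over the channel axis
    return np.ascontiguousarray(x.transpose(0, 3, 1, 2))

# Dummy all-black sequence standing in for d['frames'] from one .npz file.
dummy = np.zeros((16, 224, 224, 3), dtype=np.uint8)
out = to_model_layout(dummy)
print(out.shape, out.dtype)
```

Wrapping the result with `torch.from_numpy(out)` would then give the tensor named in the contract.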

	---

	## Phase 3 — Model Architecture Verification & Fix

	Goal: Ensure the LRCN + ViT hybrid model is correctly implemented and matches the research proposal.

	### Architecture Spec (from proposal)

	```
	Input: eye-region sequence (T=16 frames, each 224×224 RGB) + EAR signal (T floats)
	↓
	ViT Backbone (timm: vit_small_patch16_224, pretrained=True)
	→ Per-frame [CLS] token → shape (T, 384)
	↓
	LSTM Temporal Encoder
	→ Hidden size: 256, Num layers: 2, Dropout: 0.3
	↓
	Blink Dynamics Head
	→ Concatenate LSTM output + EAR
	→ FC(257, 128) → ReLU
	→ Blink timing constraint (0.1–0.4s window)
	↓
	Classifier Head
	→ FC(256, 128) → ReLU → Dropout(0.5) → FC(128, 2)
	→ Output: [real_logit, fake_logit]
	```
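
The per-frame ViT step in the spec depends on folding the time axis into the batch axis before the backbone and unfolding it afterwards. The NumPy sketch below shows only that reshape bookkeeping. `fake_backbone` is a stand-in that returns a 384-dimensional vector per frame; it is not a real ViT, and `encode_sequence` is a hypothetical name.

```python
import numpy as np

EMBED_DIM = 384  # embed dim of vit_small_patch16_224, per the spec

def fake_backbone(images):
    """Stand-in for the ViT: (N, 3, H, W) -> (N, EMBED_DIM). Not a real model."""
    pooled = images.mean(axis=(2, 3))              # (N, 3) per-channel means
    return np.tile(pooled, (1, EMBED_DIM // 3))    # repeat to (N, 384)

def encode_sequence(frames):
    """(B, T, 3, H, W) -> (B, T, EMBED_DIM) by folding T into the batch axis."""
    b, t, c, h, w = frames.shape
    flat = frames.reshape(b * t, c, h, w)          # fold time into batch
    feats = fake_backbone(flat)                    # (B*T, EMBED_DIM)
    return feats.reshape(b, t, EMBED_DIM)          # unfold back to (B, T, D)

frames = np.random.default_rng(0).random((2, 16, 3, 224, 224), dtype=np.float32)
feats = encode_sequence(frames)
print(feats.shape)
```

Because both reshapes use the same row-major order, the feature at `feats[b, t]` still belongs to frame `t` of clip `b`, which is what the LSTM needs.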

	### Tasks

	- [ ] 3.1 Open `src/models/backbones.py` — verify `build_backbone(config)` returns a timm ViT. For `vit_small_patch16_224` embed dim = 384.
	- [ ] 3.2 Open `src/models/lrcn_vit.py` — verify forward pass. Frames arrive as `(B, T, 3, 224, 224)`. Reshape to `(B*T, 3, 224, 224)` before ViT, then reshape back to `(B, T, embed_dim)` before LSTM.
	- [ ] 3.3 Add attention consistency loss: KL-divergence between adjacent frame ViT attention maps, weighted by `lambda_attn`.
	- [ ] 3.4 Add blink timing regularizer: penalize uncertain predictions when EAR < 0.2 but blink duration is outside 0.1–0.4s. Weight: `lambda_blink`.
	- [ ] 3.5 Add unit test in `tests/test_model.py`:
	```python
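import torch
from src.models.lrcn_vit import LRCNViT

# `config` is the parsed model config (e.g. loaded from configs/model/lrcn_vit.yaml)
model = LRCNViT(config)
dummy = {'frames': torch.randn(2, 16, 3, 224, 224), 'ear': torch.randn(2, 16)}
out = model(dummy)
assert out['logits'].shape == (2, 2)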
	```
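
For task 3.4, the regularizer first needs blink durations from the EAR signal. Below is a minimal sketch of that measurement only, not of the loss itself: it treats consecutive frames with EAR < 0.2 as one blink and converts the run length to seconds. The function names and the 10 fps default are assumptions; 10 fps approximates the sampling rate the streaming script aims for.

```python
def blink_durations(ear, fps=10.0, threshold=0.2):
    """Return the duration in seconds of every run of frames with EAR < threshold."""
    durations, run = [], 0
    for value in ear:
        if value < threshold:
            run += 1
        elif run:
            durations.append(run / fps)
            run = 0
    if run:  # a blink still in progress at the end of the sequence
        durations.append(run / fps)
    return durations

def implausible_blinks(durations, min_s=0.1, max_s=0.4):
    """Durations that fall outside the 0.1-0.4 s window named in the spec."""
    return [d for d in durations if d < min_s or d > max_s]

# Made-up EAR trace: one 2-frame blink and one 5-frame blink.
ear = [0.31, 0.30, 0.12, 0.10, 0.29, 0.30, 0.15, 0.14, 0.13, 0.12, 0.11, 0.28]
durations = blink_durations(ear)
print(durations, implausible_blinks(durations))
```

At 10 fps the time resolution is 0.1 s per frame, so a 16-frame sequence can only distinguish blink lengths in coarse steps; that is worth keeping in mind when weighting `lambda_blink`.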

	---

	## Phase 4 — Training Loop Fix & Wire-Up

	Goal: Get the full training loop running end-to-end with adversarial training and all loss components.

	### Tasks

	- [ ] 4.1 Open `src/train/train.py` — verify it loads config, DataLoader, model, AdamW, LR scheduler, and saves `outputs/best.pt` on val AUC improvement.
	- [ ] 4.2 Wire in `wandb`: if `config.wandb.enabled: true`, call `wandb.init()` and log metrics each epoch.
	- [ ] 4.3 Total loss formula:
	```
	L_total = L_ce(clean)
	+ alpha * L_ce(adversarial)
	+ lambda_attn * L_attn_consistency
	+ lambda_blink * L_blink_regularizer
	```
	- [ ] 4.4 Open `src/train/adversarial.py` — verify PGD: `eps=8/255`, `steps=10`, applied only to eye-region frames.
	- [ ] 4.5 Add gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`
	- [ ] 4.6 Update `configs/train/aat_pgd.yaml`:
	```yaml
	epochs: 30
	batch_size: 16
	lr: 3e-4
	weight_decay: 1e-4
	alpha: 0.5
	lambda_attn: 0.1
	lambda_blink: 0.05
pgd_eps: 0.031  # ≈ 8/255
pgd_steps: 10
wandb:
  enabled: false
  project: "deepfake-eye-blink"
	```
	- [ ] 4.7 Smoke-train: 2 epochs on 50 samples — confirm zero errors.
	- [ ] 4.8 Full training: `python -m src.train.train --config configs/train/aat_pgd.yaml`
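
Task 4.4 specifies an L∞ PGD attack with `eps=8/255` and 10 steps. The sketch below shows the attack loop on a toy logistic-regression model in NumPy, where the input gradient has a closed form; the real `adversarial.py` would obtain that gradient from PyTorch autograd instead. The step size `2.5 * eps / steps` and all function names are assumptions, since the plan does not specify them.

```python
import numpy as np

def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, (p - y) * w

def pgd_attack(x, y, w, b, eps=8 / 255, steps=10, step_size=None):
    """L-infinity PGD: ascend the loss, then project into the eps-ball and [0, 1]."""
    step_size = 2.5 * eps / steps if step_size is None else step_size
    x_adv = x.copy()
    for _ in range(steps):
        _, grad = loss_and_grad(x_adv, y, w, b)
        x_adv = x_adv + step_size * np.sign(grad)   # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep a valid pixel range
    return x_adv

rng = np.random.default_rng(0)
x = rng.random(64)                        # toy "image" with pixels in [0, 1]
w, b, y = rng.normal(size=64), 0.0, 1.0   # toy weights, bias and true label
x_adv = pgd_attack(x, y, w, b)
clean_loss, _ = loss_and_grad(x, y, w, b)
adv_loss, _ = loss_and_grad(x_adv, y, w, b)
print(f"clean loss {clean_loss:.4f} -> adversarial loss {adv_loss:.4f}")
```

This version starts from the clean input; many PGD implementations add a random start inside the eps-ball. Note that the eps-ball is defined on pixels in [0, 1], so the attack should be applied before ImageNet normalization or eps must be rescaled accordingly.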

	---

	## Phase 5 — Evaluation & Ablation

	Goal: Produce evaluation numbers and ablation table for the thesis.

	### Tasks

	- [ ] 5.1 Open `src/eval/evaluate.py` — verify it outputs Accuracy, Precision, Recall, F1, AUC.
	- [ ] 5.2 Run: `python -m src.eval.evaluate --checkpoint outputs/best.pt --config configs/train/aat_pgd.yaml`
	- [ ] 5.3 Open `src/eval/ablation.py` — confirm 4 configs: Full / No AAT / No ViT / No blink regularizer.
	- [ ] 5.4 Run ablation: `python -m src.eval.ablation --config configs/train/aat_pgd.yaml`
	- [ ] 5.5 Open `src/eval/plots.py` — confirm it generates `confusion_matrix.png` and `roc_curve.png`.
	- [ ] 5.6 Fill in `docs/results_template.md` with actual numbers.
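
For task 5.1, the five metrics can be cross-checked against a small reference implementation. The sketch below uses only the standard library and treats label 1 (fake) as the positive class. `binary_metrics` is a hypothetical helper, and the real `evaluate.py` may well use scikit-learn instead. AUC is computed as the fraction of (fake, real) pairs in which the fake sample gets the higher score, with ties counting half.

```python
def binary_metrics(labels, scores, threshold=0.5):
    """Accuracy, precision, recall, F1 and ROC AUC for binary labels (1 = fake)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Pairwise (Mann-Whitney) form of ROC AUC; fine for small evaluation sets.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")
    return {"accuracy": (tp + tn) / len(labels), "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}

labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]   # made-up predicted probability of "fake"
metrics = binary_metrics(labels, scores)
print(metrics)
```

On this toy input, 7 of the 9 (fake, real) pairs are ranked correctly, which is where the AUC value comes from; accuracy, precision, recall and F1 depend on the 0.5 threshold while AUC does not.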

	---

	## Phase 6 — Inference API

	Goal: FastAPI server that accepts an uploaded video and returns a prediction.

	### New files
	```
api/
  main.py
  inference.py      # reuses the same eye extraction logic from stream_ff_dataset.py
  schemas.py
  requirements.txt
	```

	### Tasks

	- [ ] 6.1 `api/inference.py` — reuse `extract_sequences_from_video_bytes()` from `stream_ff_dataset.py`. Load model once, run forward pass on all sequences, average predictions across sequences.
	- [ ] 6.2 `api/main.py` — `/predict` endpoint (POST, multipart file upload) + `/health` endpoint.
	- [ ] 6.3 Load model at startup via FastAPI `lifespan`, not per-request.
	- [ ] 6.4 Add CORS for `http://localhost:5173`.
	- [ ] 6.5 `api/requirements.txt`: `fastapi>=0.111.0`, `uvicorn[standard]`, `python-multipart>=0.0.9`
	- [ ] 6.6 Test: `curl -X POST http://localhost:8000/predict -F "file=@test_video.mp4"`
	- [ ] 6.7 `scripts/start_api.sh`:
	```bash
	source .venv311/bin/activate
	uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
	```
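
Task 6.1 averages predictions across all sequences of an uploaded video. One way to do that, sketched here with the standard library only, is to softmax each sequence's `[real_logit, fake_logit]` pair and average the fake probabilities. The function names, the 0.5 decision threshold and the result fields are assumptions, not the API's actual schema.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_video(sequence_logits, threshold=0.5):
    """Average per-sequence fake probabilities into one video-level verdict."""
    if not sequence_logits:
        raise ValueError("no sequences extracted (no face detected or video too short)")
    fake_probs = [softmax(pair)[1] for pair in sequence_logits]
    mean_fake = sum(fake_probs) / len(fake_probs)
    return {
        "verdict": "FAKE" if mean_fake >= threshold else "REAL",
        "fake_probability": mean_fake,
        "per_sequence": fake_probs,
    }

# Three hypothetical sequences, each with [real_logit, fake_logit].
result = aggregate_video([[0.2, 1.4], [1.0, 1.0], [-0.5, 2.0]])
print(result["verdict"], round(result["fake_probability"], 3))
```

The empty-input check matters in practice: a clip shorter than 16 sampled frames, or one where no face is detected, yields no sequences, and the endpoint should return a clear error rather than a verdict.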

	---

	## Phase 7 — Demo Frontend

	Goal: React web UI for the defence demonstration.

	### Stack: React + Vite + Tailwind + Recharts

	```
	frontend/
	src/
	App.jsx
	components/
	VideoUploader.jsx
	ResultCard.jsx
	FrameChart.jsx
	AttentionViewer.jsx
	index.html
	package.json
	vite.config.js
	```

	### Tasks

	- [ ] 7.1 `cd frontend && npm create vite@latest . -- --template react && npm install`
	- [ ] 7.2 `npm install tailwindcss recharts axios`
	- [ ] 7.3 `VideoUploader.jsx`: drag-and-drop or file picker for `.mp4/.avi/.mov`, video preview, "Analyse Video" button, loading spinner.
	- [ ] 7.4 `ResultCard.jsx`: REAL (green) / FAKE (red) verdict badge, confidence %, blink rate stat.
	- [ ] 7.5 `FrameChart.jsx`: Recharts line chart of per-frame fake probability, frames above 0.5 highlighted red.
	- [ ] 7.6 `AttentionViewer.jsx`: Grad-CAM attention overlay image from API response.
	- [ ] 7.7 Proxy in `vite.config.js`: `/predict` → `http://localhost:8000/predict`
	- [ ] 7.8 `frontend/.env`: `VITE_API_URL=http://localhost:8000`
	- [ ] 7.9 `scripts/start_frontend.sh`:
	```bash
	cd frontend && npm run dev
	```

	---

	## Phase 8 — Integration & Final QA

- [ ] 8.1 Run API + frontend together. Upload a FaceForensics++ video as a test; the streaming run keeps no raw videos, so download a single file from the dataset for this.
	- [ ] 8.2 Test with a real webcam recording — should return REAL.
	- [ ] 8.3 Fix any CORS issues.
	- [ ] 8.4 Create `docs/README_DEMO.md`:
	```
	1. source .venv311/bin/activate
	2. ./scripts/start_api.sh (Terminal 1)
	3. ./scripts/start_frontend.sh (Terminal 2)
	4. Open http://localhost:5173
	```
	- [ ] 8.5 Document exact setup commands for a fresh machine.

	---

	## Project Directory Structure (Final)

	```
	deepfake-detector/
	├── configs/
	│ ├── base.yaml
	│ ├── model/lrcn_vit.yaml
	│ └── train/aat_pgd.yaml
	├── src/
	│ ├── data/
	│ │ ├── stream_ff_dataset.py ← NEW (replaces download-based flow)
	│ │ ├── extract_eye_sequences.py
	│ │ └── dataset.py
	│ ├── models/
	│ │ ├── backbones.py
	│ │ └── lrcn_vit.py
	│ ├── train/
	│ │ ├── train.py
	│ │ └── adversarial.py
	│ ├── eval/
	│ │ ├── evaluate.py
	│ │ ├── ablation.py
	│ │ └── plots.py
	│ ├── viz/
	│ │ └── attention_maps.py
	│ └── utils.py
	├── api/
	│ ├── main.py
	│ ├── inference.py
	│ ├── schemas.py
	│ └── requirements.txt
	├── frontend/
	│ ├── src/
	│ │ ├── App.jsx
	│ │ └── components/
	│ ├── index.html
	│ ├── package.json
	│ └── vite.config.js
	├── data/
	│ ├── processed/ ← .npz files only (~200MB), gitignored
	│ └── metadata.csv ← generated, gitignored
	├── outputs/
	│ ├── best.pt
	│ ├── confusion_matrix.png
	│ └── roc_curve.png
	├── scripts/
	│ ├── run_local.sh
	│ ├── run_cloud.sh
	│ ├── start_api.sh
	│ └── start_frontend.sh
	├── tests/
	│ └── test_model.py
	├── docs/
	│ ├── reproducibility_checklist.md
	│ ├── results_template.md
	│ └── README_DEMO.md
	├── .gitignore
	├── requirements.txt
	└── README.md
	```

	---

	## Suggestions & Overrides

	### ⚠️ Old files to DEPRECATE (keep but do not use)
	`src/data/build_metadata.py` and `src/data/extract_frames.py` were written for a local download workflow. They are superseded by `stream_ff_dataset.py`. Keep them in the repo for reference but do not run them.

	### ⚠️ ViT Input Resolution
	Frames are extracted at 224×224 directly in the streaming script. No resizing needed elsewhere.

	### ⚠️ Internet Required for Phase 1
	The streaming script needs internet during the ~20–60 min preprocessing run. After that, everything runs offline from the `.npz` files.

	### ⚠️ Pre-trained Checkpoint Option
Use `timm`'s ImageNet-pretrained ViT weights rather than training from scratch. Fine-tuning for 5–10 epochs on the 400 videos should be enough for a defence demo.

	### ✅ Frontend: Keep it Simple
	Single-page upload → result. No auth, no database needed.