# DeepFake Eye-Blink Detection: Cursor AI Build Plan

## Project Overview

An Enhanced Eye-Blinking LRCN (Long-term Recurrent Convolutional Network) for DeepFake detection using Attentive Adversarial Training (AAT) with a Vision Transformer (ViT) backbone. The research is by **Alina Chikwado Godsaves** under the supervision of **Mr. Akanji**.

The goal is to detect deepfake videos by analyzing unnatural eye-blinking patterns and fine-grained ocular artifacts (eyelid dynamics, pupil reflections) using a hybrid CNN/LSTM + ViT model hardened with adversarial training.

**Python version: 3.11** (the PyTorch version pinned in Phase 0 does not install on 3.13)

---
## What Is Already Scaffolded (Do NOT Recreate)

All files below exist but need to be **verified, completed, and wired together**:

| Area | Files | Status |
|------|-------|--------|
| Config system | `configs/base.yaml`, `configs/model/lrcn_vit.yaml`, `configs/train/aat_pgd.yaml` | ✅ Exists |
| Data pipeline | `src/data/build_metadata.py`, `src/data/extract_frames.py`, `src/data/extract_eye_sequences.py`, `src/data/dataset.py` | ✅ Exists, verify |
| Model | `src/models/backbones.py`, `src/models/lrcn_vit.py` | ✅ Exists, verify |
| Training | `src/train/train.py`, `src/train/adversarial.py` | ✅ Exists, verify |
| Evaluation | `src/eval/evaluate.py`, `src/eval/ablation.py`, `src/eval/plots.py` | ✅ Exists, verify |
| Explainability | `src/viz/attention_maps.py` | ✅ Exists |
| Scripts | `scripts/run_local.sh`, `scripts/run_cloud.sh` | ✅ Exists |
| Docs | `docs/reproducibility_checklist.md`, `docs/results_template.md` | ✅ Exists |

---
## Phase 0: Environment & Dependency Fix (FIRST PRIORITY)

**Goal:** Get a working Python 3.11 venv with all ML/CV deps installed.

### Tasks

- [ ] **0.1** Confirm `python3.11` is available, or install via `pyenv` / system package manager
- [ ] **0.2** Create venv: `python3.11 -m venv .venv311 && source .venv311/bin/activate`
- [ ] **0.3** Pin exact working versions in `requirements.txt`:
```
torch==2.2.2
torchvision==0.17.2
timm==0.9.16
opencv-python-headless==4.9.0.80
mediapipe==0.10.11
pandas==2.2.2
numpy==1.26.4
scikit-learn==1.4.2
matplotlib==3.8.4
seaborn==0.13.2
tqdm==4.66.4
grad-cam==1.5.0  # PyPI name of the pytorch-grad-cam project (import name: pytorch_grad_cam)
Pillow==10.3.0
pyyaml==6.0.1
albumentations==1.4.3
einops==0.7.0
wandb==0.17.0
datasets==2.19.0
huggingface_hub==0.23.0
av==12.0.0
```
- [ ] **0.4** Run `pip install -r requirements.txt` inside the venv and confirm zero errors
- [ ] **0.5** Smoke-test: `python -c "import torch; import timm; import mediapipe; import datasets; print('OK')"`
- [ ] **0.6** Update `scripts/run_local.sh` to activate `.venv311` before any Python calls
- [ ] **0.7** One-time HuggingFace login (only needed once per machine):
```bash
huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens
# Token needs Read access only
```
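
The version check in task 0.1 and the import smoke test in task 0.5 can be combined into one script. The following is a minimal sketch, not part of the existing scaffold: the helper name `check_environment` and the module list are assumptions made for illustration. It only locates modules and does not import them, so it runs quickly even when PyTorch is installed.

```python
import importlib.util
import sys

# Import names (not PyPI names) of the packages this plan relies on.
REQUIRED_MODULES = ["torch", "timm", "mediapipe", "datasets", "cv2"]


def check_environment(required=REQUIRED_MODULES, expected=(3, 11)):
    """Return (python_ok, missing) without importing any heavy package."""
    python_ok = sys.version_info[:2] == expected
    # find_spec only locates a top-level module; it does not execute it.
    missing = [name for name in required if importlib.util.find_spec(name) is None]
    return python_ok, missing


if __name__ == "__main__":
    python_ok, missing = check_environment()
    print("Python 3.11:", "OK" if python_ok else f"found {sys.version.split()[0]}")
    print("Missing modules:", ", ".join(missing) or "none")
```

Run it inside the activated `.venv311`; any module listed as missing points to a failed line in `requirements.txt`.
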

---

## Phase 1: Dataset via HuggingFace Streaming (NO DOWNLOAD NEEDED)

**Goal:** Stream FaceForensics++ c23 videos directly from HuggingFace one at a time, extract eye sequences into compressed `.npz` files, and discard each video. A raw video only touches disk as a short-lived temp file while it is decoded; none are kept.

### How Streaming Works

```
HuggingFace server
  ↓ sends video #1 to RAM (temp, ~5MB)
  ↓ MediaPipe extracts eye crops + EAR signal
  ↓ saves one compressed .npz per 16-frame sequence to data/processed/
  ↓ video is gone from memory
  ↓ repeat for video #2, #3 ... #400
```

At the end, `data/processed/` holds one `.npz` per 16-frame sequence (usually several per video) and zero raw videos. The plan budgets ~100-300MB in total; confirm the real figure with `du` in task 1.3, because a 16×224×224×3 `uint8` sequence is about 2.4MB before compression.
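
The EAR (Eye Aspect Ratio) signal in the flow above compares the eye's vertical opening with its horizontal width, using six landmarks p1..p6 per eye: EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|). A small worked example follows. The function name `eye_aspect_ratio` and the pixel coordinates are made up for illustration and are not taken from the scaffold.

```python
import math


def eye_aspect_ratio(points):
    """points: six (x, y) pixel coordinates ordered p1..p6.

    p1 and p4 are the eye corners, p2 and p3 lie on the upper lid,
    p5 and p6 lie on the lower lid (p2 pairs with p6, p3 with p5).
    """
    p1, p2, p3, p4, p5, p6 = points
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal + 1e-6)  # epsilon avoids division by zero


# Synthetic eye that is 30 px wide: 12 px tall when open, 2 px tall when closed.
open_eye = [(0, 0), (10, -6), (20, -6), (30, 0), (20, 6), (10, 6)]
closed_eye = [(0, 0), (10, -1), (20, -1), (30, 0), (20, 1), (10, 1)]

print(round(eye_aspect_ratio(open_eye), 3))    # 0.4
print(round(eye_aspect_ratio(closed_eye), 3))  # 0.067
```

The closed eye falls well below the `EAR < 0.2` threshold that task 3.4 uses to mark a blink.
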
### Dataset

- **Source:** `bitmind/FaceForensicsC23` on HuggingFace
- **URL:** https://huggingface.co/datasets/bitmind/FaceForensicsC23
- **Contents:** 7,000 MP4 videos: 1,000 real + 6,000 deepfakes (Deepfakes, Face2Face, FaceShifter, FaceSwap, NeuralTextures, DeepFakeDetection), c23 compression
- **We use:** 200 real (`/Real/`) + 200 fake (`/Deepfakes/`) = 400 videos total
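
A rough budget for those 400 videos can be worked out before running anything. The sketch below assumes the sampling rule used by the streaming script in task 1.1 (keep every `int(fps / 10)`-th frame, then cut non-overlapping 16-frame sequences) and assumes a face is found in every sampled frame. The helper names and the example clip length are hypothetical.

```python
def sequences_per_video(duration_s, fps, target_fps=10, seq_len=16):
    """Upper bound on the number of sequences one clip can produce."""
    frame_interval = max(1, int(fps / target_fps))  # keep every n-th frame
    total_frames = int(duration_s * fps)
    sampled = len(range(0, total_frames, frame_interval))
    return sampled // seq_len  # non-overlapping windows, remainder dropped


def uncompressed_bytes(seq_len=16, size=224, channels=3):
    """Size of one uint8 frame stack before np.savez_compressed."""
    return seq_len * size * size * channels


print(sequences_per_video(duration_s=20, fps=25))  # 15
print(uncompressed_bytes())                        # 2408448
```

A 20-second clip at 25 fps therefore yields at most 15 sequences of about 2.4MB each before compression, which is why task 1.3 measures the real disk usage instead of relying on an estimate.
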
### Tasks

- [ ] **1.1** Create `src/data/stream_ff_dataset.py`, a NEW script that replaces the old download-based `build_metadata.py` + `extract_frames.py` flow:
```python
"""
Stream FaceForensics++ c23 from HuggingFace.

Fetches one video at a time, extracts eye sequences, saves .npz files,
and discards the video. A raw video only exists on disk as a short-lived
temp file while OpenCV decodes it.

Usage:
    python -m src.data.stream_ff_dataset \
        --out-root data/processed \
        --num-real 200 \
        --num-fake 200
"""
import argparse
import csv
import os
import tempfile
from pathlib import Path

import cv2
import mediapipe as mp
import numpy as np
from datasets import load_dataset
from tqdm import tqdm

HF_DATASET = "bitmind/FaceForensicsC23"
REAL_PATH_MARKER = "/Real/"
FAKE_PATH_MARKER = "/Deepfakes/"  # use only the Deepfakes subfolder, not all 6

# MediaPipe FaceMesh indices ordered [p1..p6]: corner, two upper-lid points,
# opposite corner, two lower-lid points.
LEFT_EYE = [33, 160, 158, 133, 153, 144]
RIGHT_EYE = [362, 385, 387, 263, 373, 380]


def compute_ear(landmarks, eye_indices, w, h):
    """Compute the Eye Aspect Ratio (EAR) from MediaPipe landmarks.

    Landmarks are normalized per axis, so they are scaled to pixels first;
    otherwise the ratio would depend on the frame's aspect ratio.
    """
    p = [np.array([landmarks[i].x * w, landmarks[i].y * h]) for i in eye_indices]
    A = np.linalg.norm(p[1] - p[5])
    B = np.linalg.norm(p[2] - p[4])
    C = np.linalg.norm(p[0] - p[3])
    return (A + B) / (2.0 * C + 1e-6)


def extract_sequences_from_video_bytes(video_bytes, label, video_id, seq_len=16):
    """
    Given raw video bytes, extract non-overlapping eye-region sequences.
    Returns list of dicts: {'frames': (T,H,W,3), 'ear': (T,), 'label': int, 'video_id': str}
    """
    # Write to a temp file so OpenCV can read it
    with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as f:
        f.write(video_bytes)
        tmp_path = f.name

    all_frames, all_ears = [], []
    face_mesh = mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False, max_num_faces=1, refine_landmarks=True
    )
    cap = cv2.VideoCapture(tmp_path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS) or 30
        frame_interval = max(1, int(fps / 10))  # sample at ~10fps
        frame_idx = 0
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx % frame_interval == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                result = face_mesh.process(rgb)
                # Frames without a detected face are skipped, which can leave
                # small temporal gaps inside a sequence.
                if result.multi_face_landmarks:
                    lm = result.multi_face_landmarks[0].landmark
                    h, w = frame.shape[:2]
                    # Compute EAR (average of both eyes)
                    left_ear = compute_ear(lm, LEFT_EYE, w, h)
                    right_ear = compute_ear(lm, RIGHT_EYE, w, h)
                    ear = (left_ear + right_ear) / 2.0
                    # Crop eye region: bounding box around both eyes, 20px margin
                    eye_pts = [lm[i] for i in LEFT_EYE + RIGHT_EYE]
                    xs = [int(p.x * w) for p in eye_pts]
                    ys = [int(p.y * h) for p in eye_pts]
                    x1, x2 = max(0, min(xs) - 20), min(w, max(xs) + 20)
                    y1, y2 = max(0, min(ys) - 20), min(h, max(ys) + 20)
                    crop = rgb[y1:y2, x1:x2]
                    if crop.size > 0:
                        crop = cv2.resize(crop, (224, 224))  # ViT input size
                        all_frames.append(crop)
                        all_ears.append(ear)
            frame_idx += 1
    finally:
        cap.release()
        face_mesh.close()
        os.unlink(tmp_path)  # delete the temp file even if decoding fails

    # Slice into non-overlapping sequences of length seq_len
    sequences = []
    for i in range(0, len(all_frames) - seq_len + 1, seq_len):
        frames = np.stack(all_frames[i:i + seq_len]).astype(np.uint8)
        ears = np.array(all_ears[i:i + seq_len], dtype=np.float32)
        sequences.append({
            'frames': frames,
            'ear': ears,
            'label': label,
            'video_id': f"{video_id}_seq{i}"
        })
    return sequences


def stream_and_extract(out_root, num_real=200, num_fake=200, seq_len=16, seed=42):
    out_root = Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)

    # Stream dataset: never downloads the full archive
    ds = load_dataset(HF_DATASET, streaming=True, split="train")
    real_count, fake_count = 0, 0
    metadata_rows = []
    pbar = tqdm(total=num_real + num_fake, desc="Streaming videos")

    for item in ds:
        # Assumption: each streamed item exposes item['video'] as a dict with
        # 'path' and 'bytes' keys. Inspect next(iter(ds)) once and adapt if not.
        video = item.get('video')
        if not isinstance(video, dict) or video.get('bytes') is None:
            continue
        video_path_str = str(video.get('path') or '')
        is_real = REAL_PATH_MARKER in video_path_str and real_count < num_real
        is_fake = FAKE_PATH_MARKER in video_path_str and fake_count < num_fake
        if not is_real and not is_fake:
            continue

        label = 0 if is_real else 1
        video_id = Path(video_path_str).stem
        sequences = extract_sequences_from_video_bytes(
            video['bytes'], label, video_id, seq_len
        )
        for seq in sequences:
            npz_path = out_root / f"{seq['video_id']}.npz"
            np.savez_compressed(
                npz_path,
                frames=seq['frames'],
                ear=seq['ear'],
                label=np.array(seq['label']),
                video_id=np.array(seq['video_id'])
            )
            metadata_rows.append({
                'npz_path': str(npz_path),
                'label': label,
                'video_id': video_id,  # source video, used for the split below
                'split': 'train'       # will be reassigned below
            })

        if is_real:
            real_count += 1
        else:
            fake_count += 1
        pbar.update(1)
        if real_count >= num_real and fake_count >= num_fake:
            break
    pbar.close()

    # Assign splits: 70% train, 15% val, 15% test (by video_id, not sequence).
    # Sorting plus a seeded generator makes the split reproducible across runs.
    unique_ids = sorted({r['video_id'] for r in metadata_rows})
    np.random.default_rng(seed).shuffle(unique_ids)
    n = len(unique_ids)
    train_ids = set(unique_ids[:int(0.7 * n)])
    val_ids = set(unique_ids[int(0.7 * n):int(0.85 * n)])
    for row in metadata_rows:
        if row['video_id'] in train_ids:
            row['split'] = 'train'
        elif row['video_id'] in val_ids:
            row['split'] = 'val'
        else:
            row['split'] = 'test'

    # Write metadata CSV
    csv_path = Path('data/metadata.csv')
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['npz_path', 'label', 'video_id', 'split'])
        writer.writeheader()
        writer.writerows(metadata_rows)

    print(f"\nDone! {real_count} real + {fake_count} fake videos processed.")
    print(f"Total sequences: {len(metadata_rows)}")
    print(f"Metadata written to: {csv_path}")
    print(f"Sequences saved to: {out_root}")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--out-root', default='data/processed')
    parser.add_argument('--num-real', type=int, default=200)
    parser.add_argument('--num-fake', type=int, default=200)
    parser.add_argument('--seq-len', type=int, default=16)
    parser.add_argument('--seed', type=int, default=42)
    args = parser.parse_args()
    stream_and_extract(args.out_root, args.num_real, args.num_fake, args.seq_len, args.seed)
```
- [ ] **1.2** Run the streaming script:
```bash
source .venv311/bin/activate
python -m src.data.stream_ff_dataset \
  --out-root data/processed \
  --num-real 200 \
  --num-fake 200
```
This will run for ~20-60 minutes depending on internet speed. It streams each video, processes it, saves that video's `.npz` sequences, and moves on. Your terminal will show a progress bar.
- [ ] **1.3** When done, verify the output:
```bash
ls data/processed/ | wc -l   # one .npz per sequence, so several per video
du -sh data/processed/       # compare with the ~100-300MB plan estimate
python -c "
import os
import numpy as np
name = sorted(os.listdir('data/processed'))[0]
d = np.load(os.path.join('data/processed', name))
print('frames:', d['frames'].shape)  # expect (16, 224, 224, 3)
print('ear:', d['ear'].shape)        # expect (16,)
print('label:', d['label'])          # expect 0 or 1
"
```
- [ ] **1.4** Verify that `data/metadata.csv` has rows with `npz_path`, `label`, `video_id`, `split` columns and a healthy mix of train/val/test rows
- [ ] **1.5** Update `src/data/dataset.py` to read from `data/metadata.csv` (pointing to `.npz` files) instead of from raw video paths. The `__getitem__` contract remains unchanged:
```python
{'frames': Tensor[T,3,224,224], 'ear': Tensor[T], 'label': int}
```
- [ ] **1.6** Update `configs/base.yaml`:
```yaml
data:
  metadata_csv: data/metadata.csv
  processed_root: data/processed
  seq_len: 16
  img_size: 224
```
- [ ] **1.7** Add to `.gitignore`:
```
data/processed/
data/raw/
data/metadata.csv
outputs/
*.npz
*.pt
.venv311/
```
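
The "healthy mix" check in task 1.4 can be made concrete. The sketch below is an illustration only: the helper name `summarise_metadata` is hypothetical, and it assumes the four CSV columns written by the streaming script. It counts rows per split and label, and lists any source video whose sequences ended up in more than one split, which would leak test data into training.

```python
import csv
from collections import Counter, defaultdict


def summarise_metadata(csv_path):
    """Return (rows per (split, label), video_ids that appear in several splits)."""
    split_counts = Counter()
    splits_by_video = defaultdict(set)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            split_counts[(row["split"], row["label"])] += 1
            splits_by_video[row["video_id"]].add(row["split"])
    leaked = sorted(vid for vid, splits in splits_by_video.items() if len(splits) > 1)
    return split_counts, leaked
```

Calling `summarise_metadata("data/metadata.csv")` after the streaming run should return an empty `leaked` list and non-zero counts for both labels in every split.
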

---

## Phase 2: Dataset Loader Verification

**Goal:** Confirm `src/data/dataset.py` correctly reads the `.npz` files produced by streaming.

### Tasks

- [ ] **2.1** Open `src/data/dataset.py` and update it to read from `metadata.csv` instead of raw video paths (the same change as task 1.5). Each row's `npz_path` points directly to a processed sequence file.
- [ ] **2.2** Add `albumentations` augmentations for the training split only. Apply the same sampled parameters to all 16 frames of a sequence (for example with `A.ReplayCompose`), so that a flip or colour shift does not change from frame to frame:
```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, p=0.5),
    A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
    A.ImageCompression(quality_lower=70, quality_upper=100, p=0.3),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
```
- [ ] **2.3** Smoke-test the DataLoader:
```python
from src.data.dataset import EyeBlinkDataset

ds = EyeBlinkDataset('data/metadata.csv', split='train')
sample = ds[0]
assert sample['frames'].shape == (16, 3, 224, 224)
assert sample['ear'].shape == (16,)
assert sample['label'] in [0, 1]
print("DataLoader OK")
```
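
The `.npz` files store frames as `(T, H, W, 3)` `uint8`, while the `__getitem__` contract expects `(T, 3, 224, 224)` floats. For the val/test splits, where no augmentation is applied, the conversion can be done directly. The sketch below uses NumPy only; the function name `to_model_layout` is hypothetical, and it is meant to mirror what `A.Normalize` followed by `ToTensorV2` does, applied to a whole sequence at once.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_model_layout(frames_uint8):
    """(T, H, W, 3) uint8 in [0, 255] -> (T, 3, H, W) float32, ImageNet-normalised."""
    x = frames_uint8.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD  # broadcasts over the last (channel) axis
    return np.ascontiguousarray(x.transpose(0, 3, 1, 2))  # channels-last -> channels-first


frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)
print(to_model_layout(frames).shape)  # (16, 3, 224, 224)
```

The result can be wrapped with `torch.from_numpy` inside `__getitem__`.
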

---

## Phase 3: Model Architecture Verification & Fix

**Goal:** Ensure the LRCN + ViT hybrid model is correctly implemented and matches the research proposal.

### Architecture Spec (from proposal)

```
Input: eye-region sequence (T=16 frames, each 224×224 RGB) + EAR signal (T floats)
  ↓
ViT Backbone (timm: vit_small_patch16_224, pretrained=True)
  → Per-frame [CLS] token → shape (T, 384)
  ↓
LSTM Temporal Encoder
  → Hidden size: 256, Num layers: 2, Dropout: 0.3
  ↓
Blink Dynamics Head
  → Concatenate LSTM output + EAR
  → FC(257, 128) → ReLU
  → Blink timing constraint (0.1-0.4s window)
  ↓
Classifier Head
  → FC(256, 128) → ReLU → Dropout(0.5) → FC(128, 2)
  → Output: [real_logit, fake_logit]
```
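
The blink timing constraint in the spec translates into frame counts. Frames are sampled at roughly 10 fps, so one sampled frame covers about 0.1s and a plausible 0.1-0.4s blink spans about 1 to 4 consecutive low-EAR frames. The sketch below shows one way to measure this from an EAR series. The function names are hypothetical, the 0.2 threshold is taken from task 3.4, and the exact frame rate depends on the source video (a 25 fps clip is sampled at 12.5 fps), so `fps` should be passed in explicitly.

```python
def blink_durations(ear, fps=10.0, threshold=0.2):
    """Duration in seconds of every run of consecutive frames with EAR < threshold."""
    durations, run = [], 0
    for value in ear:
        if value < threshold:
            run += 1
        elif run:
            durations.append(run / fps)
            run = 0
    if run:  # the series ended while the eye was still closed
        durations.append(run / fps)
    return durations


def outside_window(durations, lo=0.1, hi=0.4):
    """Blinks whose duration violates the plausible window."""
    return [d for d in durations if not (lo <= d <= hi)]


ear = [0.3, 0.3, 0.1, 0.1, 0.3] + [0.1] * 6 + [0.3]
print(blink_durations(ear))                  # [0.2, 0.6]
print(outside_window(blink_durations(ear)))  # [0.6]
```

Here the first blink (2 frames) is plausible, while the second (6 frames, 0.6s) is the kind of event the regularizer in task 3.4 is meant to penalize.
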
### Tasks

- [ ] **3.1** Open `src/models/backbones.py` and verify that `build_backbone(config)` returns a timm ViT. For `vit_small_patch16_224` the embed dim is 384.
- [ ] **3.2** Open `src/models/lrcn_vit.py` and verify the forward pass. Frames arrive as `(B, T, 3, 224, 224)`. Reshape to `(B*T, 3, 224, 224)` before the ViT, then reshape back to `(B, T, embed_dim)` before the LSTM. Also confirm how the two heads connect: the spec lists a blink head ending in FC(257, 128) followed by a classifier starting with FC(256, 128), so the classifier's input width has to be reconciled with the code.
- [ ] **3.3** Add **attention consistency loss**: KL-divergence between adjacent frame ViT attention maps, weighted by `lambda_attn`.
- [ ] **3.4** Add **blink timing regularizer**: penalize uncertain predictions when EAR < 0.2 but blink duration is outside 0.1-0.4s. Weight: `lambda_blink`.
- [ ] **3.5** Add unit test in `tests/test_model.py`:
```python
import torch
from src.models.lrcn_vit import LRCNViT

def test_forward_output_shape(config):  # `config`: assumed pytest fixture that loads the model config
    model = LRCNViT(config)
    dummy = {'frames': torch.randn(2, 16, 3, 224, 224), 'ear': torch.rand(2, 16)}
    out = model(dummy)
    assert out['logits'].shape == (2, 2)
```
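
The attention consistency loss in task 3.3 can be prototyped on plain lists before moving it into PyTorch. The sketch below is a reference implementation under two assumptions that the plan does not fix: each attention map is already flattened and normalised to sum to 1, and the divergence is taken from frame t to frame t+1. The function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def attention_consistency(attn_maps):
    """Mean KL divergence between each frame's attention map and the next one.

    attn_maps: list of at least two flattened, normalised attention maps.
    """
    pairs = list(zip(attn_maps[:-1], attn_maps[1:]))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


stable = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
jumpy = [[0.9, 0.1], [0.1, 0.9]]
print(attention_consistency(stable))           # 0.0
print(round(attention_consistency(jumpy), 3))  # 1.758
```

In the model itself the same quantity can be computed with `torch.nn.functional.kl_div`, which expects log-probabilities as its first argument.
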

---

## Phase 4: Training Loop Fix & Wire-Up

**Goal:** Get the full training loop running end-to-end with adversarial training and all loss components.

### Tasks

- [ ] **4.1** Open `src/train/train.py` and verify that it loads the config, DataLoader, model, AdamW, and LR scheduler, and saves `outputs/best.pt` on val AUC improvement.
- [ ] **4.2** **Wire in `wandb`**: if `config.wandb.enabled: true`, call `wandb.init()` and log metrics each epoch.
- [ ] **4.3** Total loss formula:
```
L_total = L_ce(clean)
        + alpha * L_ce(adversarial)
        + lambda_attn * L_attn_consistency
        + lambda_blink * L_blink_regularizer
```
- [ ] **4.4** Open `src/train/adversarial.py` and verify PGD: `eps=8/255`, `steps=10`, applied only to eye-region frames.
- [ ] **4.5** Add gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`
- [ ] **4.6** Update `configs/train/aat_pgd.yaml`:
```yaml
epochs: 30
batch_size: 16
lr: 3.0e-4            # keep the decimal point: PyYAML reads a bare 3e-4 as a string
weight_decay: 1.0e-4
alpha: 0.5
lambda_attn: 0.1
lambda_blink: 0.05
pgd_eps: 0.03137      # 8/255
pgd_steps: 10
wandb:
  enabled: false
  project: "deepfake-eye-blink"
```
- [ ] **4.7** Smoke-train: 2 epochs on 50 samples and confirm zero errors.
- [ ] **4.8** Full training: `python -m src.train.train --config configs/train/aat_pgd.yaml`
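
For checking `src/train/adversarial.py` in task 4.4, it helps to have the PGD update written out in its simplest form. The sketch below runs L-infinity PGD on a toy logistic model with plain Python lists and is not the project's implementation. The step size of 2/255 is an assumption (the plan only fixes `eps=8/255` and `steps=10`), and the helper names are hypothetical.

```python
import math


def pgd_linf(x, grad_fn, eps=8 / 255, step_size=2 / 255, steps=10):
    """L-infinity PGD: take a signed-gradient ascent step on the loss, then
    project back into the eps-ball around x and into the pixel range [0, 1]."""
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            v = x_adv[i] + step_size * sign
            v = min(max(v, x[i] - eps), x[i] + eps)  # project into the eps-ball
            x_adv[i] = min(max(v, 0.0), 1.0)         # keep a valid pixel value
    return x_adv


def make_logistic_loss_grad(weights, label):
    """Gradient of binary cross-entropy w.r.t. the input of a logistic model."""
    def grad_fn(x):
        z = sum(w * xi for w, xi in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-z))
        return [(p - label) * w for w in weights]
    return grad_fn


x = [0.5, 0.5]
x_adv = pgd_linf(x, make_logistic_loss_grad(weights=[2.0, -1.0], label=1))
print([round(a - b, 4) for a, b in zip(x_adv, x)])  # [-0.0314, 0.0314]
```

Each coordinate ends up exactly `eps` away from the clean input, in the direction that increases the loss. In the real training loop the same loop runs on the frame tensor with gradients from `torch.autograd`.
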

---

## Phase 5: Evaluation & Ablation

**Goal:** Produce evaluation numbers and ablation table for the thesis.

### Tasks

- [ ] **5.1** Open `src/eval/evaluate.py` and verify that it outputs Accuracy, Precision, Recall, F1, AUC.
- [ ] **5.2** Run: `python -m src.eval.evaluate --checkpoint outputs/best.pt --config configs/train/aat_pgd.yaml`
- [ ] **5.3** Open `src/eval/ablation.py` and confirm 4 configs: Full / No AAT / No ViT / No blink regularizer.
- [ ] **5.4** Run ablation: `python -m src.eval.ablation --config configs/train/aat_pgd.yaml`
- [ ] **5.5** Open `src/eval/plots.py` and confirm it generates `confusion_matrix.png` and `roc_curve.png`.
- [ ] **5.6** Fill in `docs/results_template.md` with actual numbers.
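
When verifying the numbers from task 5.1, a small dependency-free reference helps to cross-check `evaluate.py`. The sketch below is illustrative (the function name `binary_metrics` is hypothetical); it treats label 1 (fake) as the positive class and computes AUC as the fraction of (fake, real) pairs in which the fake sample receives the higher score, with ties counting half.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, recall, F1 and AUC for labels in {0, 1}."""
    y_pred = [int(s >= threshold) for s in y_score]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}


print(binary_metrics([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9]))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'auc': 0.75}
```

The pinned scikit-learn offers the same metrics through `accuracy_score`, `precision_recall_fscore_support`, and `roc_auc_score`, which are the better choice inside `evaluate.py`.
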

---

## Phase 6: Inference API

**Goal:** FastAPI server that accepts an uploaded video and returns a prediction.

### New files

```
api/
  main.py
  inference.py      # reuses the same eye extraction logic from stream_ff_dataset.py
  schemas.py
  requirements.txt
```

### Tasks

- [ ] **6.1** `api/inference.py`: reuse `extract_sequences_from_video_bytes()` from `stream_ff_dataset.py`. Load the model once, run a forward pass on all sequences, and average predictions across sequences.
- [ ] **6.2** `api/main.py`: `/predict` endpoint (POST, multipart file upload) + `/health` endpoint.
- [ ] **6.3** Load the model at startup via FastAPI `lifespan`, not per-request.
- [ ] **6.4** Add CORS for `http://localhost:5173`.
- [ ] **6.5** `api/requirements.txt`: `fastapi>=0.111.0`, `uvicorn[standard]`, `python-multipart>=0.0.9`
- [ ] **6.6** Test: `curl -X POST http://localhost:8000/predict -F "file=@test_video.mp4"`
- [ ] **6.7** `scripts/start_api.sh`:
```bash
source .venv311/bin/activate
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```
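
The "average predictions across sequences" step in task 6.1 can be sketched independently of the model. The example below assumes that the average is taken over per-sequence softmax probabilities (the plan does not say whether to average logits or probabilities) and that the model emits `[real_logit, fake_logit]` as in the architecture spec. The function and response field names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_video_prediction(sequence_logits, threshold=0.5):
    """sequence_logits: one [real_logit, fake_logit] pair per 16-frame sequence."""
    if not sequence_logits:
        # Happens when no face was detected or the clip is shorter than one sequence.
        raise ValueError("no eye sequences were extracted from the video")
    fake_probs = [softmax(pair)[1] for pair in sequence_logits]
    mean_fake = sum(fake_probs) / len(fake_probs)
    return {
        "label": "FAKE" if mean_fake >= threshold else "REAL",
        "fake_probability": mean_fake,
        "per_sequence": fake_probs,
    }


result = aggregate_video_prediction([[2.0, 0.0], [1.0, 0.0]])
print(result["label"], round(result["fake_probability"], 3))  # REAL 0.194
```

Raising on an empty list lets the `/predict` endpoint return a clear client error for videos in which no eye sequence could be extracted.
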

---

## Phase 7: Demo Frontend

**Goal:** React web UI for the defence demonstration.

### Stack: React + Vite + Tailwind + Recharts

```
frontend/
  src/
    App.jsx
    components/
      VideoUploader.jsx
      ResultCard.jsx
      FrameChart.jsx
      AttentionViewer.jsx
  index.html
  package.json
  vite.config.js
```

### Tasks

- [ ] **7.1** `cd frontend && npm create vite@latest . -- --template react && npm install`
- [ ] **7.2** `npm install tailwindcss recharts axios`
- [ ] **7.3** `VideoUploader.jsx`: drag-and-drop or file picker for `.mp4/.avi/.mov`, video preview, "Analyse Video" button, loading spinner.
- [ ] **7.4** `ResultCard.jsx`: REAL (green) / FAKE (red) verdict badge, confidence %, blink rate stat.
- [ ] **7.5** `FrameChart.jsx`: Recharts line chart of per-frame fake probability, frames above 0.5 highlighted red.
- [ ] **7.6** `AttentionViewer.jsx`: Grad-CAM attention overlay image from API response.
- [ ] **7.7** Proxy in `vite.config.js`: `/predict` → `http://localhost:8000/predict`
- [ ] **7.8** `frontend/.env`: `VITE_API_URL=http://localhost:8000`
- [ ] **7.9** `scripts/start_frontend.sh`:
```bash
cd frontend && npm run dev
```

---
## Phase 8: Integration & Final QA

- [ ] **8.1** Run API + frontend together. As a test, upload one of the FaceForensics++ videos that the `.npz` files were built from (fetch a single sample separately, since raw videos are not kept on disk).
- [ ] **8.2** Test with a real webcam recording; it should return REAL.
- [ ] **8.3** Fix any CORS issues.
- [ ] **8.4** Create `docs/README_DEMO.md`:
```
1. source .venv311/bin/activate
2. ./scripts/start_api.sh (Terminal 1)
3. ./scripts/start_frontend.sh (Terminal 2)
4. Open http://localhost:5173
```
- [ ] **8.5** Document exact setup commands for a fresh machine.

---
## Project Directory Structure (Final)

```
deepfake-detector/
├── configs/
│   ├── base.yaml
│   ├── model/lrcn_vit.yaml
│   └── train/aat_pgd.yaml
├── src/
│   ├── data/
│   │   ├── stream_ff_dataset.py    ← NEW (replaces download-based flow)
│   │   ├── extract_eye_sequences.py
│   │   └── dataset.py
│   ├── models/
│   │   ├── backbones.py
│   │   └── lrcn_vit.py
│   ├── train/
│   │   ├── train.py
│   │   └── adversarial.py
│   ├── eval/
│   │   ├── evaluate.py
│   │   ├── ablation.py
│   │   └── plots.py
│   ├── viz/
│   │   └── attention_maps.py
│   └── utils.py
├── api/
│   ├── main.py
│   ├── inference.py
│   ├── schemas.py
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.jsx
│   │   └── components/
│   ├── index.html
│   ├── package.json
│   └── vite.config.js
├── data/
│   ├── processed/      ← .npz files only (~200MB), gitignored
│   └── metadata.csv    ← generated, gitignored
├── outputs/
│   ├── best.pt
│   ├── confusion_matrix.png
│   └── roc_curve.png
├── scripts/
│   ├── run_local.sh
│   ├── run_cloud.sh
│   ├── start_api.sh
│   └── start_frontend.sh
├── tests/
│   └── test_model.py
├── docs/
│   ├── reproducibility_checklist.md
│   ├── results_template.md
│   └── README_DEMO.md
├── .gitignore
├── requirements.txt
└── README.md
```

---
## Suggestions & Overrides

### ⚠️ Old files to DEPRECATE (keep but do not use)

`src/data/build_metadata.py` and `src/data/extract_frames.py` were written for a local download workflow. They are superseded by `stream_ff_dataset.py`. Keep them in the repo for reference but do not run them.

### ⚠️ ViT Input Resolution

Frames are resized to 224×224 directly in the streaming script. No resizing is needed elsewhere.

### ⚠️ Internet Required for Phase 1

The streaming script needs internet during the ~20-60 min preprocessing run. After that, everything runs offline from the `.npz` files.

### ⚠️ Pre-trained Checkpoint Option

Use `timm`'s pretrained ViT weights (ImageNet). Fine-tuning for 5-10 epochs on the 400 videos should be enough for a compelling defence demo; the config in task 4.6 allows up to 30.

### ✅ Frontend: Keep it Simple

Single-page upload → result. No auth, no database needed.