DeepFake Eye-Blink Detection: Cursor AI Build Plan
Project Overview
An Enhanced Eye-Blinking LRCN (Long-term Recurrent Convolutional Network) for DeepFake detection using Attentive Adversarial Training (AAT) with a Vision Transformer (ViT) backbone. The research is by Alina Chikwado Godsaves under the supervision of Mr. Akanji.
The goal is to detect deepfake videos by analyzing unnatural eye-blinking patterns and fine-grained ocular artifacts (eyelid dynamics, pupil reflections) using a hybrid CNN/LSTM + ViT model hardened with adversarial training.
Python version: 3.11 (the pinned torch==2.2.2 does not install on 3.13)
What Is Already Scaffolded (Do NOT Recreate)
All files below exist but need to be verified, completed, and wired together:
| Area | Files | Status |
|---|---|---|
| Config system | configs/base.yaml, configs/model/lrcn_vit.yaml, configs/train/aat_pgd.yaml | ✅ Exists |
| Data pipeline | src/data/build_metadata.py, src/data/extract_frames.py, src/data/extract_eye_sequences.py, src/data/dataset.py | ✅ Exists, verify |
| Model | src/models/backbones.py, src/models/lrcn_vit.py | ✅ Exists, verify |
| Training | src/train/train.py, src/train/adversarial.py | ✅ Exists, verify |
| Evaluation | src/eval/evaluate.py, src/eval/ablation.py, src/eval/plots.py | ✅ Exists, verify |
| Explainability | src/viz/attention_maps.py | ✅ Exists |
| Scripts | scripts/run_local.sh, scripts/run_cloud.sh | ✅ Exists |
| Docs | docs/reproducibility_checklist.md, docs/results_template.md | ✅ Exists |
Phase 0: Environment & Dependency Fix (FIRST PRIORITY)
Goal: Get a working Python 3.11 venv with all ML/CV deps installed.
Tasks
- 0.1 Confirm `python3.11` is available, or install via `pyenv` / system package manager
- 0.2 Create venv: `python3.11 -m venv .venv311 && source .venv311/bin/activate`
- 0.3 Pin exact working versions in `requirements.txt`:

      torch==2.2.2
      torchvision==0.17.2
      timm==0.9.16
      opencv-python-headless==4.9.0.80
      mediapipe==0.10.11
      pandas==2.2.2
      numpy==1.26.4
      scikit-learn==1.4.2
      matplotlib==3.8.4
      seaborn==0.13.2
      tqdm==4.66.4
      grad-cam==1.5.0
      Pillow==10.3.0
      pyyaml==6.0.1
      albumentations==1.4.3
      einops==0.7.0
      wandb==0.17.0
      datasets==2.19.0
      huggingface_hub==0.23.0
      av==12.0.0

  The Grad-CAM library is published on PyPI as `grad-cam`; its import name is `pytorch_grad_cam`.
- 0.4 Run `pip install -r requirements.txt` inside venv and confirm zero errors
- 0.5 Smoke-test: `python -c "import torch; import timm; import mediapipe; import datasets; print('OK')"`
- 0.6 Update `scripts/run_local.sh` to activate `.venv311` before any python calls
- 0.7 One-time HuggingFace login (only needed once per machine):

      huggingface-cli login
      # Paste your token from https://huggingface.co/settings/tokens
      # Token needs Read access only
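The one-line import smoke test in 0.5 stops at the first failure. A minimal sketch that reports every missing module in one pass might look like this; the helper names `missing_modules` and `python_is_311` are illustrative, and the module list simply mirrors the imports used in this plan.

```python
import importlib.util
import sys

# Import names (not pip names) used by the plan; cv2 ships with opencv-python-headless.
REQUIRED_MODULES = ["torch", "timm", "mediapipe", "datasets", "cv2", "numpy"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_311():
    """True when the running interpreter is Python 3.11.x."""
    return sys.version_info[:2] == (3, 11)


if __name__ == "__main__":
    print("Python 3.11:", python_is_311())
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it inside the activated `.venv311`; an empty "missing" list is equivalent to the `print('OK')` check.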
Phase 1: Dataset via HuggingFace Streaming (NO DOWNLOAD NEEDED)
Goal: Stream FaceForensics++ c23 videos from HuggingFace one at a time, extract eye sequences into small .npz files, and discard each video. No raw videos are kept on disk; each one exists only as a temporary file while OpenCV decodes it.
How Streaming Works
    HuggingFace server
      ↓ sends video #1 (~5MB, held only temporarily)
      ↓ MediaPipe extracts eye crops + EAR signal
      ↓ saves small .npz files (~50KB each) to data/processed/, one per 16-frame sequence
      ↓ video is discarded
      ↓ repeat for video #2, #3 ... #400

At the end: several hundred .npz files totalling ~100-300MB. Zero raw videos on disk.
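As a rough sanity check on output volume, the number of sequences a single video yields can be estimated from the sampling rule this plan uses (sample at about 10 fps, cut into non-overlapping 16-frame windows). The sketch below is an upper bound: it assumes a face is detected in every sampled frame, and the function name `estimate_sequences` is hypothetical.

```python
import math


def estimate_sequences(duration_s, fps, seq_len=16, target_fps=10):
    """Estimate how many non-overlapping sequences one video yields.

    Assumes every sampled frame contains a detectable face (upper bound).
    """
    total_frames = int(duration_s * fps)
    frame_interval = max(1, int(fps / target_fps))  # same rule as the extractor
    sampled = math.ceil(total_frames / frame_interval)  # frames 0, k, 2k, ...
    return sampled // seq_len


print(estimate_sequences(10, 30))  # 300 frames -> 100 sampled -> 6 sequences
print(estimate_sequences(10, 25))  # 250 frames -> 125 sampled -> 7 sequences
```

Note that at 25 fps the integer interval is 2, so the effective sampling rate is 12.5 fps rather than 10. Videos shorter than 16 sampled frames yield no sequence at all.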
Dataset
Source: bitmind/FaceForensicsC23 on HuggingFace
URL: https://huggingface.co/datasets/bitmind/FaceForensicsC23
Contents: 7,000 MP4 videos: 1,000 real + 6,000 deepfakes (Deepfakes, Face2Face, FaceShifter, FaceSwap, NeuralTextures, DeepFakeDetection), c23 compression
We use: 200 real (/Real/) + 200 fake (/Deepfakes/) = 400 videos total
Tasks
- 1.1 Create `src/data/stream_ff_dataset.py`, a NEW script that replaces the old download-based `build_metadata.py` + `extract_frames.py` flow:

      """
      Stream FaceForensics++ c23 from HuggingFace.

      Fetches one video at a time, extracts eye sequences, saves .npz files,
      and discards the video. Each video touches disk only as a short-lived
      temp file that OpenCV reads; no raw videos are kept.

      Usage:
          python -m src.data.stream_ff_dataset \
              --out-root data/processed \
              --num-real 200 \
              --num-fake 200
      """
      import csv
      import os
      import tempfile
      from pathlib import Path

      import cv2
      import mediapipe as mp
      import numpy as np
      from datasets import load_dataset
      from tqdm import tqdm

      HF_DATASET = "bitmind/FaceForensicsC23"
      REAL_PATH_MARKER = "/Real/"
      FAKE_PATH_MARKER = "/Deepfakes/"  # use only the Deepfakes subfolder, not all 6

      # MediaPipe Face Mesh indices, ordered [p1..p6] as the EAR formula expects
      LEFT_EYE = [33, 160, 158, 133, 153, 144]
      RIGHT_EYE = [362, 385, 387, 263, 373, 380]


      def compute_ear(landmarks, eye_indices, width=1.0, height=1.0):
          """Compute Eye Aspect Ratio (EAR) from MediaPipe landmarks.

          Landmarks are normalized per axis, so pass the frame width and height
          to measure distances in pixels; otherwise non-square frames skew the ratio.
          """
          pts = [np.array([landmarks[i].x * width, landmarks[i].y * height])
                 for i in eye_indices]
          a = np.linalg.norm(pts[1] - pts[5])  # vertical distance p2-p6
          b = np.linalg.norm(pts[2] - pts[4])  # vertical distance p3-p5
          c = np.linalg.norm(pts[0] - pts[3])  # horizontal distance p1-p4
          return (a + b) / (2.0 * c + 1e-6)


      def extract_sequences_from_video_bytes(video_bytes, label, video_id, seq_len=16):
          """
          Given raw video bytes, extract non-overlapping eye-region sequences.
          Returns list of dicts: {'frames': (T,H,W,3), 'ear': (T,), 'label': int, 'video_id': str}
          """
          # OpenCV cannot decode from memory, so write the bytes to a temp file first
          with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as f:
              f.write(video_bytes)
              tmp_path = f.name

          all_frames, all_ears = [], []
          face_mesh = mp.solutions.face_mesh.FaceMesh(
              static_image_mode=False, max_num_faces=1, refine_landmarks=True
          )
          cap = cv2.VideoCapture(tmp_path)
          try:
              fps = cap.get(cv2.CAP_PROP_FPS) or 30
              frame_interval = max(1, int(fps / 10))  # sample at ~10fps
              frame_idx = 0
              while True:
                  ret, frame = cap.read()
                  if not ret:
                      break
                  if frame_idx % frame_interval == 0:
                      rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                      result = face_mesh.process(rgb)
                      if result.multi_face_landmarks:
                          lm = result.multi_face_landmarks[0].landmark
                          h, w = frame.shape[:2]
                          # Compute EAR (average of both eyes)
                          left_ear = compute_ear(lm, LEFT_EYE, w, h)
                          right_ear = compute_ear(lm, RIGHT_EYE, w, h)
                          ear = (left_ear + right_ear) / 2.0
                          # Crop eye region: bounding box around both eyes, 20px margin
                          eye_pts = [lm[i] for i in LEFT_EYE + RIGHT_EYE]
                          xs = [int(p.x * w) for p in eye_pts]
                          ys = [int(p.y * h) for p in eye_pts]
                          x1, x2 = max(0, min(xs) - 20), min(w, max(xs) + 20)
                          y1, y2 = max(0, min(ys) - 20), min(h, max(ys) + 20)
                          crop = rgb[y1:y2, x1:x2]
                          if crop.size > 0:
                              crop = cv2.resize(crop, (224, 224))  # ViT input size
                              all_frames.append(crop)
                              all_ears.append(ear)
                  frame_idx += 1
          finally:
              cap.release()
              face_mesh.close()
              os.unlink(tmp_path)  # delete the temp file even if decoding fails

          # Slice into non-overlapping sequences of length seq_len
          sequences = []
          for i in range(0, len(all_frames) - seq_len + 1, seq_len):
              frames = np.stack(all_frames[i:i + seq_len]).astype(np.uint8)
              ears = np.array(all_ears[i:i + seq_len], dtype=np.float32)
              sequences.append({
                  'frames': frames,
                  'ear': ears,
                  'label': label,
                  'video_id': f"{video_id}_seq{i}"
              })
          return sequences


      def stream_and_extract(out_root, num_real=200, num_fake=200, seq_len=16):
          out_root = Path(out_root)
          out_root.mkdir(parents=True, exist_ok=True)

          # Stream dataset: never downloads the full archive
          ds = load_dataset(HF_DATASET, streaming=True, split="train")
          real_count, fake_count = 0, 0
          metadata_rows = []
          pbar = tqdm(total=num_real + num_fake, desc="Streaming videos")

          for item in ds:
              # Assumption: when streaming, item['video'] is a dict with
              # 'path' and 'bytes' keys. Verify against the dataset schema.
              video = item.get('video')
              if isinstance(video, dict):
                  video_path_str = str(video.get('path') or '')
                  video_bytes = video.get('bytes')
              else:
                  video_path_str = str(video or '')
                  video_bytes = None

              is_real = REAL_PATH_MARKER in video_path_str and real_count < num_real
              is_fake = FAKE_PATH_MARKER in video_path_str and fake_count < num_fake
              if not is_real and not is_fake:
                  continue
              if video_bytes is None:
                  continue

              label = 0 if is_real else 1
              video_id = Path(video_path_str).stem
              sequences = extract_sequences_from_video_bytes(
                  video_bytes, label, video_id, seq_len
              )
              for seq in sequences:
                  npz_path = out_root / f"{seq['video_id']}.npz"
                  np.savez_compressed(
                      npz_path,
                      frames=seq['frames'],
                      ear=seq['ear'],
                      label=np.array(seq['label']),
                      video_id=np.array(seq['video_id'])
                  )
                  metadata_rows.append({
                      'npz_path': str(npz_path),
                      'label': label,
                      'video_id': video_id,
                      'split': 'train'  # will be reassigned below
                  })

              if is_real:
                  real_count += 1
              else:
                  fake_count += 1
              pbar.update(1)
              if real_count >= num_real and fake_count >= num_fake:
                  break
          pbar.close()

          # Assign splits: 70% train, 15% val, 15% test (by video_id, not sequence)
          unique_ids = sorted({r['video_id'] for r in metadata_rows})
          np.random.default_rng(42).shuffle(unique_ids)  # fixed seed: reproducible split
          n = len(unique_ids)
          train_ids = set(unique_ids[:int(0.7 * n)])
          val_ids = set(unique_ids[int(0.7 * n):int(0.85 * n)])
          for row in metadata_rows:
              if row['video_id'] in train_ids:
                  row['split'] = 'train'
              elif row['video_id'] in val_ids:
                  row['split'] = 'val'
              else:
                  row['split'] = 'test'

          # Write metadata CSV
          csv_path = Path('data/metadata.csv')
          csv_path.parent.mkdir(parents=True, exist_ok=True)
          with open(csv_path, 'w', newline='') as f:
              writer = csv.DictWriter(f, fieldnames=['npz_path', 'label', 'video_id', 'split'])
              writer.writeheader()
              writer.writerows(metadata_rows)

          print(f"\nDone! {real_count} real + {fake_count} fake videos processed.")
          print(f"Total sequences: {len(metadata_rows)}")
          print(f"Metadata written to: {csv_path}")
          print(f"Sequences saved to: {out_root}")


      if __name__ == '__main__':
          import argparse
          parser = argparse.ArgumentParser()
          parser.add_argument('--out-root', default='data/processed')
          parser.add_argument('--num-real', type=int, default=200)
          parser.add_argument('--num-fake', type=int, default=200)
          parser.add_argument('--seq-len', type=int, default=16)
          args = parser.parse_args()
          stream_and_extract(args.out_root, args.num_real, args.num_fake, args.seq_len)
- 1.2 Run the streaming script:

      source .venv311/bin/activate
      python -m src.data.stream_ff_dataset \
          --out-root data/processed \
          --num-real 200 \
          --num-fake 200

  This will run for ~20-60 minutes depending on internet speed. It streams each video, processes it, saves a tiny `.npz`, and moves on. Your terminal will show a progress bar.
- 1.3 When done, verify output:

      ls data/processed/ | wc -l   # should be several hundred .npz files
      du -sh data/processed/       # should be ~100-300MB total
      python -c "
      import numpy as np
      d = np.load('data/processed/' + __import__('os').listdir('data/processed')[0], allow_pickle=True)
      print('frames:', d['frames'].shape)  # expect (16, 224, 224, 3)
      print('ear:', d['ear'].shape)        # expect (16,)
      print('label:', d['label'])          # expect 0 or 1
      "
- 1.4 Verify `data/metadata.csv` has rows with `npz_path`, `label`, `video_id`, `split` columns and a healthy mix of train/val/test rows
- 1.5 Update `src/data/dataset.py` to read from `data/metadata.csv` (pointing to `.npz` files) instead of from raw video paths. The `__getitem__` contract remains unchanged: `{'frames': Tensor[T,3,224,224], 'ear': Tensor[T], 'label': int}`
- 1.6 Update `configs/base.yaml`:

      data:
        metadata_csv: data/metadata.csv
        processed_root: data/processed
        seq_len: 16
        img_size: 224
- 1.7 Add to `.gitignore`:

      data/processed/
      data/raw/
      data/metadata.csv
      outputs/
      *.npz
      *.pt
      .venv311/
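The "healthy mix" check in 1.4 can be made concrete. The sketch below counts rows per split and label and flags any `video_id` that appears in more than one split, which would leak sequences from the same video across train and test. The function names are hypothetical; the column names are the ones the metadata CSV is specified to have.

```python
import csv
from collections import Counter, defaultdict


def summarize_rows(rows):
    """Count rows per (split, label) and list video_ids that span several splits."""
    counts = Counter()
    splits_by_video = defaultdict(set)
    for row in rows:
        counts[(row["split"], str(row["label"]))] += 1
        splits_by_video[row["video_id"]].add(row["split"])
    leaked = sorted(v for v, s in splits_by_video.items() if len(s) > 1)
    return counts, leaked


def summarize_metadata(csv_path="data/metadata.csv"):
    """Same summary, read from the metadata CSV on disk."""
    with open(csv_path, newline="") as f:
        return summarize_rows(csv.DictReader(f))


# Tiny in-memory demo: video "b" wrongly appears in two splits.
demo_rows = [
    {"split": "train", "label": "0", "video_id": "a"},
    {"split": "train", "label": "1", "video_id": "b"},
    {"split": "test", "label": "1", "video_id": "b"},
]
demo_counts, demo_leaked = summarize_rows(demo_rows)
print(dict(demo_counts), demo_leaked)
```

On the real file, call `summarize_metadata()` after the streaming run; the leaked list should be empty and every split should contain both labels.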
Phase 2: Dataset Loader Verification
Goal: Confirm src/data/dataset.py correctly reads the .npz files produced by streaming.
Tasks
- 2.1 Open `src/data/dataset.py` and update it to read from `metadata.csv` instead of raw video paths. Each row's `npz_path` points directly to a processed sequence file.
- 2.2 Add `albumentations` augmentations for training split only:

      import albumentations as A
      from albumentations.pytorch import ToTensorV2

      train_transform = A.Compose([
          A.HorizontalFlip(p=0.5),
          A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, p=0.5),
          A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
          A.ImageCompression(quality_lower=70, quality_upper=100, p=0.3),
          A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
          ToTensorV2(),
      ])
- 2.3 Smoke-test the DataLoader:

      from src.data.dataset import EyeBlinkDataset

      ds = EyeBlinkDataset('data/metadata.csv', split='train')
      sample = ds[0]
      assert sample['frames'].shape == (16, 3, 224, 224)
      assert sample['ear'].shape == (16,)
      assert sample['label'] in [0, 1]
      print("DataLoader OK")
Phase 3: Model Architecture Verification & Fix
Goal: Ensure the LRCN + ViT hybrid model is correctly implemented and matches the research proposal.
Architecture Spec (from proposal)
    Input: eye-region sequence (T=16 frames, each 224×224 RGB) + EAR signal (T floats)
      ↓
    ViT Backbone (timm: vit_small_patch16_224, pretrained=True)
      → Per-frame [CLS] token → shape (T, 384)
      ↓
    LSTM Temporal Encoder
      → Hidden size: 256, Num layers: 2, Dropout: 0.3
      ↓
    Blink Dynamics Head
      → Concatenate LSTM output + EAR
      → FC(257, 128) → ReLU
      → Blink timing constraint (0.1-0.4s window)
      ↓
    Classifier Head
      → FC(256, 128) → ReLU → Dropout(0.5) → FC(128, 2)
      → Output: [real_logit, fake_logit]
Tasks
- 3.1 Open `src/models/backbones.py` and verify `build_backbone(config)` returns a timm ViT. For `vit_small_patch16_224` embed dim = 384.
- 3.2 Open `src/models/lrcn_vit.py` and verify the forward pass. Frames arrive as `(B, T, 3, 224, 224)`. Reshape to `(B*T, 3, 224, 224)` before ViT, then reshape back to `(B, T, embed_dim)` before LSTM.
- 3.3 Add attention consistency loss: KL-divergence between adjacent frame ViT attention maps, weighted by `lambda_attn`.
- 3.4 Add blink timing regularizer: penalize uncertain predictions when EAR < 0.2 but blink duration is outside 0.1-0.4s. Weight: `lambda_blink`.
- 3.5 Add unit test in `tests/test_model.py`:

      model = LRCNViT(config)
      dummy = {'frames': torch.randn(2, 16, 3, 224, 224), 'ear': torch.randn(2, 16)}
      out = model(dummy)
      assert out['logits'].shape == (2, 2)
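The blink timing regularizer in 3.4 needs blink durations measured from the EAR trace before any penalty can be applied. The sketch below covers only that measurement step, not the loss itself. It assumes a blink is a run of consecutive frames with EAR below 0.2 and that sequences are sampled at roughly 10 fps, as in the extraction step; the function names are hypothetical.

```python
def blink_durations(ear, fps=10.0, threshold=0.2):
    """Duration in seconds of each run of consecutive frames with EAR < threshold."""
    durations, run = [], 0
    for value in ear:
        if value < threshold:
            run += 1
        elif run:
            durations.append(run / fps)
            run = 0
    if run:  # blink still in progress at the end of the sequence
        durations.append(run / fps)
    return durations


def implausible_blinks(ear, fps=10.0, threshold=0.2, min_s=0.1, max_s=0.4):
    """Blink durations that fall outside the [min_s, max_s] window."""
    return [d for d in blink_durations(ear, fps, threshold)
            if not (min_s <= d <= max_s)]


# One 2-frame blink (0.2 s, plausible) and one 6-frame closure (0.6 s, too long).
trace = [0.3, 0.3, 0.15, 0.12, 0.3, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.3]
print(blink_durations(trace))     # [0.2, 0.6]
print(implausible_blinks(trace))  # [0.6]
```

At 10 fps a single frame already spans 0.1 s, so the lower bound of the window can never be violated and durations are quantized to 0.1 s steps. A higher sampling rate would be needed to resolve shorter blinks.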
Phase 4: Training Loop Fix & Wire-Up
Goal: Get the full training loop running end-to-end with adversarial training and all loss components.
Tasks
- 4.1 Open `src/train/train.py` and verify it loads config, DataLoader, model, AdamW, LR scheduler, and saves `outputs/best.pt` on val AUC improvement.
- 4.2 Wire in `wandb`: if `config.wandb.enabled: true`, call `wandb.init()` and log metrics each epoch.
- 4.3 Total loss formula:

      L_total = L_ce(clean) + alpha * L_ce(adversarial) + lambda_attn * L_attn_consistency + lambda_blink * L_blink_regularizer
- 4.4 Open `src/train/adversarial.py` and verify PGD: `eps=8/255`, `steps=10`, applied only to eye-region frames.
- 4.5 Add gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`
- 4.6 Update `configs/train/aat_pgd.yaml`:

      epochs: 30
      batch_size: 16
      lr: 3.0e-4            # PyYAML reads "3e-4" (no decimal point) as a string
      weight_decay: 1.0e-4
      alpha: 0.5
      lambda_attn: 0.1
      lambda_blink: 0.05
      pgd_eps: 0.031        # approx. 8/255
      pgd_steps: 10
      wandb:
        enabled: false
        project: "deepfake-eye-blink"
- 4.7 Smoke-train: 2 epochs on 50 samples, confirm zero errors.
- 4.8 Full training: `python -m src.train.train --config configs/train/aat_pgd.yaml`
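Written out, the objective in 4.3 and the PGD attack in 4.4 take roughly the following form. This is a sketch under assumptions: the plan fixes only the budget (8/255) and the step count (10), so the step size $\eta$, the choice of the $\ell_\infty$ ball, and the notation $f_\theta$ for the model are illustrative.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CE}}\big(f_\theta(x),\, y\big)
  + \alpha \, \mathcal{L}_{\text{CE}}\big(f_\theta(x^{\text{adv}}),\, y\big)
  + \lambda_{\text{attn}} \, \mathcal{L}_{\text{attn}}
  + \lambda_{\text{blink}} \, \mathcal{L}_{\text{blink}}

x^{(0)} = x, \qquad
x^{(k+1)} = \Pi_{\|x' - x\|_\infty \le \epsilon}
  \Big( x^{(k)} + \eta \, \operatorname{sign}\big( \nabla_{x} \mathcal{L}_{\text{CE}}(f_\theta(x^{(k)}),\, y) \big) \Big),
\qquad k = 0, \dots, 9, \qquad x^{\text{adv}} = x^{(10)}
```

Here $\Pi$ projects back onto the $\epsilon$-ball around the clean input, with $\epsilon = 8/255$. With the values from 4.6, the weights are $\alpha = 0.5$, $\lambda_{\text{attn}} = 0.1$, and $\lambda_{\text{blink}} = 0.05$.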
Phase 5: Evaluation & Ablation
Goal: Produce evaluation numbers and ablation table for the thesis.
Tasks
- 5.1 Open `src/eval/evaluate.py` and verify it outputs Accuracy, Precision, Recall, F1, AUC.
- 5.2 Run: `python -m src.eval.evaluate --checkpoint outputs/best.pt --config configs/train/aat_pgd.yaml`
- 5.3 Open `src/eval/ablation.py` and confirm 4 configs: Full / No AAT / No ViT / No blink regularizer.
- 5.4 Run ablation: `python -m src.eval.ablation --config configs/train/aat_pgd.yaml`
- 5.5 Open `src/eval/plots.py` and confirm it generates `confusion_matrix.png` and `roc_curve.png`.
- 5.6 Fill in `docs/results_template.md` with actual numbers.
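When verifying the numbers from 5.1, a dependency-free cross-check of the five metrics can help catch label or threshold mix-ups. The sketch below assumes label 1 means fake and that `scores` are fake probabilities; the function name is hypothetical and it says nothing about how `evaluate.py` is actually implemented. scikit-learn, which is pinned in `requirements.txt`, provides equivalent functions.

```python
def classification_metrics(labels, scores, threshold=0.5):
    """Accuracy, precision, recall, F1 and ROC AUC for binary labels (1 = fake)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # AUC as the probability that a random fake outscores a random real
    # sample, counting ties as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": auc,
    }


print(classification_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC is quadratic in the number of samples, which is fine for a test split of a few hundred sequences but not for large datasets.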
Phase 6: Inference API
Goal: FastAPI server that accepts an uploaded video and returns a prediction.
New files
api/
main.py
inference.py # reuses the same eye extraction logic from stream_ff_dataset.py
schemas.py
requirements.txt
Tasks
- 6.1 `api/inference.py`: reuse `extract_sequences_from_video_bytes()` from `stream_ff_dataset.py`. Load model once, run forward pass on all sequences, average predictions across sequences.
- 6.2 `api/main.py`: `/predict` endpoint (POST, multipart file upload) + `/health` endpoint.
- 6.3 Load model at startup via FastAPI `lifespan`, not per-request.
- 6.4 Add CORS for `http://localhost:5173`.
- 6.5 `api/requirements.txt`: `fastapi>=0.111.0`, `uvicorn[standard]`, `python-multipart>=0.0.9`
- 6.6 Test: `curl -X POST http://localhost:8000/predict -F "file=@test_video.mp4"`
- 6.7 `scripts/start_api.sh`:

      source .venv311/bin/activate
      uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
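The "average predictions across sequences" step in 6.1 can be sketched without any framework. The version below assumes the model returns one `[real_logit, fake_logit]` pair per sequence and averages the softmax fake probabilities (averaging raw logits would be an alternative). The function name and the response field names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_video_prediction(sequence_logits, threshold=0.5):
    """Average per-sequence fake probabilities into one video-level verdict.

    `sequence_logits` holds one [real_logit, fake_logit] pair per sequence.
    """
    if not sequence_logits:
        raise ValueError("no sequences extracted (no face found or video too short)")
    fake_probs = [softmax(pair)[1] for pair in sequence_logits]
    mean_fake = sum(fake_probs) / len(fake_probs)
    return {
        "label": "FAKE" if mean_fake >= threshold else "REAL",
        "fake_probability": mean_fake,
        "num_sequences": len(fake_probs),
    }


print(aggregate_video_prediction([[2.0, 0.0], [1.0, 0.0]])["label"])  # REAL
print(aggregate_video_prediction([[0.0, 3.0]])["label"])              # FAKE
```

The empty-input error matters for the API: a video with no detectable face yields zero sequences, and the endpoint should return a clear client error rather than dividing by zero.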
Phase 7: Demo Frontend
Goal: React web UI for the defence demonstration.
Stack: React + Vite + Tailwind + Recharts
frontend/
src/
App.jsx
components/
VideoUploader.jsx
ResultCard.jsx
FrameChart.jsx
AttentionViewer.jsx
index.html
package.json
vite.config.js
Tasks
- 7.1 `cd frontend && npm create vite@latest . -- --template react && npm install`
- 7.2 `npm install tailwindcss recharts axios`
- 7.3 `VideoUploader.jsx`: drag-and-drop or file picker for `.mp4`/`.avi`/`.mov`, video preview, "Analyse Video" button, loading spinner.
- 7.4 `ResultCard.jsx`: REAL (green) / FAKE (red) verdict badge, confidence %, blink rate stat.
- 7.5 `FrameChart.jsx`: Recharts line chart of per-frame fake probability, frames above 0.5 highlighted red.
- 7.6 `AttentionViewer.jsx`: Grad-CAM attention overlay image from API response.
- 7.7 Proxy in `vite.config.js`: `/predict` → `http://localhost:8000/predict`
- 7.8 `frontend/.env`: `VITE_API_URL=http://localhost:8000`
- 7.9 `scripts/start_frontend.sh`: `cd frontend && npm run dev`
Phase 8: Integration & Final QA
- 8.1 Run API + frontend together. Upload one of the `.npz` source videos as a test.
- 8.2 Test with a real webcam recording; it should return REAL.
- 8.3 Fix any CORS issues.
- 8.4 Create `docs/README_DEMO.md`:

      1. source .venv311/bin/activate
      2. ./scripts/start_api.sh (Terminal 1)
      3. ./scripts/start_frontend.sh (Terminal 2)
      4. Open http://localhost:5173
- 8.5 Document exact setup commands for a fresh machine.
Project Directory Structure (Final)
    deepfake-detector/
    ├── configs/
    │   ├── base.yaml
    │   ├── model/lrcn_vit.yaml
    │   └── train/aat_pgd.yaml
    ├── src/
    │   ├── data/
    │   │   ├── stream_ff_dataset.py    ← NEW (replaces download-based flow)
    │   │   ├── extract_eye_sequences.py
    │   │   └── dataset.py
    │   ├── models/
    │   │   ├── backbones.py
    │   │   └── lrcn_vit.py
    │   ├── train/
    │   │   ├── train.py
    │   │   └── adversarial.py
    │   ├── eval/
    │   │   ├── evaluate.py
    │   │   ├── ablation.py
    │   │   └── plots.py
    │   ├── viz/
    │   │   └── attention_maps.py
    │   └── utils.py
    ├── api/
    │   ├── main.py
    │   ├── inference.py
    │   ├── schemas.py
    │   └── requirements.txt
    ├── frontend/
    │   ├── src/
    │   │   ├── App.jsx
    │   │   └── components/
    │   ├── index.html
    │   ├── package.json
    │   └── vite.config.js
    ├── data/
    │   ├── processed/      ← .npz files only (~200MB), gitignored
    │   └── metadata.csv    ← generated, gitignored
    ├── outputs/
    │   ├── best.pt
    │   ├── confusion_matrix.png
    │   └── roc_curve.png
    ├── scripts/
    │   ├── run_local.sh
    │   ├── run_cloud.sh
    │   ├── start_api.sh
    │   └── start_frontend.sh
    ├── tests/
    │   └── test_model.py
    ├── docs/
    │   ├── reproducibility_checklist.md
    │   ├── results_template.md
    │   └── README_DEMO.md
    ├── .gitignore
    ├── requirements.txt
    └── README.md
Suggestions & Overrides
⚠️ Old files to DEPRECATE (keep but do not use)
src/data/build_metadata.py and src/data/extract_frames.py were written for a local download workflow. They are superseded by stream_ff_dataset.py. Keep them in the repo for reference but do not run them.
⚠️ ViT Input Resolution
Eye crops are resized to 224×224 directly in the streaming script. No resizing needed elsewhere.
⚠️ Internet Required for Phase 1
The streaming script needs internet during the ~20-60 min preprocessing run. After that, everything runs offline from the .npz files.
⚠️ Pre-trained Checkpoint Option
Use timm's pretrained ViT weights (ImageNet). Fine-tuning for 5-10 epochs on 400 videos should be sufficient for a compelling defence demo.
✅ Frontend: Keep it Simple
Single-page upload β result. No auth, no database needed.