# Deepfake Detection: Multi-View Anomaly Ensemble
A 5-model soft-vote ensemble for binary deepfake detection, trained on FaceForensics++ C23 and zero-shot evaluated on Celeb-DF v2. Published as the production weights for the live demo at deepfakedetection.xyz.
- Current revision: `v2-2026-05-07`
- Rollback revision: `v1-pre-update`
## What's in this repo
Five fine-tuned PyTorch checkpoints, each saved as `{"state_dict": ..., "meta": ...}`:
| File | Architecture | Pretraining | Input |
|---|---|---|---|
| `resnet18_best.pth` | ResNet-18 (2D) | ImageNet-1k | 16 frames @ 224² (MTCNN) |
| `efficientnet_b4_best.pth` | EfficientNet-B4 (2D) | ImageNet-1k | 16 frames @ 224² (MTCNN) |
| `r3d18_best.pth` | R3D-18 (3D) | Kinetics-400 | 16 frames @ 112² (MTCNN) |
| `vit_base_patch16_224_best.pth` | ViT-B/16 (2D) | ImageNet-21k → 1k | 16 frames @ 224² (MTCNN) |
| `r3d18_raft_best.pth` | R3D-18 (3D, same arch) | Kinetics-400 | 16 frames @ 112² (RAFT) |
Total ~691 MB. All five run on CPU; CUDA recommended for the 3D models.
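The Input column assumes face crops rather than raw frames. Below is a minimal sketch of that preprocessing using facenet-pytorch's MTCNN; the uniform 16-frame sampling and the margin value are assumptions for illustration, and the exact pipeline lives in the companion repo:

```python
# Illustrative frame-sampling + MTCNN face-crop step. The margin, uniform
# sampling, and skip-on-no-face behaviour are assumptions, not the repo's
# exact preprocessing.
import cv2
import numpy as np
import torch
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224, margin=20, post_process=True)

def sample_face_clip(video_path: str, num_frames: int = 16) -> torch.Tensor:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, total - 1, num_frames).astype(int)
    faces = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if not ok:
            continue
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        face = mtcnn(rgb)  # (3, 224, 224) tensor, or None if no face found
        if face is not None:
            faces.append(face)
    cap.release()
    return torch.stack(faces)  # (T, 3, 224, 224); raises if no faces at all
```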
Anomaly classes covered by the ensemble:
- Spatial – `resnet18`, `efficientnet_b4` (per-frame texture / generation artifacts)
- Global consistency – `vit_base_patch16_224` (long-range face structure)
- Temporal-motion – `r3d18` (native 3D conv over RGB clips)
- Motion-flow – `r3d18_raft` (3D conv over RAFT optical-flow-interpolated clips)
The 2D backbones (ResNet-18, EfficientNet-B4, ViT-B/16) apply the backbone per frame and mean-pool features over the temporal dimension before a 2-class linear head. The 3D backbone (R3D-18) operates natively on (B, C, T, H, W) clips. R3D-18+RAFT shares the R3D-18 architecture but is trained on a different frame stream β RAFT optical-flow-interpolated crops that emphasise inter-frame motion.
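A minimal sketch of that 2D wrapper, shown for ResNet-18 (module names are illustrative; the reference implementation lives in the deepfake-detection-app backend):

```python
# Sketch of the 2D per-frame backbone + temporal mean-pool head described
# above. Wiring and names are illustrative, not the repo's actual classes.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class FrameMeanPoolDetector(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        feat_dim = backbone.fc.in_features            # 512 for ResNet-18
        backbone.fc = nn.Identity()                   # keep pooled features only
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, num_classes)  # 2-class head: real / fake

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, 3, H, W) -> run the backbone on every frame
        b, t, c, h, w = clips.shape
        feats = self.backbone(clips.reshape(b * t, c, h, w))  # (B*T, 512)
        feats = feats.reshape(b, t, -1).mean(dim=1)           # mean-pool over T
        return self.head(feats)                               # (B, 2) logits
```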
## Performance
### In-distribution (FaceForensics++ C23 test, 900 videos)
| Model | Accuracy | F1 | AUC |
|---|---|---|---|
| ResNet-18 | 0.9878 | 0.9926 | 0.9999 |
| EfficientNet-B4 | 1.0000 | 1.0000 | 1.0000 |
| R3D-18 | 0.9889 | 0.9933 | 0.9995 |
| ViT-B/16 | 0.9889 | 0.9933 | 0.9998 |
| R3D-18 + RAFT | 0.9878 | 0.9926 | 0.9996 |
| Ensemble (soft-vote, τ = 0.1) | 1.0000 | 1.0000 | 1.0000 |
### Cross-dataset zero-shot (Celeb-DF v2, 6,528 videos)
| Model | Celeb-DF AUC | Generalization gap (Δ_AUC) |
|---|---|---|
| EfficientNet-B4 | 0.8109 | 0.1891 |
| R3D-18 | 0.8383 | 0.1612 |
| ResNet-18 | 0.8427 | 0.1572 |
| ViT-B/16 | 0.8475 | 0.1523 |
| R3D-18 + RAFT | 0.8775 | 0.1221 |
| Ensemble (soft-vote, τ = 0.1) | 0.8851 | 0.1149 |
Here Δ_AUC is the drop from the in-distribution AUC above to the Celeb-DF AUC. The ensemble reduces the generalization gap by 26.9% relative to the ResNet-18 baseline (0.1572 → 0.1149) and beats every single model on Celeb-DF AUC.
## Training recipe
All five models were trained on a single Colab Pro NVIDIA L4 GPU with the same two-stage transfer-learning procedure (`src.training.train_two_stage` in the companion repo):
- Stage 1: frozen backbone, head only (10 epochs)
- Stage 2: full end-to-end fine-tune (10 epochs)
The best checkpoint (saved as `*_best.pth`) is the highest-validation-accuracy state across both stages combined, so an early stage-1 head can win if a stage-2 epoch overfits. Per-epoch checkpoints are also persisted for resumability against Colab disconnects.
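A condensed sketch of the staging logic, assuming a model with separate `backbone` and `head` attributes and a generic `train_epochs` helper (both illustrative names; the actual implementation is `src.training.train_two_stage`):

```python
# Condensed sketch of the two-stage schedule. `model.backbone`, `model.head`,
# and `train_epochs` (an epoch loop that tracks the best-val-acc checkpoint)
# are illustrative stand-ins for the companion repo's code.
import torch

def train_two_stage(model, train_epochs, stage1_lr=1e-3, stage2_lr=1e-4):
    # Stage 1: freeze the backbone, train only the classification head.
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(model.head.parameters(), lr=stage1_lr)
    train_epochs(model, opt, epochs=10)

    # Stage 2: unfreeze everything and fine-tune end-to-end at a lower LR.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=stage2_lr)
    train_epochs(model, opt, epochs=10)
```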
| Component | Setting |
|---|---|
| Optimizer | `torch.optim.Adam` (no weight decay) |
| Loss | `nn.CrossEntropyLoss` over 2 classes (real, fake) |
| LR scheduler | None (constant LR per stage) |
| Mixed precision | `torch.amp.GradScaler("cuda")` enabled on CUDA |
| Frames per clip | 16 (MTCNN-cropped, or RAFT-interpolated for R3D-18+RAFT) |
| Train/val/test split | Identity-disjoint, video-level, FF++ C23 |
| Random seed | 42 |
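Put together, one optimization step under these settings looks roughly like the following sketch (not the repo's exact loop):

```python
# Illustrative mixed-precision step matching the table above: Adam without
# weight decay, 2-class CrossEntropyLoss, torch.amp GradScaler on CUDA.
import torch
import torch.nn as nn

def make_train_step(model: nn.Module, lr: float = 1e-3, device: str = "cuda"):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # no weight decay
    scaler = torch.amp.GradScaler(device, enabled=(device == "cuda"))

    def step(clips: torch.Tensor, labels: torch.Tensor) -> float:
        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast(device_type=device, enabled=(device == "cuda")):
            loss = criterion(model(clips.to(device)), labels.to(device))
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        return loss.item()

    return step
```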
### Per-model hyperparameters
| Model | Batch | Stage-1 LR | Stage-2 LR | Train time |
|---|---|---|---|---|
| ResNet-18 | 16 | 1e-3 | 1e-4 | 79.3 min |
| EfficientNet-B4 | 8 | 1e-3 | 1e-4 | 106.7 min |
| R3D-18 | 8 | 1e-3 | 1e-4 | 174.6 min |
| R3D-18 + RAFT | 8 | 1e-3 | 1e-4 | 164.1 min |
| ViT-B/16 | 8 | 5e-4 | 5e-5 | 196.4 min |
ResNet-18 deliberately uses the A2 baseline recipe (normalize-only transforms and a plain shuffled DataLoader, with no augmentation and no weighted sampler) so it remains a clean reference point. The four advanced models add per-frame augmentation and class-weighted sampling, as sketched below.
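A sketch of that contrast, assuming a `train_ds` dataset that yields tensor frames and exposes integer `labels`; the specific augmentations shown are illustrative, not the repo's exact list:

```python
# Illustrative contrast between the A2 baseline recipe and the augmented
# recipe used by the other four models. `train_ds`, its `labels` attribute,
# and the augmentation choices are assumptions.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

IMAGENET_NORM = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

# Baseline (ResNet-18): normalize-only transform, plain shuffled loader.
baseline_tf = transforms.Compose([IMAGENET_NORM])

# Advanced models: per-frame augmentation on tensor frames, then normalize.
aug_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    IMAGENET_NORM,
])

def make_loaders(train_ds):
    baseline = DataLoader(train_ds, batch_size=16, shuffle=True)

    # Class-weighted sampling: the rarer class is drawn proportionally more.
    labels = torch.tensor(train_ds.labels)  # 0 = real, 1 = fake (assumed)
    weights = (1.0 / torch.bincount(labels).float())[labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(labels))
    weighted = DataLoader(train_ds, batch_size=8, sampler=sampler)
    return baseline, weighted
```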
## How to use
```python
import torch
from huggingface_hub import hf_hub_download

REPO = "abraraltaf92/deepfake-detection-models"

# 1. Download a checkpoint (cached). Pin a revision for reproducibility:
weights_path = hf_hub_download(
    repo_id=REPO,
    filename="resnet18_best.pth",
    revision="v2-2026-05-07",
)

# 2. Build the architecture (reference impl in the deepfake-detection-app backend).
model = build_resnet18_deepfake_detector()  # binary head: real / fake

# 3. Load the fine-tuned weights, wrapped as {"state_dict": ..., "meta": ...}.
data = torch.load(weights_path, map_location="cpu", weights_only=False)
model.load_state_dict(data["state_dict"])
print("best_val_acc:", data["meta"].get("best_val_acc"))
model.eval()
```
The full ensemble inference pipeline (MTCNN face extraction → RAFT optical flow → 5-model soft-vote → Grad-CAM) lives in the deepfake-detection-app backend.
## Decision threshold
The ensemble averages per-model fake-class probabilities and applies a calibrated decision threshold of τ = 0.1. This was chosen to maximise F1 on the Celeb-DF v2 cross-dataset eval, where high-capacity static-image models (EfficientNet-B4, ViT-B/16) tend to under-call fakes that are caught by the motion-aware models. A low threshold lets one or two confident "fake" votes outweigh several uncertain "real" votes.
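In code, the decision rule amounts to the following sketch (`models` holds the five loaded detectors and `clip` a preprocessed frame stack; for simplicity this ignores that the RAFT model consumes a different input stream):

```python
# Soft-vote decision rule described above: average the fake-class probability
# across the five models and compare against tau = 0.1. `models` and `clip`
# are assumed to come from the loading/preprocessing steps shown earlier; in
# practice the RAFT model would receive its own optical-flow clip.
import torch

TAU = 0.1

@torch.no_grad()
def ensemble_predict(models, clip: torch.Tensor) -> tuple[bool, float]:
    probs = [torch.softmax(m(clip.unsqueeze(0)), dim=1)[0, 1] for m in models]
    fake_prob = torch.stack(probs).mean().item()  # mean fake-class probability
    return fake_prob >= TAU, fake_prob            # (is_fake, score)
```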
## Limitations and intended use
- Trained on FaceForensics++ C23 only: primarily Deepfakes / Face2Face / FaceSwap / NeuralTextures manipulations on YouTube faces.
- Cross-dataset AUC of 0.885 on Celeb-DF v2 is strong but not perfect; expect false-positive and false-negative rates of roughly 10–15% on out-of-distribution clips.
- Faces only: the upstream pipeline assumes MTCNN can detect at least one face per frame. Animation, heavily occluded faces, and non-frontal poses are out of scope.
- Not a forensic tool. Use for research, demos, and educational content moderation. Do not rely on this for evidentiary or legal purposes.
- Bias considerations: FF++ skews towards Western, well-lit, frontal-pose faces. Performance on under-represented demographics has not been formally audited.
## Citation
If you use these weights, please cite:
```bibtex
@misc{zahid2026deepfake,
  title  = {Robust Deepfake Detection via Multi-View Anomaly Ensemble},
  author = {Zahid, Muhammad Abdullah and Lone, Abrar Altaf and Maniyar, Krishna Kirit and Shergill, Sajan Singh},
  year   = {2026},
  note   = {CS 668 Capstone, Pace University},
  url    = {https://github.com/abraraltaf92/deepfake-detection}
}
```
## License
MIT; see `LICENSE` in the deepfake-detection-app repository.
## Related links
- Live demo: deepfakedetection.xyz
- Source code: github.com/abraraltaf92/deepfake-detection
- Web app: github.com/abraraltaf92/deepfake-detection-app
- Datasets: FaceForensics++ · Celeb-DF v2