# Deepfake Detection: Multi-View Anomaly Ensemble
A 5-model soft-vote ensemble for binary deepfake detection, trained on FaceForensics++ C23 and zero-shot evaluated on Celeb-DF v2. Published as the production weights for the live demo at deepfakedetection.xyz.
- Current revision: `v2-2026-05-07`
- Rollback revision: `v1-pre-update`
## What's in this repo
Five fine-tuned PyTorch checkpoints, each saved as `{"state_dict": ..., "meta": ...}`:
| File | Architecture | Pretraining | Input |
|---|---|---|---|
| `resnet18_best.pth` | ResNet-18 (2D) | ImageNet-1k | 16 frames @ 224² (MTCNN) |
| `efficientnet_b4_best.pth` | EfficientNet-B4 (2D) | ImageNet-1k | 16 frames @ 224² (MTCNN) |
| `r3d18_best.pth` | R3D-18 (3D) | Kinetics-400 | 16 frames @ 112² (MTCNN) |
| `vit_base_patch16_224_best.pth` | ViT-B/16 (2D) | ImageNet-21k → 1k | 16 frames @ 224² (MTCNN) |
| `r3d18_raft_best.pth` | R3D-18 (3D, same arch) | Kinetics-400 | 16 frames @ 112² (RAFT) |
Total ~691 MB. All five run on CPU; CUDA recommended for the 3D models.
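The Input column assumes face crops rather than raw frames. Below is a minimal sketch of that preprocessing using facenet-pytorch's MTCNN; the uniform 16-frame sampling and the margin value are assumptions for illustration, and the exact pipeline lives in the companion repo:

```python
# Illustrative frame-sampling + MTCNN face-crop step. The margin, uniform
# sampling, and skip-on-no-face behaviour are assumptions, not the repo's
# exact preprocessing.
import cv2
import numpy as np
import torch
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224, margin=20, post_process=True)

def sample_face_clip(video_path: str, num_frames: int = 16) -> torch.Tensor:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, total - 1, num_frames).astype(int)
    faces = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if not ok:
            continue
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        face = mtcnn(rgb)  # (3, 224, 224) tensor, or None if no face found
        if face is not None:
            faces.append(face)
    cap.release()
    return torch.stack(faces)  # (T, 3, 224, 224); raises if no faces at all
```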
Anomaly classes covered by the ensemble:
- Spatial – `resnet18`, `efficientnet_b4` (per-frame texture / generation artifacts)
- Global consistency – `vit_base_patch16_224` (long-range face structure)
- Temporal-motion – `r3d18` (native 3D conv over RGB clips)
- Motion-flow – `r3d18_raft` (3D conv over RAFT optical-flow-interpolated clips)
The 2D backbones (ResNet-18, EfficientNet-B4, ViT-B/16) apply the backbone per frame and mean-pool features over the temporal dimension before a 2-class linear head. The 3D backbone (R3D-18) operates natively on (B, C, T, H, W) clips. R3D-18+RAFT shares the R3D-18 architecture but is trained on a different frame stream β RAFT optical-flow-interpolated crops that emphasise inter-frame motion.
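A minimal sketch of that 2D wrapper, shown for ResNet-18 (module names are illustrative; the reference implementation lives in the deepfake-detection-app backend):

```python
# Sketch of the 2D per-frame backbone + temporal mean-pool head described
# above. Wiring and names are illustrative, not the repo's actual classes.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class FrameMeanPoolDetector(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        feat_dim = backbone.fc.in_features            # 512 for ResNet-18
        backbone.fc = nn.Identity()                   # keep pooled features only
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, num_classes)  # 2-class head: real / fake

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, 3, H, W) -> run the backbone on every frame
        b, t, c, h, w = clips.shape
        feats = self.backbone(clips.reshape(b * t, c, h, w))  # (B*T, 512)
        feats = feats.reshape(b, t, -1).mean(dim=1)           # mean-pool over T
        return self.head(feats)                               # (B, 2) logits
```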
## Performance
### In-distribution (FaceForensics++ C23 test, 900 videos)
| Model | Accuracy | F1 | AUC |
|---|---|---|---|
| ResNet-18 | 0.9878 | 0.9926 | 0.9999 |
| EfficientNet-B4 | 1.0000 | 1.0000 | 1.0000 |
| R3D-18 | 0.9889 | 0.9933 | 0.9995 |
| ViT-B/16 | 0.9889 | 0.9933 | 0.9998 |
| R3D-18 + RAFT | 0.9878 | 0.9926 | 0.9996 |
| Ensemble (soft-vote, τ = 0.1) | 1.0000 | 1.0000 | 1.0000 |
### Cross-dataset zero-shot (Celeb-DF v2, 6,528 videos)
| Model | Celeb-DF AUC | Generalization gap (Δ_AUC) |
|---|---|---|
| EfficientNet-B4 | 0.8109 | 0.1891 |
| R3D-18 | 0.8383 | 0.1612 |
| ResNet-18 | 0.8427 | 0.1572 |
| ViT-B/16 | 0.8475 | 0.1523 |
| R3D-18 + RAFT | 0.8775 | 0.1221 |
| Ensemble (soft-vote, τ = 0.1) | 0.8851 | 0.1149 |
Here Δ_AUC is the drop from the in-distribution AUC above to the Celeb-DF AUC. The ensemble reduces the generalization gap by 26.9% relative to the ResNet-18 baseline (0.1572 → 0.1149) and beats every single model on Celeb-DF AUC.
## Training recipe
All five models were trained on a single Colab Pro NVIDIA L4 GPU with the same two-stage transfer-learning procedure (`src.training.train_two_stage` in the companion repo):
- Stage 1: frozen backbone, head only (10 epochs)
- Stage 2: full end-to-end fine-tune (10 epochs)
The best checkpoint (saved as `*_best.pth`) is the highest-validation-accuracy state across both stages combined, so an early stage-1 head can win if a stage-2 epoch overfits. Per-epoch checkpoints are also persisted for resumability against Colab disconnects.
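A condensed sketch of the staging logic, assuming a model with separate `backbone` and `head` attributes and a generic `train_epochs` helper (both illustrative names; the actual implementation is `src.training.train_two_stage`):

```python
# Condensed sketch of the two-stage schedule. `model.backbone`, `model.head`,
# and `train_epochs` (an epoch loop that tracks the best-val-acc checkpoint)
# are illustrative stand-ins for the companion repo's code.
import torch

def train_two_stage(model, train_epochs, stage1_lr=1e-3, stage2_lr=1e-4):
    # Stage 1: freeze the backbone, train only the classification head.
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(model.head.parameters(), lr=stage1_lr)
    train_epochs(model, opt, epochs=10)

    # Stage 2: unfreeze everything and fine-tune end-to-end at a lower LR.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=stage2_lr)
    train_epochs(model, opt, epochs=10)
```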
| Component | Setting |
|---|---|
| Optimizer | `torch.optim.Adam` (no weight decay) |
| Loss | `nn.CrossEntropyLoss` over 2 classes (real, fake) |
| LR scheduler | None (constant LR per stage) |
| Mixed precision | `torch.amp.GradScaler("cuda")` enabled on CUDA |
| Frames per clip | 16 (MTCNN-cropped, or RAFT-interpolated for R3D-18+RAFT) |
| Train/val/test split | Identity-disjoint, video-level, FF++ C23 |
| Random seed | 42 |
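Put together, one optimization step under these settings looks roughly like the following sketch (not the repo's exact loop):

```python
# Illustrative mixed-precision step matching the table above: Adam without
# weight decay, 2-class CrossEntropyLoss, torch.amp GradScaler on CUDA.
import torch
import torch.nn as nn

def make_train_step(model: nn.Module, lr: float = 1e-3, device: str = "cuda"):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # no weight decay
    scaler = torch.amp.GradScaler(device, enabled=(device == "cuda"))

    def step(clips: torch.Tensor, labels: torch.Tensor) -> float:
        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast(device_type=device, enabled=(device == "cuda")):
            loss = criterion(model(clips.to(device)), labels.to(device))
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        return loss.item()

    return step
```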
### Per-model hyperparameters
| Model | Batch | Stage-1 LR | Stage-2 LR | Train time |
|---|---|---|---|---|
| ResNet-18 | 16 | 1e-3 | 1e-4 | 79.3 min |
| EfficientNet-B4 | 8 | 1e-3 | 1e-4 | 106.7 min |
| R3D-18 | 8 | 1e-3 | 1e-4 | 174.6 min |
| R3D-18 + RAFT | 8 | 1e-3 | 1e-4 | 164.1 min |
| ViT-B/16 | 8 | 5e-4 | 5e-5 | 196.4 min |
ResNet-18 deliberately uses the A2 baseline recipe (normalize-only transforms and a plain shuffled DataLoader, with no augmentation and no weighted sampler) so it remains a clean reference point. The four advanced models add per-frame augmentation and class-weighted sampling, as sketched below.
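A sketch of that contrast, assuming a `train_ds` dataset that yields tensor frames and exposes integer `labels`; the specific augmentations shown are illustrative, not the repo's exact list:

```python
# Illustrative contrast between the A2 baseline recipe and the augmented
# recipe used by the other four models. `train_ds`, its `labels` attribute,
# and the augmentation choices are assumptions.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

IMAGENET_NORM = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

# Baseline (ResNet-18): normalize-only transform, plain shuffled loader.
baseline_tf = transforms.Compose([IMAGENET_NORM])

# Advanced models: per-frame augmentation on tensor frames, then normalize.
aug_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    IMAGENET_NORM,
])

def make_loaders(train_ds):
    baseline = DataLoader(train_ds, batch_size=16, shuffle=True)

    # Class-weighted sampling: the rarer class is drawn proportionally more.
    labels = torch.tensor(train_ds.labels)  # 0 = real, 1 = fake (assumed)
    weights = (1.0 / torch.bincount(labels).float())[labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(labels))
    weighted = DataLoader(train_ds, batch_size=8, sampler=sampler)
    return baseline, weighted
```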
## How to use
```python
import torch
from huggingface_hub import hf_hub_download

REPO = "abraraltaf92/deepfake-detection-models"

# 1. Download a checkpoint (cached). Pin a revision for reproducibility:
weights_path = hf_hub_download(
    repo_id=REPO,
    filename="resnet18_best.pth",
    revision="v2-2026-05-07",
)

# 2. Build the architecture (reference impl in the deepfake-detection-app backend).
model = build_resnet18_deepfake_detector()  # binary head: real / fake

# 3. Load the fine-tuned weights, wrapped as {"state_dict": ..., "meta": ...}.
data = torch.load(weights_path, map_location="cpu", weights_only=False)
model.load_state_dict(data["state_dict"])
print("best_val_acc:", data["meta"].get("best_val_acc"))
model.eval()
```
The full ensemble inference pipeline (MTCNN face extraction → RAFT optical flow → 5-model soft-vote → Grad-CAM) lives in the deepfake-detection-app backend.
## Decision threshold
The ensemble averages per-model fake-class probabilities and applies a calibrated decision threshold of τ = 0.1. This was chosen to maximise F1 on the Celeb-DF v2 cross-dataset eval, where high-capacity static-image models (EfficientNet-B4, ViT-B/16) tend to under-call fakes that are caught by the motion-aware models. A low threshold lets one or two confident "fake" votes outweigh several uncertain "real" votes.
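In code, the decision rule amounts to the following sketch (`models` holds the five loaded detectors and `clip` a preprocessed frame stack; for simplicity this ignores that the RAFT model consumes a different input stream):

```python
# Soft-vote decision rule described above: average the fake-class probability
# across the five models and compare against tau = 0.1. `models` and `clip`
# are assumed to come from the loading/preprocessing steps shown earlier; in
# practice the RAFT model would receive its own optical-flow clip.
import torch

TAU = 0.1

@torch.no_grad()
def ensemble_predict(models, clip: torch.Tensor) -> tuple[bool, float]:
    probs = [torch.softmax(m(clip.unsqueeze(0)), dim=1)[0, 1] for m in models]
    fake_prob = torch.stack(probs).mean().item()  # mean fake-class probability
    return fake_prob >= TAU, fake_prob            # (is_fake, score)
```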
## Limitations and intended use
- Trained on FaceForensics++ C23 only: primarily Deepfakes / Face2Face / FaceSwap / NeuralTextures manipulations on YouTube faces.
- Cross-dataset AUC of 0.885 on Celeb-DF v2 is strong but not perfect; expect false-positive and false-negative rates of roughly 10–15% on out-of-distribution clips.
- Faces only: the upstream pipeline assumes MTCNN can detect at least one face per frame. Animation, heavily occluded faces, and non-frontal poses are out of scope.
- Not a forensic tool. Use for research, demos, and educational content moderation. Do not rely on this for evidentiary or legal purposes.
- Bias considerations: FF++ skews towards Western, well-lit, frontal-pose faces. Performance on under-represented demographics has not been formally audited.
## Citation
If you use these weights, please cite:
```bibtex
@misc{zahid2026deepfake,
  title  = {Robust Deepfake Detection via Multi-View Anomaly Ensemble},
  author = {Zahid, Muhammad Abdullah and Lone, Abrar Altaf and Maniyar, Krishna Kirit and Shergill, Sajan Singh},
  year   = {2026},
  note   = {CS 668 Capstone, Pace University},
  url    = {https://github.com/abraraltaf92/deepfake-detection}
}
```
## License
MIT; see `LICENSE` in the deepfake-detection-app repository.
## Related links
- Live demo: deepfakedetection.xyz
- Source code: github.com/abraraltaf92/deepfake-detection
- Web app: github.com/abraraltaf92/deepfake-detection-app
- Datasets: FaceForensics++ · Celeb-DF v2