Deepfake Detection β€” Multi-View Anomaly Ensemble

A 5-model soft-vote ensemble for binary deepfake detection, trained on FaceForensics++ C23 and zero-shot evaluated on Celeb-DF v2. Published as the production weights for the live demo at deepfakedetection.xyz.

Current revision: v2-2026-05-07
Rollback revision: v1-pre-update

What's in this repo

Five fine-tuned PyTorch checkpoints, each saved as {"state_dict": ..., "meta": ...}:

File Architecture Pretraining Input
resnet18_best.pth ResNet-18 (2D) ImageNet-1k 16 frames @ 224Β² (MTCNN)
efficientnet_b4_best.pth EfficientNet-B4 (2D) ImageNet-1k 16 frames @ 224Β² (MTCNN)
r3d18_best.pth R3D-18 (3D) Kinetics-400 16 frames @ 112Β² (MTCNN)
vit_base_patch16_224_best.pth ViT-B/16 (2D) ImageNet-21k β†’ 1k 16 frames @ 224Β² (MTCNN)
r3d18_raft_best.pth R3D-18 (3D, same arch) Kinetics-400 16 frames @ 112Β² (RAFT)

Total ~691 MB. All five run on CPU; CUDA recommended for the 3D models.

Anomaly classes covered by the ensemble:

  • Spatial β€” resnet18, efficientnet_b4 (per-frame texture / generation artifacts)
  • Global consistency β€” vit_base_patch16_224 (long-range face structure)
  • Temporal-motion β€” r3d18 (native 3D conv over RGB clips)
  • Motion-flow β€” r3d18_raft (3D conv over RAFT optical-flow-interpolated clips)

The 2D backbones (ResNet-18, EfficientNet-B4, ViT-B/16) apply the backbone per frame and mean-pool features over the temporal dimension before a 2-class linear head. The 3D backbone (R3D-18) operates natively on (B, C, T, H, W) clips. R3D-18+RAFT shares the R3D-18 architecture but is trained on a different frame stream β€” RAFT optical-flow-interpolated crops that emphasise inter-frame motion.

Performance

In-distribution (FaceForensics++ C23 test, 900 videos)

Model Accuracy F1 AUC
ResNet-18 0.9878 0.9926 0.9999
EfficientNet-B4 1.0000 1.0000 1.0000
R3D-18 0.9889 0.9933 0.9995
ViT-B/16 0.9889 0.9933 0.9998
R3D-18 + RAFT 0.9878 0.9926 0.9996
Ensemble (soft-vote, Ο„=0.1) 1.0000 1.0000 1.0000

Cross-dataset zero-shot (Celeb-DF v2, 6,528 videos)

Model Celeb-DF AUC Generalization Gap (Ξ”_AUC)
EfficientNet-B4 0.8109 0.1891
R3D-18 0.8383 0.1612
ResNet-18 0.8427 0.1572
ViT-B/16 0.8475 0.1523
R3D-18 + RAFT 0.8775 0.1221
Ensemble (soft-vote, Ο„=0.1) 0.8851 0.1149

The ensemble cuts the generalization gap by 26.9% relative to the ResNet-18 baseline (0.1572 → 0.1149) and outperforms every individual model on Celeb-DF AUC.

Training recipe

All five models trained on a single Colab Pro NVIDIA L4 GPU with the same two-stage transfer-learning procedure (src.training.train_two_stage in the companion repo):

  1. Stage 1 β€” frozen backbone, head only (10 epochs)
  2. Stage 2 β€” full end-to-end fine-tune (10 epochs)

The best checkpoint (saved as *_best.pth) is the state with the highest validation accuracy across both stages combined, so an early stage-1 head can win if a stage-2 epoch overfits. Per-epoch checkpoints are also persisted so training can resume after Colab disconnects.
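The two-stage loop with cross-stage best-checkpoint tracking can be sketched like this (function signature and toy model are hypothetical; the real implementation is src.training.train_two_stage in the companion repo):

```python
import copy
import torch
import torch.nn as nn

def train_two_stage(model: nn.Module, run_epoch, epochs=(10, 10), lrs=(1e-3, 1e-4)):
    """Stage 1: frozen backbone, head only. Stage 2: full fine-tune.
    Keep the state with the best validation accuracy across BOTH stages."""
    best = {"val_acc": -1.0, "state_dict": None}
    for stage, (n_epochs, lr) in enumerate(zip(epochs, lrs), start=1):
        for p in model.backbone.parameters():
            p.requires_grad = stage == 2              # frozen in stage 1 only
        params = [p for p in model.parameters() if p.requires_grad]
        opt = torch.optim.Adam(params, lr=lr)         # no weight decay
        for _ in range(n_epochs):
            val_acc = run_epoch(model, opt)           # one train + val pass
            if val_acc > best["val_acc"]:             # a stage-1 head can win
                best = {"val_acc": val_acc,
                        "state_dict": copy.deepcopy(model.state_dict())}
    return best

# Toy model and epoch function just to exercise the loop
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.head = nn.Linear(4, 2)
    def forward(self, x):
        return self.head(self.backbone(x))

accs = iter([0.6, 0.9, 0.7, 0.8])
best = train_two_stage(Toy(), lambda m, o: next(accs), epochs=(2, 2))
print(best["val_acc"])  # 0.9 (an early epoch beats the later ones)
```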

Component Setting
Optimizer torch.optim.Adam (no weight decay)
Loss nn.CrossEntropyLoss over 2 classes (real, fake)
LR scheduler None (constant LR per stage)
Mixed precision torch.amp.GradScaler("cuda") enabled on CUDA
Frames per clip 16 (MTCNN-cropped, or RAFT-interpolated for R3D-18+RAFT)
Train/val/test split Identity-disjoint, video-level, FF++ C23
Random seed 42

Per-model hyperparameters

Model Batch Stage-1 LR Stage-2 LR Train time
ResNet-18 16 1e-3 1e-4 79.3 min
EfficientNet-B4 8 1e-3 1e-4 106.7 min
R3D-18 8 1e-3 1e-4 174.6 min
R3D-18 + RAFT 8 1e-3 1e-4 164.1 min
ViT-B/16 8 5e-4 5e-5 196.4 min

ResNet-18 deliberately uses the A2 baseline recipe β€” normalize-only transforms and a plain shuffle DataLoader (no augmentation, no weighted sampler) β€” so it remains a clean reference point. The four advanced models add per-frame augmentation and class-weighted sampling.
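Class-weighted sampling of the kind used by the four advanced models can be set up with PyTorch's WeightedRandomSampler; a minimal sketch on hypothetical toy data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical imbalanced labels: 0 = real, 1 = fake (3:1 ratio)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
class_counts = torch.bincount(labels).float()
weights = (1.0 / class_counts)[labels]  # rarer class drawn more often
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

ds = TensorDataset(torch.randn(8, 3, 224, 224), labels)
# A sampler replaces shuffle=True; batches are drawn ~class-balanced
loader = DataLoader(ds, batch_size=4, sampler=sampler)
print(len(loader))  # 2
```

The plain-shuffle ResNet-18 baseline simply uses `DataLoader(ds, batch_size=16, shuffle=True)` with normalize-only transforms.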

How to use

import torch
from huggingface_hub import hf_hub_download

REPO = "abraraltaf92/deepfake-detection-models"

# 1. Download a checkpoint (cached). Pin a revision for reproducibility:
weights_path = hf_hub_download(
    repo_id=REPO,
    filename="resnet18_best.pth",
    revision="v2-2026-05-07",
)

# 2. Build the architecture (reference impl in the deepfake-detection-app backend).
model = build_resnet18_deepfake_detector()  # binary head: real / fake

# 3. Load fine-tuned weights β€” wrapped {"state_dict": ..., "meta": ...}
data = torch.load(weights_path, map_location="cpu", weights_only=False)
model.load_state_dict(data["state_dict"])
print("best_val_acc:", data["meta"].get("best_val_acc"))
model.eval()

The full ensemble inference pipeline (MTCNN face extraction β†’ RAFT optical-flow β†’ 5-model soft-vote β†’ Grad-CAM) lives in the deepfake-detection-app backend.

Decision threshold

The ensemble averages per-model fake-class probabilities and applies a calibrated decision threshold of Ο„ = 0.1. This was chosen to maximise F1 on the Celeb-DF v2 cross-dataset eval, where high-capacity static-image models (EfficientNet-B4, ViT-B/16) tend to under-call fakes that are caught by the motion-aware models. A low threshold lets one or two confident "fake" votes outweigh several uncertain "real" votes.
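The averaging-and-threshold step amounts to a few lines (the soft_vote helper here is hypothetical; per-model logits are assumed to come from the five checkpoints above):

```python
import torch

def soft_vote(logits_per_model: list, tau: float = 0.1) -> torch.Tensor:
    """Average fake-class probabilities over models; flag fake if mean > tau."""
    probs = [l.softmax(dim=-1)[..., 1] for l in logits_per_model]  # P(fake)
    mean_fake = torch.stack(probs).mean(dim=0)
    return mean_fake > tau  # True -> "fake"

# Two confident "fake" votes outweigh three uncertain "real" votes at tau=0.1
logits = [torch.tensor([[-3.0,  3.0]]),   # ~0.998 fake
          torch.tensor([[-2.0,  2.0]]),   # ~0.982 fake
          torch.tensor([[ 1.0, -1.0]]),   # ~0.12 fake
          torch.tensor([[ 1.0, -1.0]]),
          torch.tensor([[ 1.0, -1.0]])]
print(soft_vote(logits))  # tensor([True]); mean P(fake) ~ 0.47 > 0.1
```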

Limitations and intended use

  • Trained on FaceForensics++ C23 only β€” primarily Deepfakes / Face2Face / FaceSwap / NeuralTextures manipulations on YouTube faces.
  • Cross-dataset AUC of 0.885 on Celeb-DF v2 is strong but not perfect; expect false-positive and false-negative rates ~10–15% on out-of-distribution clips.
  • Faces only β€” the upstream pipeline assumes MTCNN can detect at least one face per frame. Animation, heavily-occluded faces, or non-frontal poses are out of scope.
  • Not a forensic tool. Use for research, demos, and educational content moderation. Do not rely on this for evidentiary or legal purposes.
  • Bias considerations: FF++ skews towards Western, well-lit, frontal-pose faces. Performance on under-represented demographics has not been formally audited.

Citation

If you use these weights, please cite:

@misc{zahid2026deepfake,
  title  = {Robust Deepfake Detection via Multi-View Anomaly Ensemble},
  author = {Zahid, Muhammad Abdullah and Lone, Abrar Altaf and Maniyar, Krishna Kirit and Shergill, Sajan Singh},
  year   = {2026},
  note   = {CS 668 Capstone, Pace University},
  url    = {https://github.com/abraraltaf92/deepfake-detection}
}

License

MIT β€” see LICENSE in the deepfake-detection-app repository.
