metadata
license: mit
language:
- en
tags:
- deepfake-detection
- video-classification
- clip
- computer-vision
- pytorch
- face-forensics
- identity-verification
- video-analysis
- insightface
- arcface
pipeline_tag: video-classification
datasets:
- godmodes/rtfs-10k
- hi-paris/FakeParts
- bitmind/FaceForensicsC23
metrics:
- roc_auc
- accuracy
- f1
- precision
- recall
model-index:
- name: VIPER v3
  results:
  - task:
      type: video-classification
      name: Deepfake Detection
    metrics:
    - name: AUC-ROC
      type: roc_auc
      value: 0.9909
    - name: Accuracy
      type: accuracy
      value: 0.952
    - name: F1 (Fake)
      type: f1
      value: 0.96
    - name: Precision (Fake)
      type: precision
      value: 0.948
    - name: Recall (Fake)
      type: recall
      value: 0.965
VIPER: Video Identity Perturbation and Extraction Residual
Deepfake detection inspired by displacement reactions in chemistry.
A stronger identity signal displaces and exposes synthetic faces.
Core Idea
What if we could expose deepfakes the way chemistry exposes impurities?
AB + C → AC + B
AB = video frame (fake face B hidden inside context A)
C = identity anchor (biometric fingerprint from first 8 frames)
AC = anchor bonds with real context → LOW energy = REAL
B = fake face displaced/exposed → HIGH energy = FAKE
Results
| Metric | Value |
|---|---|
| AUC-ROC | 0.9909 |
| Accuracy | 95.2% |
| Fake Recall | 96.5% |
| False Positive Rate | 6.3% |
| Face-swap AUC | 0.9931 |
| Expression-swap AUC | 0.9847 |
| Inference speed | ~4s/video (GPU) |
| Training time | 25 min (T4) |
| Training data | 530 videos |
Per-Manipulation-Type Detection
| Attack Type | AUC | Accuracy | N (test) |
|---|---|---|---|
| Face swap (inswapper) | 0.9931 | 95.6% | 42 |
| Expression transfer (NeuralTextures) | 0.9847 | 93.7% | 15 |
| All combined | 0.9909 | 95.2% | 105 |
Model Progression
| Version | Backbone | Trainable Params | Test AUC |
|---|---|---|---|
| v1 | EfficientNet-B4 (frozen) | ~500K | 0.9072 |
| v2 | EfficientNet-B4 (unfrozen) | ~2.3M | 0.9309 |
| v3 | CLIP ViT-L/14 (frozen) | ~500K | 0.9909 |
Architecture
Video → InsightFace → 16 face crops (224×224)
│
├── Identity Anchor → GIR + TFR + BCR → 16-dim features
│
├── CLIP ViT-L/14 (frozen) → 768-dim video embedding
│
▼
Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE
Key design: the CLIP backbone is entirely frozen, and only the ~500K-parameter fusion MLP is trained. This enables 0.99 AUC from just 530 videos.
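The ~500K figure can be sanity-checked from the layer sizes of the fusion MLP. The sketch below counts weights and biases for each linear layer and assumes a BatchNorm layer (learnable scale and shift) after each hidden layer, as in the head defined in the Usage section. The helper names are illustrative only.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features: int) -> int:
    """Learnable scale and shift of a BatchNorm1d layer."""
    return 2 * n_features


# 768-dim CLIP embedding + 16 hand-crafted features = 784 inputs.
layer_sizes = [768 + 16, 512, 128, 1]

total = 0
for i, (n_in, n_out) in enumerate(zip(layer_sizes, layer_sizes[1:])):
    total += linear_params(n_in, n_out)
    # Assumption: BatchNorm follows every layer except the final logit layer.
    if i < len(layer_sizes) - 2:
        total += batchnorm_params(n_out)

print(f"Trainable head parameters: {total:,}")
```

Under these assumptions the head comes to roughly 469K parameters, in line with the rounded ~500K quoted above.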
Three Biometric Signals
| Signal | Method | Captures |
|---|---|---|
| GIR | ArcFace cosine distance | Skull geometry, eye spacing |
| TFR | DCT KL divergence | Skin micro-texture |
| BCR | dlib landmark coupling | Facial muscle dynamics |
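As a rough illustration of the GIR signal, the sketch below builds an identity anchor by averaging the embeddings of the first 8 frames and then measures the cosine distance of every frame to that anchor. The embeddings are assumed to come from an ArcFace model. The function names, the use of a plain mean as the anchor, and the toy vectors are assumptions for illustration, not the author's implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity (assumes both vectors are non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identity_residuals(embeddings: Sequence[Vector], anchor_frames: int = 8) -> List[float]:
    """Distance of each frame embedding to the anchor built from the first frames."""
    anchor = mean_vector(embeddings[:anchor_frames])
    return [cosine_distance(e, anchor) for e in embeddings]


# Toy 3-dim "embeddings": 8 consistent frames, then 8 that point elsewhere.
frames = [[1.0, 0.0, 0.0]] * 8 + [[0.0, 1.0, 0.0]] * 8
residuals = identity_residuals(frames)
print(residuals[0], residuals[-1])
```

In this toy case the anchor frames sit at distance 0 from the anchor while the later frames sit at distance 1, which is the kind of jump an identity mismatch would be expected to produce.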
Confusion Matrix
Predicted Real Predicted Fake
Actual Real 45 3
Actual Fake 2 55
Only 5 errors out of 105 test videos: 3 real videos flagged as fake and 2 fakes missed.
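The headline metrics follow directly from these four counts. A short check, treating "fake" as the positive class (variable names are illustrative):

```python
# Counts from the confusion matrix, with "fake" as the positive class.
tn, fp = 45, 3   # actual real: predicted real, predicted fake
fn, tp = 2, 55   # actual fake: predicted real, predicted fake

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

print(f"accuracy  {accuracy:.3f}")             # 0.952
print(f"precision {precision:.3f}")            # 0.948
print(f"recall    {recall:.3f}")               # 0.965
print(f"f1        {f1:.3f}")                   # 0.957
print(f"fpr       {false_positive_rate:.4f}")  # 0.0625
```

The F1 of 0.957 and the false positive rate of 0.0625 correspond to the 0.96 and 6.3% reported in the results after rounding.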
Usage
import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hugging Face Hub
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load the CLIP ViT-L/14 backbone (OpenAI weights)
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()

# Model: frozen CLIP visual encoder plus a trainable fusion head
class VIPERv3(nn.Module):
    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        for p in self.clip.parameters():
            p.requires_grad = False  # keep the backbone frozen
        # 784 inputs = 768-dim CLIP embedding + 16 hand-crafted features
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, 1))

model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Input: crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)
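
The snippet above defines the layers but not how the two inputs are combined. One plausible forward pass, sketched here under stated assumptions, encodes each of the 16 crops with the frozen backbone, mean-pools the frame embeddings into a single 768-dim vector, concatenates the 16 hand-crafted features, and feeds the 784-dim result to the head. Mean pooling over frames, averaging probabilities (rather than logits) for the flip TTA, the `fuse_and_classify` and `predict_with_tta` helpers, and the tiny stand-in backbone and simplified head (used so the sketch runs without downloading CLIP) are all assumptions; the released checkpoint may do this differently.

```python
import torch
import torch.nn as nn


def fuse_and_classify(backbone: nn.Module, head: nn.Module,
                      crops: torch.Tensor, hand_feats: torch.Tensor) -> torch.Tensor:
    """crops: (B, T, 3, H, W); hand_feats: (B, 16). Returns logits of shape (B,)."""
    b, t = crops.shape[:2]
    with torch.no_grad():  # the backbone stays frozen
        frame_emb = backbone(crops.flatten(0, 1))       # (B*T, 768)
    video_emb = frame_emb.view(b, t, -1).mean(dim=1)    # assumed: mean pooling over frames
    fused = torch.cat([video_emb, hand_feats], dim=1)   # (B, 784)
    return head(fused).squeeze(1)


def predict_with_tta(backbone: nn.Module, head: nn.Module,
                     crops: torch.Tensor, hand_feats: torch.Tensor) -> torch.Tensor:
    """Average P(fake) over the original and horizontally flipped crops."""
    flipped = torch.flip(crops, dims=[-1])  # flip along the width axis
    logits = torch.stack([
        fuse_and_classify(backbone, head, crops, hand_feats),
        fuse_and_classify(backbone, head, flipped, hand_feats),
    ])
    return torch.sigmoid(logits).mean(dim=0)


# Hypothetical stand-in for clip_model.visual: maps images to 768-dim vectors.
backbone = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 768)).eval()
# Simplified head with the same 784 -> 512 -> 128 -> 1 shape (no BatchNorm or Dropout).
head = nn.Sequential(nn.Linear(784, 512), nn.ReLU(),
                     nn.Linear(512, 128), nn.ReLU(),
                     nn.Linear(128, 1)).eval()

crops = torch.rand(1, 16, 3, 224, 224)
hand_feats = torch.rand(1, 16)
p_fake = predict_with_tta(backbone, head, crops, hand_feats)
print(p_fake.shape)
```

With the real model, `backbone` would be `model.clip` and `head` would be `model.head`, provided the checkpoint was trained with the same pooling.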
Training Dataset
| Category | Count | Source | License |
|---|---|---|---|
| Real | 250 | RTFS-10K | CC-BY-SA-4.0 |
| Face swap | 220 | RTFS-10K (inswapper) | CC-BY-SA-4.0 |
| Expression swap | 60 | FaceForensics++ | Academic |
| Full-body GAN | 50 | FakeParts | CC0-1.0 |
| Total | 580 | | |
| Usable | 530 | 91.4% success | |
Training Configuration
| Parameter | Value |
|---|---|
| Backbone | CLIP ViT-L/14 (OpenAI, frozen) |
| Classifier | MLP 784 → 512 → 128 → 1 |
| Optimizer | AdamW (lr=3e-4, wd=1e-3) |
| Scheduler | Cosine annealing, 15 epochs |
| Batch size | 8 |
| Loss | BCE with pos_weight=0.758 |
| TTA | Horizontal flip average |
| Hardware | NVIDIA T4 (16GB) |
| Training time | ~25 minutes |
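
The configuration above maps onto standard PyTorch components roughly as follows. This is a sketch, not the author's training script: the simplified stand-in head, the single dummy batch, the use of `BCEWithLogitsLoss` (the PyTorch BCE variant that accepts `pos_weight`), and stepping the scheduler once per epoch are assumptions.

```python
import torch
import torch.nn as nn

EPOCHS = 15
BATCH_SIZE = 8

# Stand-in for the trainable fusion head (the frozen CLIP backbone is excluded).
head = nn.Sequential(nn.Linear(784, 512), nn.ReLU(),
                     nn.Linear(512, 128), nn.ReLU(),
                     nn.Linear(128, 1))

optimizer = torch.optim.AdamW(head.parameters(), lr=3e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
# pos_weight < 1 down-weights the positive (fake) class in the loss.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.758]))

# One dummy batch of fused 784-dim features and binary labels (1 = fake).
features = torch.randn(BATCH_SIZE, 784)
labels = torch.randint(0, 2, (BATCH_SIZE,)).float()

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    loss = criterion(head(features).squeeze(1), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # assumed: one scheduler step per epoch

final_lr = scheduler.get_last_lr()[0]
print(f"final loss {loss.item():.4f}, final lr {final_lr:.2e}")
```

In a real run the inner step would iterate over batches from a data loader, and only the head's parameters would be passed to the optimizer, as here.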
Limitations
- Full-body GAN videos not detectable (face detection fails)
- Analytical signals (GIR/TFR/BCR) independently weak on modern fakes
- Evaluated on 105 test videos; larger benchmarks pending
- Not tested against adversarial attacks on CLIP
Citation
@misc{singh2025viper,
title = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
author = {Singh, Robin},
year = {2025},
url = {https://github.com/rxbinsingh/VIPER}
}
Author
Robin Singh · Bennett University, India




