---
license: mit
language:
- en
tags:
- deepfake-detection
- video-classification
- clip
- computer-vision
- pytorch
- face-forensics
- identity-verification
- video-analysis
- insightface
- arcface
pipeline_tag: video-classification
datasets:
- godmodes/rtfs-10k
- hi-paris/FakeParts
- bitmind/FaceForensicsC23
metrics:
- roc_auc
- accuracy
- f1
- precision
- recall
model-index:
- name: VIPER v3
  results:
  - task:
      type: video-classification
      name: Deepfake Detection
    metrics:
    - name: AUC-ROC
      type: roc_auc
      value: 0.9909
    - name: Accuracy
      type: accuracy
      value: 0.952
    - name: F1 (Fake)
      type: f1
      value: 0.96
    - name: Precision (Fake)
      type: precision
      value: 0.948
    - name: Recall (Fake)
      type: recall
      value: 0.965
---

# VIPER: Video Identity Perturbation and Extraction Residual

Deepfake detection inspired by displacement reactions in chemistry.
A stronger identity signal displaces and exposes synthetic faces.

---

![VIPER Banner](assets/Viper_main1.png)

---

## Core Idea

> *What if we could expose deepfakes the way chemistry exposes impurities?*

![Displacement Reaction](assets/Displacement_reaction.png)

```
AB + C  →  AC + B

AB = video frame (fake face B hidden inside context A)
C  = identity anchor (biometric fingerprint from first 8 frames)
AC = anchor bonds with real context → LOW energy  = REAL
B  = fake face displaced/exposed    → HIGH energy = FAKE
```

---

## Results

![Results](assets/VIPER_Results1.png)

| Metric | Value |
|:-------|------:|
| **AUC-ROC** | **0.9909** |
| **Accuracy** | **95.2%** |
| **Fake Recall** | **96.5%** |
| **False Positive Rate** | **6.3%** |
| Face-swap AUC | 0.9931 |
| Expression-swap AUC | 0.9847 |
| Inference speed | ~4s/video (GPU) |
| Training time | 25 min (T4) |
| Training data | 530 videos |

### Per-Manipulation-Type Detection

![Multiple Types](assets/multiple_types.png)

| Attack Type | AUC | Accuracy | N (test) |
|:------------|----:|---------:|--------:|
| Face swap (inswapper) | 0.9931 | 95.6% | 42 |
| Expression transfer (NeuralTextures) | 0.9847 | 93.7% | 15 |
| **All combined** | **0.9909** | **95.2%** | **105** |

### Model Progression

| Version | Backbone | Trainable Params | Test AUC |
|:--------|:---------|:----------------:|---------:|
| v1 | EfficientNet-B4 (frozen) | ~500K | 0.9072 |
| v2 | EfficientNet-B4 (unfrozen) | ~2.3M | 0.9309 |
| **v3** | **CLIP ViT-L/14 (frozen)** | **~500K** | **0.9909** |

---

## Architecture

![Architecture](assets/Viper_Architecture.png)

```
Video → InsightFace → 16 face crops (224×224)
          │
          ├── Identity Anchor → GIR + TFR + BCR → 16-dim features
          │
          └── CLIP ViT-L/14 (frozen) → 768-dim video embedding
                    │
                    ▼
          Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE
```

**Key design:** CLIP backbone entirely frozen. Only 500K-parameter MLP trains. Enables 0.99 AUC from just 530 videos.

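
The diagram leaves the wiring between the two branches implicit. The following is a minimal sketch of one way they could fit together, not the released implementation. It assumes that per-frame embeddings are mean-pooled into the 768-dim video embedding, that the 16-dim hand-crafted features are concatenated to it, and that TTA averages probabilities (rather than logits) over the original and horizontally flipped crops. `FusionSketch`, `predict_with_flip_tta`, and the toy encoder are hypothetical stand-ins so the example runs without downloading CLIP weights.

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Hypothetical forward pass: frozen frame encoder + hand-crafted features -> logit."""

    def __init__(self, encoder, embed_dim=768, feat_dim=16, dropout=0.4):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # backbone stays frozen; only the head trains
        self.head = nn.Sequential(
            nn.Linear(embed_dim + feat_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, 1),
        )

    def forward(self, crops, hand_feats):
        b, t = crops.shape[:2]                                 # (B, T, 3, H, W)
        with torch.no_grad():
            frame_emb = self.encoder(crops.flatten(0, 1))      # (B*T, embed_dim)
        video_emb = frame_emb.reshape(b, t, -1).mean(dim=1)    # ASSUMPTION: mean pooling over frames
        fused = torch.cat([video_emb, hand_feats], dim=1)      # (B, 768 + 16 = 784)
        return self.head(fused).squeeze(1)                     # (B,) logits


@torch.no_grad()
def predict_with_flip_tta(model, crops, hand_feats):
    """Average P(fake) over the original and horizontally flipped crops."""
    model.eval()
    p = torch.sigmoid(model(crops, hand_feats))
    p_flip = torch.sigmoid(model(torch.flip(crops, dims=[-1]), hand_feats))
    return 0.5 * (p + p_flip)


# Toy encoder (NOT CLIP): maps each frame to a 768-dim vector so shapes match the diagram
toy_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 768))
model = FusionSketch(toy_encoder)

crops = torch.rand(2, 16, 3, 224, 224)   # 2 videos, 16 face crops each
hand_feats = torch.rand(2, 16)           # 16-dim GIR/TFR/BCR features per video
p_fake = predict_with_flip_tta(model, crops, hand_feats)

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(p_fake.shape, n_trainable)
```

With these layer sizes the head has 468,993 trainable parameters (including the batch-norm affine terms), which is consistent with the "~500K" figure quoted above. Swapping the toy encoder for `clip_model.visual` from `open_clip` would give the real 768-dim frame embeddings.
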
### Three Biometric Signals

| Signal | Method | Captures |
|:------:|:-------|:---------|
| **GIR** | ArcFace cosine distance | Skull geometry, eye spacing |
| **TFR** | DCT KL divergence | Skin micro-texture |
| **BCR** | dlib landmark coupling | Facial muscle dynamics |

---

## Confusion Matrix

```
                Predicted Real   Predicted Fake
Actual Real          45                3
Actual Fake           2               55
```

Only **5 errors** out of 105 test videos.

---

## Usage

```python
import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load CLIP (only the visual tower is used below)
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()

# Model: frozen CLIP visual encoder + trainable fusion MLP head
class VIPERv3(nn.Module):
    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        for p in self.clip.parameters():
            p.requires_grad = False  # keep the backbone frozen
        # 784 = 768-dim CLIP embedding + 16-dim hand-crafted features
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, 1))

model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Expected interface (the forward pass is not shown in this snippet):
# Input:  crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)
```

---

## Training Dataset

| Category | Count | Source | License |
|:---------|------:|:-------|:--------|
| Real | 250 | RTFS-10K | CC-BY-SA-4.0 |
| Face swap | 220 | RTFS-10K (inswapper) | CC-BY-SA-4.0 |
| Expression swap | 60 | FaceForensics++ | Academic |
| Full-body GAN | 50 | FakeParts | CC0-1.0 |
| **Total** | **580** | | |
| Usable | 530 | 91.4% success | |

---

## Training Configuration

| Parameter | Value |
|:----------|:------|
| Backbone | CLIP ViT-L/14 (OpenAI, frozen) |
| Classifier | MLP 784→512→128→1 |
| Optimizer | AdamW (lr=3e-4, wd=1e-3) |
| Scheduler | Cosine annealing, 15 epochs |
| Batch size | 8 |
| Loss | BCE with pos_weight=0.758 |
| TTA | Horizontal flip average |
| Hardware | NVIDIA T4 (16GB) |
| Training time | ~25 minutes |

---

## Limitations

- Full-body GAN videos are not detectable (face detection fails)
- Analytical signals (GIR/TFR/BCR) are weak on their own against modern fakes
- Evaluated on only 105 test videos; larger benchmarks are pending
- Not tested against adversarial attacks on CLIP

---

## Citation

```bibtex
@misc{singh2025viper,
  title  = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
  author = {Singh, Robin},
  year   = {2025},
  url    = {https://github.com/rxbinsingh/VIPER}
}
```

---

## Author

**Robin Singh** · Bennett University, India

[![GitHub](https://img.shields.io/badge/GitHub-rxbinsingh-black?logo=github)](https://github.com/rxbinsingh)
[![HuggingFace](https://img.shields.io/badge/🤗-rxbinsingh-FFD21E)](https://huggingface.co/rxbinsingh)
[![ResearchGate](https://img.shields.io/badge/ResearchGate-Robin--Singh--61-00CCBB?logo=researchgate)](https://www.researchgate.net/profile/Robin-Singh-61)