---
license: mit
language:
- en
tags:
- deepfake-detection
- video-classification
- clip
- computer-vision
- pytorch
- face-forensics
- identity-verification
- video-analysis
- insightface
- arcface
pipeline_tag: video-classification
datasets:
- godmodes/rtfs-10k
- hi-paris/FakeParts
- bitmind/FaceForensicsC23
metrics:
- roc_auc
- accuracy
- f1
- precision
- recall
model-index:
- name: VIPER v3
  results:
  - task:
      type: video-classification
      name: Deepfake Detection
    metrics:
    - name: AUC-ROC
      type: roc_auc
      value: 0.9909
    - name: Accuracy
      type: accuracy
      value: 0.952
    - name: F1 (Fake)
      type: f1
      value: 0.96
    - name: Precision (Fake)
      type: precision
      value: 0.948
    - name: Recall (Fake)
      type: recall
      value: 0.965
---

# VIPER: Video Identity Perturbation and Extraction Residual

<p align="center">
<b>Deepfake detection inspired by displacement reactions in chemistry.</b><br>
<i>A stronger identity signal displaces and exposes synthetic faces.</i>
</p>

<p align="center">
<a href="https://github.com/rxbinsingh/VIPER"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github" /></a>
<a href="https://huggingface.co/spaces/rxbinsingh/VIPER"><img src="https://img.shields.io/badge/🤗-Live%20Demo-green" /></a>
<a href="https://colab.research.google.com/github/rxbinsingh/VIPER/blob/main/notebooks/VIPER_Train_Colab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
</p>

---

![VIPER Hero](https://raw.githubusercontent.com/rxbinsingh/VIPER/main/assets/viper_hero.png)

---

## Core Idea

> *What if we could expose deepfakes the way chemistry exposes impurities?*

![Displacement Reaction](https://raw.githubusercontent.com/rxbinsingh/VIPER/main/assets/displacement_reaction.png)

```
AB + C → AC + B

AB = video frame (fake face B hidden inside context A)
C = identity anchor (biometric fingerprint from first 8 frames)
AC = anchor bonds with real context → LOW energy = REAL
B = fake face displaced/exposed → HIGH energy = FAKE
```

---

## Results

![Results Dashboard](https://raw.githubusercontent.com/rxbinsingh/VIPER/main/assets/results_dashboard.png)

| Metric | Value |
|:-------|------:|
| **AUC-ROC** | **0.9909** |
| **Accuracy** | **95.2%** |
| **Fake Recall** | **96.5%** |
| **False Positive Rate** | **6.3%** |
| Face-swap AUC | 0.9931 |
| Expression-swap AUC | 0.9847 |
| Inference speed | ~4s/video (GPU) |
| Training time | 25 min (T4) |
| Training data | 530 videos |
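
### Per-Manipulation-Type Detection

![Per-Manipulation Results](https://raw.githubusercontent.com/rxbinsingh/VIPER/main/assets/per_manipulation_results.png)

| Attack Type | AUC | Accuracy | N (test) |
|:------------|----:|---------:|--------:|
| Face swap (inswapper) | 0.9931 | 95.6% | 42 |
| Expression transfer (NeuralTextures) | 0.9847 | 93.7% | 15 |
| **All combined** | **0.9909** | **95.2%** | **105** |

N counts fake test videos for the two per-attack rows (42 + 15 = 57); the combined row also includes the 48 real test videos.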

### Model Progression

| Version | Backbone | Trainable Params | Test AUC |
|:--------|:---------|:----------------:|---------:|
| v1 | EfficientNet-B4 (frozen) | ~500K | 0.9072 |
| v2 | EfficientNet-B4 (unfrozen) | ~2.3M | 0.9309 |
| **v3** | **CLIP ViT-L/14 (frozen)** | **~500K** | **0.9909** |

---

## Architecture

![Architecture](https://raw.githubusercontent.com/rxbinsingh/VIPER/main/assets/architecture.png)

```
Video → InsightFace → 16 face crops (224×224)
          │
          ├── Identity Anchor → GIR + TFR + BCR → 16-dim features
          │
          ├── CLIP ViT-L/14 (frozen) → 768-dim video embedding
          │
          ▼
Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE
```

**Key design:** The CLIP backbone is entirely frozen; only the ~500K-parameter fusion MLP is trained. This is what makes 0.99 AUC reachable from just 530 videos.
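
As a quick sanity check on the ~500K figure, the trainable parameter count can be worked out by hand from the layer widths in the diagram. This is a minimal sketch that assumes the head is exactly the `Linear → BatchNorm1d → ReLU → Dropout` stack shown in the Usage section, with batch norm after every linear layer except the last. The helper names are illustrative and are not part of the VIPER code base.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features: int) -> int:
    """Learnable scale (gamma) and shift (beta) of a batch-norm layer."""
    return 2 * n_features


# Layer widths from the card: 768 CLIP dims + 16 hand-crafted dims = 784 inputs.
widths = [768 + 16, 512, 128, 1]

total = 0
for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
    total += linear_params(n_in, n_out)
    # Assumption: batch norm follows every linear layer except the output layer.
    if i < len(widths) - 2:
        total += batchnorm_params(n_out)

print(total)  # 468993
```

Under these assumptions the head has about 0.47M trainable parameters, which is consistent with the "~500K" quoted in this card.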

### Three Biometric Signals

| Signal | Method | Captures |
|:------:|:-------|:---------|
| **GIR** | ArcFace cosine distance | Skull geometry, eye spacing |
| **TFR** | DCT KL divergence | Skin micro-texture |
| **BCR** | dlib landmark coupling | Facial muscle dynamics |
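
To make the GIR row concrete, here is a dependency-free sketch of an identity residual based on cosine distance. It is not the author's implementation. It assumes the anchor is the mean of the L2-normalized embeddings of the first 8 frames (the card only says the anchor comes from the first 8 frames) and that every frame is then compared against that anchor. In practice the vectors would be ArcFace embeddings from InsightFace; the toy 3-dimensional vectors and all function names below are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a_n, b_n))


def identity_anchor(embeddings: Sequence[Vector], n_anchor: int = 8) -> List[float]:
    """Assumed anchor: mean of the normalized embeddings of the first frames."""
    head = [l2_normalize(e) for e in embeddings[:n_anchor]]
    if not head:
        raise ValueError("need at least one embedding")
    dim = len(head[0])
    return [sum(e[i] for e in head) / len(head) for i in range(dim)]


def identity_residuals(embeddings: Sequence[Vector], n_anchor: int = 8) -> List[float]:
    """Cosine distance of every frame embedding to the identity anchor."""
    anchor = identity_anchor(embeddings, n_anchor)
    return [cosine_distance(anchor, e) for e in embeddings]


# Toy example: 8 frames with a stable identity, then one frame that drifts.
frames = [[1.0, 0.0, 0.0]] * 8 + [[0.0, 1.0, 0.0]]
residuals = identity_residuals(frames)
print(residuals[0], residuals[-1])
```

A frame that matches the anchor gets a residual near 0, while a frame whose embedding points elsewhere gets a larger one. How these per-frame values are summarized into the 16-dimensional feature vector is not specified in this card.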

---

## Confusion Matrix

```
              Predicted Real   Predicted Fake
Actual Real         45                3
Actual Fake          2               55
```

Only **5 errors** out of 105 test videos: 3 false positives and 2 false negatives.
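
The headline metrics can be re-derived from these four counts. The short script below treats "fake" as the positive class and uses the standard definitions; the variable names are my own.

```python
# Counts from the confusion matrix, with "fake" as the positive class.
tn, fp = 45, 3   # actual real: predicted real / predicted fake
fn, tp = 2, 55   # actual fake: predicted real / predicted fake

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)          # fake recall
precision = tp / (tp + fp)       # fake precision
fpr = fp / (fp + tn)             # real videos flagged as fake
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.2%}")   # 95.24%
print(f"recall    = {recall:.2%}")     # 96.49%
print(f"precision = {precision:.2%}")  # 94.83%
print(f"FPR       = {fpr:.2%}")        # 6.25%
print(f"F1        = {f1:.2%}")         # 95.65%
```

These agree with the reported 95.2% accuracy, 96.5% fake recall, 0.948 precision and 0.96 F1. The false positive rate is exactly 6.25%, which the card rounds to 6.3%.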

---

## Usage

```python
import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load the CLIP ViT-L/14 backbone (OpenAI weights)
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()

# Model: frozen CLIP visual tower + trainable fusion MLP
class VIPERv3(nn.Module):
    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        for p in self.clip.parameters():
            p.requires_grad = False  # backbone stays frozen
        # 784 = 768-dim CLIP embedding + 16 hand-crafted features
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, 1))

    # NOTE: the original card does not show forward(). This version is an
    # ASSUMPTION: per-frame CLIP embeddings are mean-pooled into one video embedding.
    def forward(self, crops, hand_feats):
        b, t = crops.shape[:2]
        with torch.no_grad():
            emb = self.clip(crops.flatten(0, 1))            # (b*t, 768)
        emb = emb.reshape(b, t, -1).mean(dim=1)             # (b, 768)
        return self.head(torch.cat([emb, hand_feats], dim=1))  # (b, 1) logit

model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Input:  crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)
```

---

## Training Dataset

| Category | Count | Source | License |
|:---------|------:|:-------|:--------|
| Real | 250 | RTFS-10K | CC-BY-SA-4.0 |
| Face swap | 220 | RTFS-10K (inswapper) | CC-BY-SA-4.0 |
| Expression swap | 60 | FaceForensics++ | Academic |
| Full-body GAN | 50 | FakeParts | CC0-1.0 |
| **Total** | **580** | | |
| Usable | 530 | 91.4% success | |

---

## Training Configuration

| Parameter | Value |
|:----------|:------|
| Backbone | CLIP ViT-L/14 (OpenAI, frozen) |
| Classifier | MLP 784→512→128→1 |
| Optimizer | AdamW (lr=3e-4, wd=1e-3) |
| Scheduler | Cosine annealing, 15 epochs |
| Batch size | 8 |
| Loss | BCE with pos_weight=0.758 |
| TTA | Horizontal flip average |
| Hardware | NVIDIA T4 (16GB) |
| Training time | ~25 minutes |
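
To illustrate what "cosine annealing, 15 epochs" does to the learning rate, the sketch below evaluates the standard closed-form cosine schedule, the same formula PyTorch's `CosineAnnealingLR` follows. It assumes the scheduler is stepped once per epoch with `T_max=15` and a minimum learning rate of 0; the card does not state either detail, and the function name is my own.

```python
import math


def cosine_annealing_lr(epoch: int, base_lr: float = 3e-4,
                        t_max: int = 15, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Learning rate used during each of the 15 training epochs (0-indexed).
schedule = [cosine_annealing_lr(e) for e in range(15)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

Under these assumptions the rate starts at 3e-4, falls slowly at first, drops fastest around the middle of training, and ends a few percent above zero in the last epoch.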

---

## Limitations

- Full-body GAN videos cannot be scored because face detection fails on them.
- The analytical signals (GIR/TFR/BCR) are weak on their own against modern fakes.
- Evaluated on only 105 test videos; larger benchmarks are pending.
- Not tested against adversarial attacks on CLIP.

---

## Citation

```bibtex
@misc{singh2025viper,
  title  = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
  author = {Singh, Robin},
  year   = {2025},
  url    = {https://github.com/rxbinsingh/VIPER}
}
```

---

## Author

**Robin Singh** · Bennett University, India

[![GitHub](https://img.shields.io/badge/GitHub-rxbinsingh-black?logo=github)](https://github.com/rxbinsingh)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-rxbinsingh-yellow?logo=huggingface)](https://huggingface.co/rxbinsingh)
[![ResearchGate](https://img.shields.io/badge/ResearchGate-Robin_Singh-green?logo=researchgate)](https://www.researchgate.net/profile/Robin-Singh-61)