---
license: mit
language:
  - en
tags:
  - deepfake-detection
  - video-classification
  - clip
  - computer-vision
  - pytorch
  - face-forensics
  - identity-verification
  - video-analysis
  - insightface
  - arcface
pipeline_tag: video-classification
datasets:
  - godmodes/rtfs-10k
  - hi-paris/FakeParts
  - bitmind/FaceForensicsC23
metrics:
  - roc_auc
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: VIPER v3
    results:
      - task:
          type: video-classification
          name: Deepfake Detection
        metrics:
          - name: AUC-ROC
            type: roc_auc
            value: 0.9909
          - name: Accuracy
            type: accuracy
            value: 0.952
          - name: F1 (Fake)
            type: f1
            value: 0.96
          - name: Precision (Fake)
            type: precision
            value: 0.948
          - name: Recall (Fake)
            type: recall
            value: 0.965
---

# VIPER: Video Identity Perturbation and Extraction Residual

<p align="center">
  <b>Deepfake detection inspired by displacement reactions in chemistry.</b><br>
  <i>A stronger identity signal displaces and exposes synthetic faces.</i>
</p>

<p align="center">
  <a href="https://github.com/rxbinsingh/VIPER"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github" /></a>
  <a href="https://huggingface.co/spaces/rxbinsingh/VIPER"><img src="https://img.shields.io/badge/🤗-Live%20Demo-green" /></a>
  <a href="https://colab.research.google.com/github/rxbinsingh/VIPER/blob/main/notebooks/VIPER_Train_Colab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
</p>

---

![VIPER Banner](assets/Viper_main1.png)

---

## Core Idea

> *What if we could expose deepfakes the way chemistry exposes impurities?*

![Displacement Reaction](assets/Displacement_reaction.png)

```
AB + C → AC + B

AB = video frame (fake face B hidden inside context A)
C  = identity anchor (biometric fingerprint from first 8 frames)
AC = anchor bonds with real context → LOW energy = REAL
B  = fake face displaced/exposed   → HIGH energy = FAKE
```

---

## Results

![Results](assets/VIPER_Results1.png)

| Metric | Value |
|:-------|------:|
| **AUC-ROC** | **0.9909** |
| **Accuracy** | **95.2%** |
| **Fake Recall** | **96.5%** |
| **False Positive Rate** | **6.3%** |
| Face-swap AUC | 0.9931 |
| Expression-swap AUC | 0.9847 |
| Inference speed | ~4 s/video (GPU) |
| Training time | ~25 min (T4) |
| Training data | 530 videos |

### Per-Manipulation-Type Detection

![Multiple Types](assets/multiple_types.png)

| Attack Type | AUC | Accuracy | N (test) |
|:------------|----:|---------:|--------:|
| Face swap (inswapper) | 0.9931 | 95.6% | 42 |
| Expression transfer (NeuralTextures) | 0.9847 | 93.7% | 15 |
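| **All combined** | **0.9909** | **95.2%** | **105** |

`N (test)` in the per-type rows counts fake videos only (42 + 15 = 57); the combined row counts all 105 test videos. The per-type accuracies are consistent with each attack type being scored together with the 48 real test videos: 86/90 = 95.6% and 59/63 = 93.7%.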

### Model Progression

| Version | Backbone | Trainable Params | Test AUC |
|:--------|:---------|:----------------:|---------:|
| v1 | EfficientNet-B4 (frozen) | ~500K | 0.9072 |
| v2 | EfficientNet-B4 (unfrozen) | ~2.3M | 0.9309 |
| **v3** | **CLIP ViT-L/14 (frozen)** | **~500K** | **0.9909** |

---

## Architecture

![Architecture](assets/Viper_Architecture.png)

```
Video → InsightFace → 16 face crops (224×224)
         │
         ├── Identity Anchor → GIR + TFR + BCR → 16-dim features
         │
         └── CLIP ViT-L/14 (frozen) → 768-dim video embedding
                   │
                   ▼
         Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE
```

**Key design:** The CLIP backbone is entirely frozen; only the ~500K-parameter fusion MLP is trained. This keeps the trainable part small enough to reach 0.99 AUC from just 530 videos.
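
The trainable-parameter figure can be checked from the layer widths alone. A minimal sketch, assuming each `Linear` layer has a bias and each hidden layer is followed by a `BatchNorm1d` with a learnable scale and shift (the layout used in the Usage section); the helper names are illustrative:

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n):
    """Learnable scale and shift of a BatchNorm1d layer."""
    return 2 * n

CLIP_DIM, BIO_DIM = 768, 16                  # video embedding + biometric features
widths = [CLIP_DIM + BIO_DIM, 512, 128, 1]   # 784 -> 512 -> 128 -> 1

total = 0
for i, (n_in, n_out) in enumerate(zip(widths, widths[1:])):
    total += linear_params(n_in, n_out)
    if i < len(widths) - 2:                  # hidden layers only (assumed to carry BatchNorm)
        total += batchnorm_params(n_out)

print(total)  # 468993
```

Under these assumptions the head has about 469K trainable parameters, in line with the rounded ~500K quoted here.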

### Three Biometric Signals

| Signal | Method | Captures |
|:------:|:-------|:---------|
| **GIR** | ArcFace cosine distance | Skull geometry, eye spacing |
| **TFR** | DCT KL divergence | Skin micro-texture |
| **BCR** | dlib landmark coupling | Facial muscle dynamics |

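As one illustration of how an identity anchor can yield a per-frame signal, the GIR row can be sketched as the cosine distance between each frame's face embedding and an anchor averaged over the first 8 frames. This is a hedged sketch, not the repository's implementation: it assumes ArcFace embeddings have already been extracted (for example with InsightFace), uses plain Python lists with toy 3-dim vectors, and the function names and the mean-anchor choice are assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def identity_residuals(frame_embeddings, anchor_frames=8):
    """Distance of every frame embedding to an anchor built from the first frames."""
    head = [l2_normalize(e) for e in frame_embeddings[:anchor_frames]]
    dim = len(head[0])
    anchor = [sum(e[i] for e in head) / len(head) for i in range(dim)]
    return [cosine_distance(anchor, e) for e in frame_embeddings]

# Toy data: 8 frames with a consistent identity, then 8 frames that drift away
frames = [[1.0, 0.0, 0.0]] * 8 + [[0.0, 1.0, 0.0]] * 8
res = identity_residuals(frames)
print(res[0], res[8])  # small residual for anchor-like frames, large for drifted ones
```
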
---

## Confusion Matrix

```
                 Predicted Real    Predicted Fake
Actual Real           45                3
Actual Fake            2               55
```

Only **5 errors** (3 false positives, 2 false negatives) out of 105 test videos.
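
The headline metrics follow directly from the four cells of the confusion matrix. A short worked check in plain Python, treating "fake" as the positive class (variable names are illustrative):

```python
# Cells of the confusion matrix above, with "fake" as the positive class
tn, fp = 45, 3   # actual real: predicted real / predicted fake
fn, tp = 2, 55   # actual fake: predicted real / predicted fake

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # fake recall
precision = tp / (tp + fp)     # fake precision
fpr = fp / (fp + tn)           # real videos flagged as fake
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}")    # accuracy=0.9524
print(f"recall={recall:.4f}")        # recall=0.9649
print(f"precision={precision:.4f}")  # precision=0.9483
print(f"fpr={fpr:.4f}")              # fpr=0.0625
print(f"f1={f1:.4f}")                # f1=0.9565
```

These values agree with the reported 95.2% accuracy, 96.5% fake recall, 0.948 precision, 6.3% false positive rate, and 0.96 F1 after rounding.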

---

## Usage

```python
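import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load CLIP ViT-L/14 (OpenAI weights); only the visual tower is used below
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()

# Model: frozen CLIP visual encoder + fusion MLP over [768-dim CLIP | 16-dim biometric]
class VIPERv3(nn.Module):
    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        for p in self.clip.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, 1))

    def forward(self, crops, hand_feats):
        # ASSUMPTION: this card does not show forward(). This sketch mean-pools the
        # per-frame CLIP embeddings into one 768-dim video embedding; the training
        # code may pool differently, so check the GitHub repo before relying on it.
        b, t = crops.shape[:2]
        with torch.no_grad():
            emb = self.clip(crops.flatten(0, 1)).float()       # (B*T, 768)
        video_emb = emb.reshape(b, t, -1).mean(dim=1)          # (B, 768)
        return self.head(torch.cat([video_emb, hand_feats], dim=1)).squeeze(1)

model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Input: crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)
crops, hand_feats = torch.zeros(1, 16, 3, 224, 224), torch.zeros(1, 16)  # placeholder tensors
p_fake = torch.sigmoid(model(crops, hand_feats))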
```
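
The snippet above stops at the logit. As a rough sketch of the remaining step, the horizontal-flip TTA listed in the training configuration can be written as an average over the original and mirrored crops. Everything here is illustrative: `predict_logit` stands in for the model call, crops are plain nested lists rather than tensors, and both the 0.5 threshold and the choice to average probabilities (rather than logits) are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hflip(crops):
    """Mirror every pixel row of a [frame][channel][height][width] nested list."""
    return [[[row[::-1] for row in channel] for channel in frame] for frame in crops]

def predict_with_tta(predict_logit, crops, threshold=0.5):
    """Average P(fake) over the original and horizontally flipped crops."""
    probs = [sigmoid(predict_logit(view)) for view in (crops, hflip(crops))]
    p_fake = sum(probs) / len(probs)
    return p_fake, ("FAKE" if p_fake >= threshold else "REAL")

# Hypothetical stand-in for the model: returns the left-most pixel of the first row
def toy_logit(crops):
    return crops[0][0][0][0]

crops = [[[[3.0, -1.0]]]]  # 1 frame, 1 channel, 1x2 pixels
p_fake, label = predict_with_tta(toy_logit, crops)
print(round(p_fake, 4), label)
```

In practice `predict_logit` would wrap the model together with the biometric features, and the flip would be applied along the width axis of the crop tensor.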

---

## Training Dataset

| Category | Count | Source | License |
|:---------|------:|:-------|:--------|
| Real | 250 | RTFS-10K | CC-BY-SA-4.0 |
| Face swap | 220 | RTFS-10K (inswapper) | CC-BY-SA-4.0 |
| Expression swap | 60 | FaceForensics++ | Academic |
| Full-body GAN | 50 | FakeParts | CC0-1.0 |
| **Total** | **580** | | |
| Usable | 530 | 91.4% of 580 processed successfully | |

---

## Training Configuration

| Parameter | Value |
|:----------|:------|
| Backbone | CLIP ViT-L/14 (OpenAI, frozen) |
| Classifier | MLP 784→512→128→1 |
| Optimizer | AdamW (lr=3e-4, wd=1e-3) |
| Scheduler | Cosine annealing, 15 epochs |
| Batch size | 8 |
| Loss | BCE with pos_weight=0.758 |
| TTA | Horizontal flip average |
| Hardware | NVIDIA T4 (16GB) |
| Training time | ~25 minutes |

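Two rows of this table are compact formulas, and writing them out may help when reproducing the setup. The sketch below gives the closed-form cosine-annealing schedule and a positive-weighted binary cross-entropy on a single logit, in plain Python. It assumes the schedule is stepped once per epoch with a minimum learning rate of 0, and that `pos_weight` scales only the fake (positive) term, as in PyTorch's `BCEWithLogitsLoss`; neither detail is stated in this card, and the function names are illustrative.

```python
import math

def cosine_lr(epoch, base_lr=3e-4, total_epochs=15, min_lr=0.0):
    """Closed-form cosine annealing from base_lr down to min_lr."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

def weighted_bce_with_logit(logit, target, pos_weight=0.758):
    """BCE on a raw logit; pos_weight scales the positive (fake) term only."""
    # Numerically stable log(sigmoid(x))
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    log_one_minus_sig = log_sig - logit  # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus_sig)

# Learning rate at the start, midpoint, and end of the 15-epoch schedule
schedule = [cosine_lr(e) for e in (0, 7.5, 15)]

# An undecided logit (0.0) costs less on a fake label than on a real one
loss_fake = weighted_bce_with_logit(0.0, 1)
loss_real = weighted_bce_with_logit(0.0, 0)
print(schedule, loss_fake, loss_real)
```
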
---

## Limitations

- Full-body GAN videos cannot be detected, because face detection fails on them
- The analytical signals (GIR, TFR, BCR) are weak on their own against modern fakes
- Evaluation covers only 105 test videos; larger benchmarks are pending
- Robustness to adversarial attacks on CLIP has not been tested

---

## Citation

```bibtex
@misc{singh2025viper,
  title   = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
  author  = {Singh, Robin},
  year    = {2025},
  url     = {https://github.com/rxbinsingh/VIPER}
}
```

---

## Author

**Robin Singh** · Bennett University, India

[![GitHub](https://img.shields.io/badge/GitHub-rxbinsingh-black?logo=github)](https://github.com/rxbinsingh)
[![HuggingFace](https://img.shields.io/badge/🤗-rxbinsingh-FFD21E)](https://huggingface.co/rxbinsingh)
[![ResearchGate](https://img.shields.io/badge/ResearchGate-Robin--Singh--61-00CCBB?logo=researchgate)](https://www.researchgate.net/profile/Robin-Singh-61)