metadata

license: mit
language:
  - en
tags:
  - deepfake-detection
  - video-classification
  - clip
  - computer-vision
  - pytorch
  - face-forensics
  - identity-verification
  - video-analysis
  - insightface
  - arcface
pipeline_tag: video-classification
datasets:
  - godmodes/rtfs-10k
  - hi-paris/FakeParts
  - bitmind/FaceForensicsC23
metrics:
  - roc_auc
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: VIPER v3
    results:
      - task:
          type: video-classification
          name: Deepfake Detection
        metrics:
          - name: AUC-ROC
            type: roc_auc
            value: 0.9909
          - name: Accuracy
            type: accuracy
            value: 0.952
          - name: F1 (Fake)
            type: f1
            value: 0.96
          - name: Precision (Fake)
            type: precision
            value: 0.948
          - name: Recall (Fake)
            type: recall
            value: 0.965

VIPER: Video Identity Perturbation and Extraction Residual

Deepfake detection inspired by displacement reactions in chemistry.
A stronger identity signal displaces and exposes synthetic faces.

Core Idea

What if we could expose deepfakes the way chemistry exposes impurities?

AB + C → AC + B

AB = video frame (fake face B hidden inside context A)
C  = identity anchor (biometric fingerprint from first 8 frames)
AC = anchor bonds with real context → LOW energy = REAL
B  = fake face displaced/exposed   → HIGH energy = FAKE

Results

Metric	Value
AUC-ROC	0.9909
Accuracy	95.2%
Fake Recall	96.5%
False Positive Rate	6.3%
Face-swap AUC	0.9931
Expression-swap AUC	0.9847
Inference speed	~4s/video (GPU)
Training time	25 min (T4)
Training data	530 videos

Per-Manipulation-Type Detection

Attack Type	AUC	Accuracy	N (test)
Face swap (inswapper)	0.9931	95.6%	42
Expression transfer (NeuralTextures)	0.9847	93.7%	15
All combined (57 fake + 48 real)	0.9909	95.2%	105

Model Progression

Version	Backbone	Trainable Params	Test AUC
v1	EfficientNet-B4 (frozen)	~500K	0.9072
v2	EfficientNet-B4 (unfrozen)	~2.3M	0.9309
v3	CLIP ViT-L/14 (frozen)	~500K	0.9909

Architecture

Video → InsightFace → 16 face crops (224×224)
         │
         ├── Identity Anchor → GIR + TFR + BCR → 16-dim features
         │
         └── CLIP ViT-L/14 (frozen) → 768-dim video embedding
                   │
                   ▼
         Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE

Key design: the CLIP backbone is entirely frozen, and only the ~500K-parameter fusion MLP is trained. This is what makes 0.99 AUC reachable from just 530 videos.
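The "~500K trainable parameters" figure can be checked by hand. The sketch below counts the parameters of the fusion MLP, assuming the layer layout shown in the Usage snippet (Linear layers with bias, and a BatchNorm1d after each hidden layer). The helper names are illustrative and not part of the VIPER code.

```python
def linear_params(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n):
    """Learnable scale and shift of a BatchNorm1d layer."""
    return 2 * n


# 784 = 768-dim CLIP embedding + 16 hand-crafted features
layers = [784, 512, 128, 1]

total = 0
for i, (n_in, n_out) in enumerate(zip(layers, layers[1:])):
    total += linear_params(n_in, n_out)
    if i < len(layers) - 2:  # BatchNorm follows the hidden layers only
        total += batchnorm_params(n_out)

print(f"trainable head parameters: {total:,}")
```

Under these assumptions the head has 468,993 parameters, which rounds to the ~500K quoted above.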

Three Biometric Signals

Signal	Method	Captures
GIR	ArcFace cosine distance	Skull geometry, eye spacing
TFR	DCT KL divergence	Skin micro-texture
BCR	dlib landmark coupling	Facial muscle dynamics
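The card names the methods behind each signal but not their implementation. As a rough illustration only, the sketch below shows the two distance measures the table mentions: cosine distance between each frame's embedding and an identity anchor (as in GIR), and KL divergence between two histograms (as in TFR). The toy vectors, the mean-of-first-8-frames anchor, and all function names are assumptions. Real inputs would be ArcFace embeddings and DCT coefficient histograms. BCR is not sketched.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two histograms, normalised to sum to 1."""
    sp, sq = sum(p), sum(q)
    return sum(
        (a / sp) * math.log((a / sp + eps) / (b / sq + eps))
        for a, b in zip(p, q)
        if a > 0
    )


# Toy "embeddings": the identity changes halfway through the 16 frames.
frames = [[1.0, 0.0, 0.0]] * 8 + [[0.0, 1.0, 0.0]] * 8
anchor = mean_vector(frames[:8])  # anchor built from the first 8 frames
gir = [cosine_distance(anchor, f) for f in frames]

# Toy "texture histograms" for an anchor frame and a later frame.
tfr = kl_divergence([5, 3, 2], [2, 3, 5])
```

In this toy case the distance to the anchor jumps from 0 to 1 at the point where the identity changes, which is the kind of residual the identity anchor is meant to expose.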

Confusion Matrix

                 Predicted Real    Predicted Fake
Actual Real           45                3
Actual Fake            2               55

Only 5 errors out of 105 test videos: 3 false positives and 2 false negatives.
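The headline metrics follow directly from these four counts. A short check, using only the numbers in the confusion matrix (variable names are illustrative):

```python
# Counts copied from the confusion matrix ("fake" is the positive class)
tn, fp = 45, 3   # actual real: predicted real / predicted fake
fn, tp = 2, 55   # actual fake: predicted real / predicted fake

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)           # Precision (Fake)
recall = tp / (tp + fn)              # Recall (Fake)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)                 # False Positive Rate

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} fpr={fpr:.4f}")
```

These reproduce the reported accuracy (95.2%), precision (0.948), and recall (96.5%). F1 comes out at about 0.957 and the false positive rate at 6.25%, which the card rounds to 0.96 and 6.3%.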

Usage

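import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hugging Face Hub
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load CLIP ViT-L/14 (OpenAI weights); only the visual tower is used
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()

# Model: frozen CLIP visual encoder + trainable fusion MLP head
class VIPERv3(nn.Module):
    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        for p in self.clip.parameters():
            p.requires_grad = False  # keep the backbone frozen
        # 784 = 768-dim CLIP embedding + 16 hand-crafted features
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout*0.5),
            nn.Linear(128, 1))

    # NOTE: the original card does not show forward(). This version is an
    # assumption: it mean-pools the per-frame CLIP embeddings into one
    # 768-dim video embedding before fusing with the hand-crafted features.
    def forward(self, crops, hand_feats):
        b, t = crops.shape[:2]
        with torch.no_grad():
            emb = self.clip(crops.flatten(0, 1))           # (B*T, 768)
        video_emb = emb.reshape(b, t, -1).mean(dim=1)      # (B, 768)
        fused = torch.cat([video_emb, hand_feats], dim=1)  # (B, 784)
        return self.head(fused).squeeze(-1)                # (B,) logits

model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Input: crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)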
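The pipeline also applies test-time augmentation (TTA) by averaging over a horizontal flip. A minimal, framework-free sketch of that idea follows. It assumes the two sigmoid probabilities are averaged (the card does not say whether logits or probabilities are averaged), and `toy_score` is a hypothetical stand-in for the model. With PyTorch tensors the flip would typically be `torch.flip(crops, dims=[-1])`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hflip(frame):
    """Mirror a frame (list of pixel rows) left to right."""
    return [row[::-1] for row in frame]


def predict_with_tta(score_fn, frames):
    """Average P(fake) over the original and horizontally flipped frames."""
    p_orig = sigmoid(score_fn(frames))
    p_flip = sigmoid(score_fn([hflip(f) for f in frames]))
    return 0.5 * (p_orig + p_flip)


def toy_score(frames):
    """Hypothetical logit: left-column sum minus right-column sum."""
    return sum(row[0] - row[-1] for f in frames for row in f)


frames = [[[1.0, 0.0], [1.0, 0.0]]]  # one 2x2 single-channel "frame"
p_fake = predict_with_tta(toy_score, frames)
```

The toy scorer is deliberately left/right asymmetric, so the flipped pass cancels its bias and the averaged probability lands at 0.5. A video would be labelled FAKE when the averaged probability exceeds a chosen threshold, for example 0.5.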

Training Dataset

Category	Count	Source	License
Real	250	RTFS-10K	CC-BY-SA-4.0
Face swap	220	RTFS-10K (inswapper)	CC-BY-SA-4.0
Expression swap	60	FaceForensics++	Academic
Full-body GAN	50	FakeParts	CC0-1.0
Total	580
Usable	530 (91.4% success rate)

Training Configuration

Parameter	Value
Backbone	CLIP ViT-L/14 (OpenAI, frozen)
Classifier	MLP 784→512→128→1
Optimizer	AdamW (lr=3e-4, wd=1e-3)
Scheduler	Cosine annealing, 15 epochs
Batch size	8
Loss	BCE with pos_weight=0.758
TTA	Horizontal flip average
Hardware	NVIDIA T4 (16GB)
Training time	~25 minutes
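Two of these settings are easier to read as formulas. The sketch below reproduces, in plain Python, a cosine-annealed learning rate over 15 epochs and a positive-class-weighted binary cross-entropy on logits. It assumes the schedule decays to zero, is stepped once per epoch, and that the loss follows the same definition as `torch.nn.BCEWithLogitsLoss` with `pos_weight`. None of these details are confirmed by the card, and the function names are illustrative.

```python
import math

BASE_LR, EPOCHS, POS_WEIGHT = 3e-4, 15, 0.758


def cosine_lr(epoch, base_lr=BASE_LR, t_max=EPOCHS, eta_min=0.0):
    """Cosine annealing from base_lr at epoch 0 down to eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


def weighted_bce_with_logits(logit, label, pos_weight=POS_WEIGHT):
    """BCE on a raw logit; the fake (label=1) term is scaled by pos_weight."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p))


schedule = [cosine_lr(e) for e in range(EPOCHS + 1)]
loss_fake = weighted_bce_with_logits(0.0, 1)  # uncertain prediction on a fake
loss_real = weighted_bce_with_logits(0.0, 0)  # uncertain prediction on a real
```

A `pos_weight` below 1 means an equally uncertain prediction costs less on a fake video than on a real one, which slightly down-weights the fake class during training.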

Limitations

Full-body GAN videos cannot be detected, because face detection fails on them.
The analytical signals (GIR, TFR, BCR) are individually weak on modern fakes.
Evaluation used only 105 test videos; larger benchmarks are pending.
Robustness to adversarial attacks on CLIP has not been tested.

Citation

@misc{singh2025viper,
  title   = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
  author  = {Singh, Robin},
  year    = {2025},
  url     = {https://github.com/rxbinsingh/VIPER}
}

Author

Robin Singh · Bennett University, India