---
license: mit
language:
- en
tags:
- deepfake-detection
- video-classification
- clip
- computer-vision
- pytorch
- face-forensics
- identity-verification
- video-analysis
- insightface
- arcface
pipeline_tag: video-classification
datasets:
- godmodes/rtfs-10k
- hi-paris/FakeParts
- bitmind/FaceForensicsC23
metrics:
- roc_auc
- accuracy
- f1
- precision
- recall
model-index:
- name: VIPER v3
  results:
  - task:
      type: video-classification
      name: Deepfake Detection
    metrics:
    - name: AUC-ROC
      type: roc_auc
      value: 0.9909
    - name: Accuracy
      type: accuracy
      value: 0.952
    - name: F1 (Fake)
      type: f1
      value: 0.96
    - name: Precision (Fake)
      type: precision
      value: 0.948
    - name: Recall (Fake)
      type: recall
      value: 0.965
---

	# VIPER: Video Identity Perturbation and Extraction Residual

	<p align="center">
	<b>Deepfake detection inspired by displacement reactions in chemistry.</b><br>
	<i>A stronger identity signal displaces and exposes synthetic faces.</i>
	</p>

	<p align="center">
	<a href="https://github.com/rxbinsingh/VIPER"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github" /></a>
	<a href="https://huggingface.co/spaces/rxbinsingh/VIPER"><img src="https://img.shields.io/badge/🤗-Live%20Demo-green" /></a>
	<a href="https://colab.research.google.com/github/rxbinsingh/VIPER/blob/main/notebooks/VIPER_Train_Colab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
	</p>

	---

	![VIPER Banner](assets/Viper_main1.png)

	---

	## Core Idea

	> What if we could expose deepfakes the way chemistry exposes impurities?

	![Displacement Reaction](assets/Displacement_reaction.png)

	```
	AB + C → AC + B

	AB = video frame (fake face B hidden inside context A)
	C = identity anchor (biometric fingerprint from first 8 frames)
	AC = anchor bonds with real context → LOW energy = REAL
	B = fake face displaced/exposed → HIGH energy = FAKE
	```

	---

	## Results

	![Results](assets/VIPER_Results1.png)

| Metric | Value |
|:-------|------:|
| AUC-ROC | 0.9909 |
| Accuracy | 95.2% |
| Fake Recall | 96.5% |
| False Positive Rate | 6.3% |
| Face-swap AUC | 0.9931 |
| Expression-swap AUC | 0.9847 |
| Inference speed | ~4 s/video (GPU) |
| Training time | 25 min (T4) |
| Training data | 530 videos |

	### Per-Manipulation-Type Detection

	![Multiple Types](assets/multiple_types.png)

| Attack Type | AUC | Accuracy | N (test) |
|:------------|----:|---------:|--------:|
| Face swap (inswapper) | 0.9931 | 95.6% | 42 |
| Expression transfer (NeuralTextures) | 0.9847 | 93.7% | 15 |
| All combined | 0.9909 | 95.2% | 105 |

	### Model Progression

| Version | Backbone | Trainable Params | Test AUC |
|:--------|:---------|:----------------:|---------:|
| v1 | EfficientNet-B4 (frozen) | ~500K | 0.9072 |
| v2 | EfficientNet-B4 (unfrozen) | ~2.3M | 0.9309 |
| v3 | CLIP ViT-L/14 (frozen) | ~500K | 0.9909 |

	---

	## Architecture

	![Architecture](assets/Viper_Architecture.png)

```
Video → InsightFace → 16 face crops (224×224)
    │
    ├── Identity Anchor → GIR + TFR + BCR → 16-dim features
    │
    └── CLIP ViT-L/14 (frozen) → 768-dim video embedding
            │
            ▼
    Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE
```

Key design: the CLIP backbone is entirely frozen and only the ~500K-parameter fusion MLP is trained. This is what makes 0.99 AUC reachable from just 530 videos.
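
As a rough sanity check on the ~500K figure, the trainable parameters of the fusion head can be counted by hand. The sketch below assumes the layer layout shown in the Usage code (a `Linear` followed by `BatchNorm1d` for each hidden layer) and counts weights, biases, and BatchNorm scale/shift parameters. `linear_params` and `batchnorm_params` are illustrative helper names, not part of the VIPER code.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features: int) -> int:
    """Learnable scale and shift of a BatchNorm1d layer."""
    return 2 * n_features


CLIP_DIM = 768   # CLIP ViT-L/14 video embedding size
HAND_DIM = 16    # GIR + TFR + BCR features
widths = [CLIP_DIM + HAND_DIM, 512, 128, 1]   # 784 -> 512 -> 128 -> 1

total = 0
for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
    total += linear_params(n_in, n_out)
    is_hidden = i < len(widths) - 2
    if is_hidden:   # assumption: BatchNorm follows each hidden Linear
        total += batchnorm_params(n_out)

print(f"fusion head parameters: {total:,}")
```

Under these assumptions the head has 468,993 parameters, which is in line with the ~500K quoted above.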

	### Three Biometric Signals

| Signal | Method | Captures |
|:------:|:-------|:---------|
| GIR | ArcFace cosine distance | Skull geometry, eye spacing |
| TFR | DCT KL divergence | Skin micro-texture |
| BCR | dlib landmark coupling | Facial muscle dynamics |
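
To make the GIR row concrete, here is a minimal sketch of an identity residual computed as cosine distance to an anchor. It assumes the anchor is the element-wise mean of the face embeddings from the first 8 frames (the card only says the anchor comes from the first 8 frames) and uses toy Python lists in place of real ArcFace embeddings. `identity_anchor` and `gir_scores` are hypothetical names, not VIPER's API.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identity_anchor(embeddings: Sequence[Vector], n_anchor: int = 8) -> List[float]:
    """Assumed anchor: element-wise mean of the first n_anchor frame embeddings."""
    head = embeddings[:n_anchor]
    return [sum(col) / len(head) for col in zip(*head)]


def gir_scores(embeddings: Sequence[Vector], n_anchor: int = 8) -> List[float]:
    """Cosine distance of every frame embedding to the anchor."""
    anchor = identity_anchor(embeddings, n_anchor)
    return [cosine_distance(anchor, e) for e in embeddings]


# Toy 3-dim "embeddings": 8 consistent frames, then one frame with a different identity.
frames = [[1.0, 0.0, 0.0]] * 8 + [[0.0, 1.0, 0.0]]
scores = gir_scores(frames)
print([round(s, 3) for s in scores])
```

Frames that match the anchor score near 0, while a frame whose embedding points elsewhere scores higher. How VIPER turns such per-frame distances into its 16-dim feature vector is not specified here.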

	---

	## Confusion Matrix

```
              Predicted Real   Predicted Fake
Actual Real               45                3
Actual Fake                2               55
```

	Only 5 errors out of 105 test videos.
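
The headline metrics follow directly from these four counts. The short sketch below recomputes them, treating "fake" as the positive class; the variable names are illustrative.

```python
# Counts from the confusion matrix above ("fake" is the positive class).
tn, fp = 45, 3    # actual real: predicted real / predicted fake
fn, tp = 2, 55    # actual fake: predicted real / predicted fake

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
f1 = 2 * precision * recall / (precision + recall)

print(f"videos              = {total}")
print(f"accuracy            = {accuracy:.1%}")
print(f"precision (fake)    = {precision:.1%}")
print(f"recall (fake)       = {recall:.1%}")
print(f"false positive rate = {false_positive_rate:.2%}")
print(f"F1 (fake)           = {f1:.2f}")
```

These reproduce the reported accuracy (95.2%), fake precision (94.8%), fake recall (96.5%), and F1 (0.96). The false positive rate is 3/48 = 6.25%, reported as 6.3%.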

	---

	## Usage

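```python
import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download the VIPER v3 checkpoint from the Hub
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load CLIP ViT-L/14 (OpenAI weights); only the visual tower is used below
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()


class VIPERv3(nn.Module):
    """Frozen CLIP visual encoder plus a small fusion MLP head."""

    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        for p in self.clip.parameters():
            p.requires_grad = False  # backbone stays frozen
        # 784 = 768-dim CLIP embedding + 16 hand-crafted features
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, 1),
        )

    def forward(self, crops, hand_feats):
        # Assumed forward pass (a sketch, not confirmed against the repo):
        # mean-pool the per-frame CLIP embeddings, then append the 16 features.
        b, t = crops.shape[:2]
        frame_emb = self.clip(crops.flatten(0, 1))            # (b*t, 768)
        video_emb = frame_emb.reshape(b, t, -1).mean(dim=1)   # (b, 768)
        return self.head(torch.cat([video_emb, hand_feats], dim=1)).squeeze(1)


model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Input: crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)
```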
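
The final step noted in the snippet (logit → sigmoid → P(fake)), combined with the horizontal-flip TTA used by the model, might look like the sketch below. It assumes TTA averages the two probabilities (averaging the logits would be an equally plausible reading) and uses a 0.5 decision threshold. Neither detail is stated in this card, and `fake_probability` and `classify` are illustrative names.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability, written to avoid overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def fake_probability(logit: float, logit_flipped: float) -> float:
    """Assumed TTA: average P(fake) over the original and the mirrored clip."""
    return 0.5 * (sigmoid(logit) + sigmoid(logit_flipped))


def classify(p_fake: float, threshold: float = 0.5) -> str:
    """Threshold P(fake); 0.5 is an assumed default, not a tuned value."""
    return "FAKE" if p_fake >= threshold else "REAL"


# Hypothetical logits for one video and its horizontally flipped copy.
p = fake_probability(2.0, 1.0)
print(f"P(fake) = {p:.3f} -> {classify(p)}")
```

In practice the two logits would come from running the model once on `crops` and once on `crops.flip(-1)`, with the same `hand_feats` or recomputed ones; which of the two the author uses is not documented here.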

	---

	## Training Dataset

| Category | Count | Source | License |
|:---------|------:|:-------|:--------|
| Real | 250 | RTFS-10K | CC-BY-SA-4.0 |
| Face swap | 220 | RTFS-10K (inswapper) | CC-BY-SA-4.0 |
| Expression swap | 60 | FaceForensics++ | Academic |
| Full-body GAN | 50 | FakeParts | CC0-1.0 |
| Total | 580 | | |
| Usable | 530 | 91.4% success | |

	---

	## Training Configuration

| Parameter | Value |
|:----------|:------|
| Backbone | CLIP ViT-L/14 (OpenAI, frozen) |
| Classifier | MLP 784→512→128→1 |
| Optimizer | AdamW (lr=3e-4, wd=1e-3) |
| Scheduler | Cosine annealing, 15 epochs |
| Batch size | 8 |
| Loss | BCE with pos_weight=0.758 |
| TTA | Horizontal flip average |
| Hardware | NVIDIA T4 (16 GB) |
| Training time | ~25 minutes |
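
For readers who want to see what two of these settings mean numerically, the sketch below writes out a cosine-annealed learning rate and a binary cross-entropy with a positive-class weight in plain Python. It assumes the schedule decays to zero over the 15 epochs with one step per epoch, and that "fake" is the positive class scaled by `pos_weight`, following the definition used by PyTorch's `BCEWithLogitsLoss`. The card does not spell out either detail, and the function names are illustrative.

```python
import math


def cosine_lr(epoch: int, base_lr: float = 3e-4, total_epochs: int = 15,
              min_lr: float = 0.0) -> float:
    """Cosine annealing: base_lr at epoch 0, min_lr at epoch total_epochs."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


def weighted_bce(logit: float, target: float, pos_weight: float = 0.758) -> float:
    """BCE on a raw logit; the positive (target = 1) term is scaled by pos_weight."""
    # Numerically stable log(sigmoid(x)) and log(1 - sigmoid(x)).
    softplus_tail = math.log1p(math.exp(-abs(logit)))
    log_p = min(logit, 0.0) - softplus_tail
    log_not_p = -max(logit, 0.0) - softplus_tail
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)


for epoch in (0, 5, 10, 15):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.2e}")

print(f"loss(fake, logit=0) = {weighted_bce(0.0, 1.0):.4f}")
print(f"loss(real, logit=0) = {weighted_bce(0.0, 0.0):.4f}")
```

With `pos_weight` below 1 and fake as the positive class, a misjudged fake clip contributes less to the loss than an equally misjudged real clip.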

	---

	## Limitations

- Full-body GAN videos cannot be detected, because face detection fails on them.
- The analytical signals (GIR/TFR/BCR) are weak on their own against modern fakes.
- Evaluation covers only 105 test videos; larger benchmarks are pending.
- The model has not been tested against adversarial attacks on CLIP.

	---

	## Citation

```bibtex
@misc{singh2025viper,
  title  = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
  author = {Singh, Robin},
  year   = {2025},
  url    = {https://github.com/rxbinsingh/VIPER}
}
```

	---

	## Author

	Robin Singh · Bennett University, India

	[![GitHub](https://img.shields.io/badge/GitHub-rxbinsingh-black?logo=github)](https://github.com/rxbinsingh)
	[![HuggingFace](https://img.shields.io/badge/🤗-rxbinsingh-FFD21E)](https://huggingface.co/rxbinsingh)
	[![ResearchGate](https://img.shields.io/badge/ResearchGate-Robin--Singh--61-00CCBB?logo=researchgate)](https://www.researchgate.net/profile/Robin-Singh-61)