mhamza-007 committed on
Commit
6fccdf6
·
verified ·
1 Parent(s): 971591d

Update README.md

  1. README.md +125 -3
README.md CHANGED
@@ -1,3 +1,125 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ metrics:
6
+ - accuracy
7
+ pipeline_tag: video-classification
8
+ tags:
9
+ - cvit
10
+ - deepfake-detection
11
+ - video-classification
12
+ - computer-vision
13
+ - vision-transformer
14
+ - binary-classification
15
+ ---
16
+
17
+ # 🔍 Convolutional Vision Transformer (CViT) for Deepfake Detection
18
+
19
+ The **Convolutional Vision Transformer (CViT)** is a hybrid architecture that combines the spatial feature extraction of CNNs with the long-range dependency modeling of Vision Transformers (ViT). The model is built for detecting deepfake videos and is trained on the DeepFake Detection Challenge (DFDC) dataset.
20
+
21
+ ---
22
+
23
+ ## Model Architecture
24
+
25
+ ### 1. Feature Learning (FL) Module - CNN Backbone
26
+ - Composed of **17 convolutional operations**.
27
+ - Unlike traditional VGG architectures, **FL focuses purely on feature extraction**, not classification.
28
+ - Accepts input of size **224 × 224 × 3 (RGB image)**.
29
+ - Outputs a **512 × 7 × 7** feature map.
30
+ - Contains **10.8 million learnable parameters**.
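As a quick sanity check on the shapes listed above, the sketch below assumes the FL module halves the spatial resolution five times (as VGG-style backbones do with stride-2 pooling). The number of pooling stages is not stated in this README, so `num_pool_stages` and the helper `fl_output_shape` are assumptions made for illustration.

```python
def fl_output_shape(height=224, width=224, channels_out=512, num_pool_stages=5):
    """Return (C, H, W) after halving the spatial size `num_pool_stages` times.

    Assumption: each stage is a stride-2 downsampling step; the README only
    states the input size (224 x 224 x 3) and the output size (512 x 7 x 7).
    """
    for _ in range(num_pool_stages):
        height //= 2
        width //= 2
    return channels_out, height, width


shape = fl_output_shape()
# Number of values handed to the ViT module per image
num_features = shape[0] * shape[1] * shape[2]
print(shape, num_features)  # (512, 7, 7) 25088
```

Under this assumption, 224 / 2^5 = 7, which matches the stated 7 × 7 output.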
31
+
32
+ ### 2. Vision Transformer (ViT) Module
33
+ - Receives CNN output (**512 × 7 × 7**) as its input.
34
+ - Converts the 7×7 patches into a **1 × 1024** sequence using linear embedding.
35
+ - Adds **positional embeddings** of shape **(2 × 1024)**.
36
+ - ViT Encoder uses:
37
+   - **Multi-Head Self Attention (MSA)** with **8 attention heads**.
38
+   - **MLP blocks** with:
39
+     - First linear layer of **2048** units.
40
+     - Final linear layer of **2 units** (binary classification: Fake / Real).
41
+   - **ReLU activation** and **Softmax** for final probabilities.
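The dimensions above can be made concrete with a little arithmetic. The sketch below assumes the 1024-dimensional embedding is split evenly across the 8 heads, which is the usual multi-head attention layout but is not stated here. It also shows how softmax turns two output logits into probabilities. The logit values are hypothetical, and this README does not say which output index is Fake and which is Real.

```python
import math

EMBED_DIM = 1024   # embedding width from the README
NUM_HEADS = 8      # attention heads from the README
HEAD_DIM = EMBED_DIM // NUM_HEADS  # per-head width, assuming an even split


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from the final 2-unit linear layer for one face crop
probs = softmax([0.3, 1.7])
print(HEAD_DIM, probs)
```

The two probabilities always sum to 1, so a single value is enough to describe the binary decision.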
42
+
43
+ ---
44
+
45
+ ## 🧪 Experimental Results
46
+
47
+ The CViT model was evaluated on multiple deepfake datasets:
48
+
49
+ ### 📊 FaceForensics++ Accuracy
50
+ | Dataset | Accuracy |
51
+ |--------------------------------------|----------|
52
+ | FaceForensics++ FaceSwap | 69% |
53
+ | FaceForensics++ DeepFakeDetection | 91% |
54
+ | FaceForensics++ Deepfake | 93% |
55
+ | FaceForensics++ FaceShifter | 46% |
56
+ | FaceForensics++ NeuralTextures | 60% |
57
+
58
+ > **Note**: Poor performance on the FaceShifter dataset is attributed to the model's difficulty in learning subtle visual artifacts.
59
+
60
+ ---
61
+
62
+ ### 🧪 DFDC Evaluation
63
+
64
+ | Model | Validation accuracy | Test accuracy |
65
+ |---------------------|------------|--------|
66
+ | **CViT** | 87.25% | **91.5%** |
67
+
68
+ - **Unseen DFDC test videos**: 400
69
+ - **Accuracy**: 91.5%
70
+ - **AUC Score**: 0.91
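To put the test accuracy in absolute terms, the short calculation below converts the reported percentage into video counts. It assumes the 91.5% figure was computed over exactly the 400 unseen test videos listed above.

```python
num_videos = 400   # unseen DFDC test videos
accuracy = 0.915   # reported test accuracy

# Round to the nearest whole video to absorb floating-point error
correct = round(num_videos * accuracy)
errors = num_videos - correct
print(correct, errors)  # 366 34
```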
71
+
72
+ ---
73
+
74
+ ### 🧪 UADFV AUC Comparison
75
+
76
+ | Model | Validation | FaceSwap | Face2Face |
77
+ |---------------|------------|----------|-----------|
78
+ | **CViT** | **93.75%** | 69.69% | 69.39% |
79
+
80
+ ---
81
+
82
+ ## ⚙️ Training Configuration
83
+
84
+ - **Loss Function**: Binary Cross Entropy (BCE)
85
+ - **Optimizer**: Adam
86
+ - **Learning Rate**: 1e-4
87
+ - **Weight Decay**: 1e-6
88
+ - **Batch Size**: 32
89
+ - **Epochs**: 50
90
+ - **Learning Rate Scheduler**: Multiplies the learning rate by 0.1 every 15 epochs
91
+ - **Normalization**:
92
+   - Mean: `[0.485, 0.456, 0.406]`
93
+   - Std: `[0.229, 0.224, 0.225]`
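The schedule and normalization above can be sketched in plain Python. Two assumptions are made here: epochs are counted from 0, so the decay takes effect at epochs 15, 30, and 45, and pixel values are scaled to `[0, 1]` before normalization. The helper names `lr_at_epoch` and `normalize_pixel` are illustrative and are not taken from the training code.

```python
BASE_LR = 1e-4    # initial learning rate
STEP_SIZE = 15    # epochs between decays
GAMMA = 0.1       # multiplicative decay factor
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def lr_at_epoch(epoch):
    """Step decay: multiply the base rate by GAMMA once per STEP_SIZE epochs."""
    return BASE_LR * GAMMA ** (epoch // STEP_SIZE)


def normalize_pixel(rgb):
    """Per-channel (value - mean) / std for one RGB pixel with values in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


for epoch in (0, 14, 15, 30, 45, 49):
    print(epoch, lr_at_epoch(epoch))
print(normalize_pixel([1.0, 1.0, 1.0]))
```

In PyTorch, the same schedule is normally expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)`. Over 50 epochs the rate ends at roughly 1e-7.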
94
+
95
+ ---
96
+
97
+ ## 🧪 Inference Setup
98
+
99
+ - **Input**: 30 normalized facial images (per video)
100
+ - **Classification**:
101
+   - Uses **log loss function** to compute confidence.
102
+   - Output is a probability `y ∈ [0, 1]`
103
+     - `0 ≤ y < 0.5`: Real
104
+     - `0.5 ≤ y ≤ 1`: Fake
105
+   - Log loss penalizes:
106
+     - Random guesses
107
+     - Confident but incorrect predictions
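The decision rule and the behavior of log loss can be illustrated with a minimal sketch. It treats label 1 as Fake, consistent with the threshold above. Averaging the per-frame probabilities into one video score is an assumption, since this README does not say how the 30 frames are combined. The function names are illustrative.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss for one prediction; y_true is 1 (Fake) or 0 (Real)."""
    y_pred = min(max(y_pred, eps), 1 - eps)  # avoid log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


def classify(y, threshold=0.5):
    """Apply the README's rule: y >= 0.5 is Fake, otherwise Real."""
    return "Fake" if y >= threshold else "Real"


def video_score(frame_probs):
    """Assumed aggregation: mean of the per-frame fake probabilities."""
    return sum(frame_probs) / len(frame_probs)


# Loss for a truly fake video under three kinds of prediction
print(log_loss(1, 0.99))  # confident and correct: small loss
print(log_loss(1, 0.5))   # random guess: ln(2)
print(log_loss(1, 0.01))  # confident but wrong: large loss

frames = [0.9] * 20 + [0.2] * 10  # hypothetical scores for 30 face crops
print(classify(video_score(frames)))
```

The three printed losses increase in that order, which is why both random guesses and confident mistakes are penalized.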
108
+
109
+ ---
110
+
111
+ ## 🛠 Inference Example
112
+
113
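+ ```python
+ from huggingface_hub import hf_hub_download
+ import torch
+
+ # Download the checkpoint from the Hugging Face Hub
+ model_path = hf_hub_download(
+     repo_id="mhamza-007/cvit_deepfake_detection",
+     filename="cvit2_deepfake_detection_ep_50.pth",
+ )
+
+ # Load the checkpoint on CPU (example).
+ # weights_only=False unpickles arbitrary objects; use it only with trusted files.
+ checkpoint = torch.load(model_path, map_location="cpu", weights_only=False)
+
+ # A .pth file may hold a pickled model or only a state dict. Which one this
+ # file holds is not stated here, so both cases are handled.
+ if isinstance(checkpoint, torch.nn.Module):
+     model = checkpoint
+     model.eval()
+ else:
+     # Weights only: build the CViT architecture first, then pass these
+     # weights to model.load_state_dict(...) before calling model.eval().
+     print(type(checkpoint))
+ ```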