Susovan85 faisalishfaq2005 committed on
Commit
96159fa
·
0 Parent(s):

Duplicate from faisalishfaq2005/deepfake-detection-efficientnet-vit

Co-authored-by: Muhammad Faisal Ishfaq <faisalishfaq2005@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,37 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.jpg filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
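To see which files in this commit the patterns above route through Git LFS, a rough check can be sketched with Python's `fnmatch`. This is an approximation under stated assumptions: `fnmatch` does not implement full gitattributes semantics (directory patterns such as `saved_model/**/*` are left out), `LFS_PATTERNS` is a hand-copied subset of the list above, and `is_lfs_tracked` is a hypothetical helper that is not part of the repository.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hand-copied subset of the .gitattributes patterns (assumption: basename-only patterns).
LFS_PATTERNS = ["*.bin", "*.pt", "*.pth", "*.safetensors", "*.onnx", "*.jpg", "*.png"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Return True if the file's basename matches any LFS pattern (approximate)."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for path in ["model.safetensors", "assets/architecture.png", "model.py", "config.json"]:
        kind = "LFS" if is_lfs_tracked(path) else "plain git"
        print(f"{path}: {kind}")
```

Under these assumptions the weights and the two PNG assets are stored as LFS pointers, which matches the "Git LFS Details" entries and the `model.safetensors` pointer shown later in this commit.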
README.md ADDED
@@ -0,0 +1,174 @@
+ ---
+ language: en
+ library_name: pytorch
+ license: mit
+ tags:
+ - deepfake-detection
+ - image-classification
+ - video-analysis
+ - efficientvit
+ - pytorch
+ pipeline_tag: image-classification
+
+ safetensors:
+   total: 1
+   format: safetensors
+   weight_dtype: float32
+   size_in_bytes: 82444700
+
+ model-index:
+ - name: Deepfake Detection with Improved EfficientViT
+   results:
+   - task:
+       type: image-classification
+       name: Deepfake Detection
+     dataset:
+       type: custom
+       name: FaceForensics++, Celeb-DF
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.8864
+     - name: Precision
+       type: precision
+       value: 0.8920
+     - name: Recall
+       type: recall
+       value: 0.8792
+     - name: F1-score
+       type: f1
+       value: 0.8856
+
+ config: config.json
+ metadata:
+   model_type: EfficientViT
+   num_parameters: 20026725
+   precision: float32
+   framework: pytorch
+   license: mit
+   model_format: safetensors
+   size: 82MB
+ ---
+
+ # Deepfake Detection with Improved EfficientViT
+
+ This repository contains a **PyTorch model for deepfake detection** based on an improved **EfficientViT** architecture (an EfficientNet-B0 backbone followed by ViT-style transformer blocks), trained on video data.
+
+ The model scores individual face crops sampled from a video, and `inference.py` aggregates the per-frame predictions into a single label: **real (0)** or **fake (1)**.
+
+ ## Model Architecture
+
+ ![Model Architecture](assets/architecture.png)
+
+ ## Inference Pipeline
+
+ ![Inference Pipeline](assets/inference_pipeline.png)
+
+ ---
+
+ ## 🧩 Model Description
+
+ **Architecture:** Improved EfficientViT
+ **Backbone:** EfficientNet-B0 for feature extraction
+ **Head:** Six ViT-style transformer blocks over the 49 backbone feature tokens plus a class token, followed by an MLP classification head
+ **Input:** Face crops from video frames (224×224 RGB images)
+ **Output:** One logit per face crop; `predict_vedio` returns a video-level binary label (0=Real, 1=Fake)
+
+ **Key Features:**
+
+ - Extracts faces from frames using MTCNN
+ - Supports inference on raw video files
+ - Computes frame-level probabilities internally; `predict_vedio` currently returns only the video-level class
+
+ ---
+
+ ## 📁 Repository Structure
+
+ ```
+ deepfake-detection-efficientnet-vit/
+
+ ├── assets/              # Architecture and inference pipeline diagrams
+ ├── model.py             # ImprovedEfficientViT class
+ ├── inference.py         # Functions to run inference on videos
+ ├── model.safetensors    # Trained weights
+ ├── config.json          # Optional model metadata
+ ├── requirements.txt     # Required packages
+ └── README.md
+
+ ```
+
+ ## ⚡ Installation
+
+ `git clone https://huggingface.co/faisalishfaq2005/deepfake-detection-efficientnet-vit`
+
+ `cd deepfake-detection-efficientnet-vit`
+
+ `pip install -r requirements.txt`
+
+ The programmatic usage example also imports `huggingface_hub` and `safetensors`, which `requirements.txt` does not list:
+
+ `pip install huggingface_hub safetensors`
+
+ ## 🚀 Usage
+
+ ### 1. Programmatic Inference
+
+ ```python
+ from huggingface_hub import hf_hub_download
+ from safetensors.torch import load_file
+ import torch
+
+ # model.py and inference.py come from the cloned repository
+ from model import ImprovedEfficientViT
+ from inference import predict_vedio
+
+ # 1️⃣ Download the checkpoint from Hugging Face
+ checkpoint_path = hf_hub_download(
+     repo_id="faisalishfaq2005/deepfake-detection-efficientnet-vit",
+     filename="model.safetensors"
+ )
+
+ # 2️⃣ Load the model weights safely
+ state_dict = load_file(checkpoint_path, device="cpu")
+ model = ImprovedEfficientViT()
+ model.load_state_dict(state_dict)
+ model.eval()
+
+ # 3️⃣ Move to GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)
+
+ # 4️⃣ Run inference on a video
+ video_path = "sample_video.mp4"
+ result = predict_vedio(video_path, model)
+ print(result)
+ # Example Output: {'class': 1}
+
+ ```
+
+ ### 2. Manual Download
+
+ 1. Go to the Hugging Face model page.
+ 2. Download `model.safetensors`, `model.py`, and `inference.py`.
+ 3. Place them in the same folder locally.
+ 4. Install the requirements and call `predict_vedio()` (the function name as spelled in `inference.py`).
+
+ ## 📄 License
+
+ This model is released under the MIT License.
+ You are free to use, modify, and distribute it, provided the copyright and license notice are retained.
+
+ ## 📚 Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @inproceedings{faisalishfaq2025efficientvit,
+   title={Deepfake Detection with Efficientnet and ViT},
+   author={Faisal Ishfaq},
+   year={2025}
+ }
+ ```
+
+
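The precision, recall, and F1 values in the README's `model-index` block can be cross-checked against each other, since F1 is the harmonic mean of precision and recall. The sketch below only re-derives F1 from the two reported numbers; it says nothing about how the metrics were measured, and `f1_score` is a helper defined here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the model card's model-index block.
REPORTED_PRECISION = 0.8920
REPORTED_RECALL = 0.8792
REPORTED_F1 = 0.8856

if __name__ == "__main__":
    computed = f1_score(REPORTED_PRECISION, REPORTED_RECALL)
    print(f"computed F1 = {computed:.4f}, reported F1 = {REPORTED_F1:.4f}")
```

The computed value rounds to the reported 0.8856, so the three figures are at least internally consistent.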
assets/architecture.png ADDED

Git LFS Details

  • SHA256: f8fb94408905f906b3bad72b59872cf56e8eab865d5252b7eae7a2d100d62d94
  • Pointer size: 131 Bytes
  • Size of remote file: 633 kB
assets/inference_pipeline.png ADDED

Git LFS Details

  • SHA256: 093933321b0f1e56675eaac6153ba9138e5e9332ed05a3baf04b290c22c2be61
  • Pointer size: 132 Bytes
  • Size of remote file: 1.3 MB
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "architectures": ["ImprovedEfficientViT"],
+   "model_type": "efficientnetb0_Vit_blocks_multi_head_attention",
+   "framework": "pytorch",
+   "precision": "float32",
+   "num_parameters": 20026725,
+   "model_format": "safetensors",
+   "license": "mit",
+   "tags": [
+     "image-classification",
+     "video-analysis",
+     "deepfake-detection",
+     "efficientvit",
+     "pytorch"
+   ]
+ }
inference.py ADDED
@@ -0,0 +1,91 @@
+ from torchvision import transforms
+ import torch
+ from PIL import Image
+ from model import ImprovedEfficientViT
+
+ import os
+ import cv2
+ from mtcnn import MTCNN
+
+ def extract_faces(video_path, target_frames=20):
+     """Sample frames from a video and return 224x224 RGB face crops found by MTCNN."""
+     detector = MTCNN()
+
+     cap = cv2.VideoCapture(video_path)
+     if not cap.isOpened():
+         print(f"Error: Could not open video {video_path}")
+         return []
+
+     fps = cap.get(cv2.CAP_PROP_FPS)
+     total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+
+     # Step between sampled frames; at least 1 so short videos still advance.
+     frame_interval = max(total_frames // target_frames, 1)
+
+     face_images = []
+
+     for i in range(0, total_frames, frame_interval):
+         cap.set(cv2.CAP_PROP_POS_FRAMES, i)
+         ret, frame = cap.read()
+         if not ret:
+             continue
+
+         # OpenCV decodes frames as BGR; convert to RGB before detection.
+         rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+         faces = detector.detect_faces(rgb_frame)
+
+         for face in faces:
+             # Skip low-confidence detections.
+             if face['confidence'] < 0.9:
+                 continue
+             x, y, w, h = face['box']
+             # Clamp negative box coordinates to the frame edge.
+             x, y = max(x, 0), max(y, 0)
+             face_img = rgb_frame[y:y+h, x:x+w]
+
+             if face_img.size == 0:
+                 continue
+
+             face_img = cv2.resize(face_img, (224, 224))
+             face_images.append(face_img)
+
+     cap.release()
+     return face_images
+
+ # Per-face preprocessing; the single mean/std value is broadcast across the RGB channels.
+ transform_vedio = transforms.Compose([
+     transforms.ToPILImage(),
+     transforms.Resize((224, 224)),
+     transforms.ToTensor(),
+     transforms.Normalize(mean=[0.5], std=[0.5])
+ ])
+
+
+ def predict_vedio(video_path, model_vedio):
+     """Classify a video as real (0) or fake (1) by voting over per-face predictions."""
+     pred_list = []
+     prob_list = []
+
+     faces = extract_faces(video_path, target_frames=20)
+
+     transformed_faces = [transform_vedio(face) for face in faces]
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     model_vedio.to(device)
+     model_vedio.eval()  # disable dropout even if the caller forgot to
+
+     for face in transformed_faces:
+         face = face.to(device).unsqueeze(0)
+
+         with torch.no_grad():
+             logit = model_vedio(face)
+             prob = torch.sigmoid(logit)
+             pred = int(prob.item() > 0.5)
+             pred_list.append(pred)
+             prob_list.append(prob)
+
+     # Count faces predicted as real (label 0).
+     count = 0
+     for ele in pred_list:
+         if ele == 0:
+             count += 1
+
+     # The video is labelled real only if more than 3 faces were predicted real.
+     # Note: with no detected faces, count stays 0 and the result is 1 (fake).
+     predicted_class = 0 if count > 3 else 1
+     return {
+         "class": predicted_class
+     }
+
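Two pieces of arithmetic in `inference.py` are easy to misread: the frame-sampling step and the final vote. The sketch below isolates both in plain Python so they can be tried without a video or a model. `sampled_frame_indices` and `vote` are hypothetical helper names that mirror the expressions in `extract_faces` and `predict_vedio`; they are not functions from the repository.

```python
from __future__ import annotations


def sampled_frame_indices(total_frames: int, target_frames: int = 20) -> list[int]:
    """Frame indices visited by the sampling loop in extract_faces."""
    frame_interval = max(total_frames // target_frames, 1)
    return list(range(0, total_frames, frame_interval))


def vote(frame_preds: list[int], min_real_votes: int = 3) -> int:
    """Return 0 (real) if more than `min_real_votes` predictions are 0, else 1 (fake)."""
    real_votes = sum(1 for pred in frame_preds if pred == 0)
    return 0 if real_votes > min_real_votes else 1


if __name__ == "__main__":
    # 400 // 20 = 20, so every 20th frame is read: 20 frames in total.
    print(len(sampled_frame_indices(400)))
    # 39 // 20 = 1, so every frame is read: 39 frames, more than the target.
    print(len(sampled_frame_indices(39)))
    # Four "real" votes exceed the threshold of 3, so the video is labelled real.
    print(vote([0, 0, 0, 0, 1, 1, 1, 1, 1, 1]))
    # No detected faces means no real votes, so the video is labelled fake.
    print(vote([]))
```

Because the interval is floored, `target_frames` is a lower bound on the number of frames read for videos with at least that many frames, not an exact count. The vote threshold is also absolute rather than proportional, so a video with many detected faces needs only four "real" predictions to be labelled real.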
model.py ADDED
@@ -0,0 +1,99 @@
+ import torch
+ import torch.nn as nn
+ import torchvision
+ import math
+
+ class ImprovedEfficientBackbone(nn.Module):
+     def __init__(self):
+         super().__init__()
+         self.efficientnet = torchvision.models.efficientnet_b0(weights=torchvision.models.EfficientNet_B0_Weights.IMAGENET1K_V1)
+         self.features = self.efficientnet.features
+
+     def forward(self, x):
+         return self.features(x)
+
+ class ImprovedPatchEmbedding(nn.Module):
+     def __init__(self, in_channels=1280, embed_dim=384):
+         super().__init__()
+         self.proj = nn.Linear(in_channels, embed_dim)
+
+     def forward(self, x):
+         """
+         Input: [B, 1280, 7, 7]
+         Output: [B, 49, 384]
+         """
+         B, C, H, W = x.shape
+         x = x.flatten(2).transpose(1, 2)
+         x = self.proj(x)
+         return x
+
+
+ class ImprovedViTBlock(nn.Module):
+     def __init__(self, embed_dim=384, num_heads=4, mlp_ratio=4):
+         super().__init__()
+         self.norm1 = nn.LayerNorm(embed_dim)
+         self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
+         self.norm2 = nn.LayerNorm(embed_dim)
+         self.mlp = nn.Sequential(
+             nn.Linear(embed_dim, embed_dim * mlp_ratio),
+             nn.GELU(),
+             nn.Linear(embed_dim * mlp_ratio, embed_dim)
+         )
+         self.dropout = nn.Dropout(0.2)
+
+     def forward(self, x):
+         # Pre-norm self-attention: normalise once and reuse as query, key and value.
+         h = self.norm1(x)
+         x = x + self.dropout(self.attn(h, h, h)[0])
+         x = x + self.dropout(self.mlp(self.norm2(x)))
+         return x
+
+ class ImprovedEfficientViT(nn.Module):
+     def __init__(self, embed_dim=384, depth=6, num_heads=4):
+         super().__init__()
+         self.backbone = ImprovedEfficientBackbone()
+         self.patch_embed = ImprovedPatchEmbedding(embed_dim=embed_dim)
+
+         self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
+         # 50 positions: 49 patch tokens plus the class token.
+         self.register_buffer("pos_embed", self._get_sinusoidal_encoding(50, embed_dim))
+
+         self.patch_dropout = nn.Dropout(0.2)
+         self.pos_dropout = nn.Dropout(0.2)
+
+         self.blocks = nn.ModuleList([ImprovedViTBlock(embed_dim, num_heads) for _ in range(depth)])
+
+         self.head = nn.Sequential(
+             nn.LayerNorm(embed_dim),
+             nn.Linear(embed_dim, 128),
+             nn.GELU(),
+             nn.Dropout(0.3),
+             nn.Linear(128, 1)
+         )
+
+         self._init_weights()
+
+     def _init_weights(self):
+         nn.init.trunc_normal_(self.cls_token, std=0.02)
+
+     def _get_sinusoidal_encoding(self, seq_len, dim):
+         pe = torch.zeros(seq_len, dim)
+         position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
+         div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
+         pe[:, 0::2] = torch.sin(position * div_term)
+         pe[:, 1::2] = torch.cos(position * div_term)
+         return pe.unsqueeze(0)
+
+     def forward(self, x):
+         features = self.backbone(x)
+         tokens = self.patch_embed(features)
+         tokens = self.patch_dropout(tokens)
+
+         B = tokens.shape[0]
+         cls_tokens = self.cls_token.expand(B, -1, -1)
+         x = torch.cat((cls_tokens, tokens), dim=1)
+         x = x + self.pos_embed[:, :x.size(1), :]
+         x = self.pos_dropout(x)
+
+         for block in self.blocks:
+             x = block(x)
+
+         # Classify from the final state of the class token; the output is a single logit.
+         cls_final = x[:, 0]
+         return self.head(cls_final)
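The positional table registered as `pos_embed` can be reproduced without PyTorch, which makes the layout easier to inspect. The sketch below is a pure-Python re-derivation of the formula in `_get_sinusoidal_encoding` plus the token-count arithmetic implied by the `[B, 1280, 7, 7]` docstring. `sinusoidal_encoding`, `GRID`, and `NUM_TOKENS` are names introduced here for illustration, and the stride of 32 is inferred from the 224-pixel input and the 7×7 feature map rather than stated in the source.

```python
import math


def sinusoidal_encoding(seq_len: int, dim: int) -> list:
    """Pure-Python version of the sin/cos table built in _get_sinusoidal_encoding."""
    table = [[0.0] * dim for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, dim, 2):
            # Same frequency term as torch.exp(arange(0, dim, 2) * (-log(10000) / dim)).
            angle = pos * math.exp(i * (-math.log(10000.0) / dim))
            table[pos][i] = math.sin(angle)          # even columns hold the sine
            if i + 1 < dim:
                table[pos][i + 1] = math.cos(angle)  # odd columns hold the cosine
    return table


# A 224x224 input reduced by an overall stride of 32 gives a 7x7 feature grid.
GRID = 224 // 32
NUM_TOKENS = GRID * GRID + 1  # 49 patch tokens plus one class token

if __name__ == "__main__":
    table = sinusoidal_encoding(NUM_TOKENS, 384)
    print(len(table), len(table[0]))  # 50 384
    print(table[0][:4])               # [0.0, 1.0, 0.0, 1.0]
```

Position 0 belongs to the class token, so its encoding is the constant pattern of zeros and ones. Because the table is registered as a buffer rather than a parameter, it is saved with the weights but not trained.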
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bbebb0e63de194963276739b73f194a9eda09221c3e73563a8fa87ddaac38120
+ size 82444700
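The three lines above are a Git LFS pointer, not the weights themselves: the `oid` is the SHA-256 of the real file and `size` is its length in bytes. A downloaded `model.safetensors` could be checked against the pointer roughly as follows. This is a minimal sketch using only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, and the local file path is whatever the reader downloaded.

```python
import hashlib
import os

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:bbebb0e63de194963276739b73f194a9eda09221c3e73563a8fa87ddaac38120
size 82444700
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict) -> bool:
    """Compare a local file's size and SHA-256 digest with the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large weight files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    print(pointer["hash_algo"], pointer["size"])
```

Calling `matches_pointer("model.safetensors", pointer)` on a local copy would then confirm that the download is complete and unmodified.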
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ torch
+ torchvision
+ opencv-python
+ mtcnn
+ Pillow
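The package names above differ from the module names the code imports (`opencv-python` provides `cv2`, `Pillow` provides `PIL`), and the README's usage example additionally imports `huggingface_hub` and `safetensors`, which this file does not list. A quick environment check can be sketched with the standard library. `missing_modules` is a hypothetical helper, and the two module lists are assumptions drawn from the imports in `inference.py`, `model.py`, and the README.

```python
from importlib.util import find_spec

# Import names used by inference.py and model.py.
REQUIRED_MODULES = ["torch", "torchvision", "cv2", "mtcnn", "PIL"]
# Extra import names used only by the README usage example.
README_EXTRA_MODULES = ["huggingface_hub", "safetensors"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES + README_EXTRA_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All modules found.")
```

Running this after `pip install -r requirements.txt` shows whether the two extra packages still need to be installed before trying the programmatic example.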