| # DeepFake Detector V13 π― | |
| **State-of-the-art deepfake detection ensemble with 699M parameters** | |
| [](https://huggingface.co/ash12321/deepfake-detector-v13) | |
| [](https://huggingface.co/ash12321/deepfake-detector-v13) | |
| [](https://huggingface.co/ash12321/deepfake-detector-v13) | |
| ## π Performance Highlights | |
| - **Average Ensemble F1**: 0.9313 | |
| - **Best Model F1**: 0.9586 (Model 13.3 - Swin-Large) | |
| - **Total Parameters**: 699M (exceeds 500M requirement β ) | |
| - **Training Time**: ~6.1 hours on T4 GPU | |
| ## π Architecture | |
| This model consists of 3 large-scale transformer and CNN models trained sequentially: | |
| | Model | Backbone | Parameters | F1 Score | Training Time | | |
| |-------|----------|------------|----------|---------------| | |
| | **Model 13.1** | ConvNeXt-Large | 198M | 0.8971 | 205.7 min | | |
| | **Model 13.2** | ViT-Large | 304M | 0.9382 | 52.7 min | | |
| | **Model 13.3** | Swin-Large | 197M | **0.9586** | 106.2 min | | |
| **Total: 699M parameters** | |
| ### Model Files | |
| - `model_1.safetensors` - ConvNeXt-Large (752 MB) | |
| - `model_2.safetensors` - ViT-Large (1159 MB) | |
| - `model_3.safetensors` - Swin-Large (747 MB) | |
| ## π― Usage | |
| ### Installation | |
| ```bash | |
| pip install torch torchvision timm safetensors pillow | |
| ``` | |
| ### Quick Start - Single Model | |
| ```python | |
| import torch | |
| import timm | |
| from PIL import Image | |
| from torchvision import transforms | |
| from safetensors.torch import load_file | |
| # Define model architecture | |
| class DeepfakeDetector(torch.nn.Module): | |
| def __init__(self, backbone_name, dropout=0.3): | |
| super().__init__() | |
| self.backbone = timm.create_model(backbone_name, pretrained=False, num_classes=0) | |
| if hasattr(self.backbone, 'num_features'): | |
| feat_dim = self.backbone.num_features | |
| else: | |
| with torch.no_grad(): | |
| feat_dim = self.backbone(torch.randn(1, 3, 224, 224)).shape[1] | |
| self.classifier = torch.nn.Sequential( | |
| torch.nn.Linear(feat_dim, 512), | |
| torch.nn.BatchNorm1d(512), | |
| torch.nn.GELU(), | |
| torch.nn.Dropout(dropout), | |
| torch.nn.Linear(512, 128), | |
| torch.nn.BatchNorm1d(128), | |
| torch.nn.GELU(), | |
| torch.nn.Dropout(dropout * 0.5), | |
| torch.nn.Linear(128, 1) | |
| ) | |
| def forward(self, x): | |
| features = self.backbone(x) | |
| return self.classifier(features).squeeze(-1) | |
| # Load best model (Model 13.3 - Swin-Large) | |
| device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') | |
| model = DeepfakeDetector('swin_large_patch4_window7_224', dropout=0.3) | |
| state_dict = load_file('model_3.safetensors') | |
| model.load_state_dict(state_dict) | |
| model = model.to(device) | |
| model.eval() | |
| # Preprocessing | |
| transform = transforms.Compose([ | |
| transforms.Resize((224, 224)), | |
| transforms.ToTensor(), | |
| transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) | |
| ]) | |
| # Predict | |
| image = Image.open('test_image.jpg').convert('RGB') | |
| input_tensor = transform(image).unsqueeze(0).to(device) | |
| with torch.no_grad(): | |
| logits = model(input_tensor) | |
| probability = torch.sigmoid(logits).item() | |
| prediction = 'FAKE' if probability > 0.5 else 'REAL' | |
| print(f"Prediction: {prediction}") | |
| print(f"Confidence: {probability:.2%}") | |
| ``` | |
| ### Full Ensemble (Recommended) | |
| ```python | |
| import torch | |
| import timm | |
| from PIL import Image | |
| from torchvision import transforms | |
| from safetensors.torch import load_file | |
| class DeepfakeDetector(torch.nn.Module): | |
| def __init__(self, backbone_name, dropout=0.3): | |
| super().__init__() | |
| self.backbone = timm.create_model(backbone_name, pretrained=False, num_classes=0) | |
| if hasattr(self.backbone, 'num_features'): | |
| feat_dim = self.backbone.num_features | |
| else: | |
| with torch.no_grad(): | |
| feat_dim = self.backbone(torch.randn(1, 3, 224, 224)).shape[1] | |
| self.classifier = torch.nn.Sequential( | |
| torch.nn.Linear(feat_dim, 512), | |
| torch.nn.BatchNorm1d(512), | |
| torch.nn.GELU(), | |
| torch.nn.Dropout(dropout), | |
| torch.nn.Linear(512, 128), | |
| torch.nn.BatchNorm1d(128), | |
| torch.nn.GELU(), | |
| torch.nn.Dropout(dropout * 0.5), | |
| torch.nn.Linear(128, 1) | |
| ) | |
| def forward(self, x): | |
| features = self.backbone(x) | |
| return self.classifier(features).squeeze(-1) | |
| # Model configurations | |
| device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') | |
| configs = [ | |
| ('convnext_large', 0.3, 'model_1.safetensors'), | |
| ('vit_large_patch16_224', 0.35, 'model_2.safetensors'), | |
| ('swin_large_patch4_window7_224', 0.3, 'model_3.safetensors') | |
| ] | |
| # Load all models | |
| models = [] | |
| for backbone, dropout, filename in configs: | |
| model = DeepfakeDetector(backbone, dropout) | |
| state_dict = load_file(filename) | |
| model.load_state_dict(state_dict) | |
| model = model.to(device) | |
| model.eval() | |
| models.append(model) | |
| print(f"β Loaded {len(models)} models") | |
| # Preprocessing | |
| transform = transforms.Compose([ | |
| transforms.Resize((224, 224)), | |
| transforms.ToTensor(), | |
| transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) | |
| ]) | |
| # Ensemble prediction | |
| def predict_ensemble(image_path): | |
| image = Image.open(image_path).convert('RGB') | |
| input_tensor = transform(image).unsqueeze(0).to(device) | |
| predictions = [] | |
| with torch.no_grad(): | |
| for model in models: | |
| logits = model(input_tensor) | |
| prob = torch.sigmoid(logits).item() | |
| predictions.append(prob) | |
| # Average ensemble | |
| avg_prob = sum(predictions) / len(predictions) | |
| prediction = 'FAKE' if avg_prob > 0.5 else 'REAL' | |
| return { | |
| 'prediction': prediction, | |
| 'confidence': avg_prob, | |
| 'individual_predictions': predictions | |
| } | |
| # Use it | |
| result = predict_ensemble('test_image.jpg') | |
| print(f"Prediction: {result['prediction']}") | |
| print(f"Ensemble Confidence: {result['confidence']:.2%}") | |
| print(f"Individual Models: {[f'{p:.2%}' for p in result['individual_predictions']]}") | |
| ``` | |
| ## π Training Details | |
| ### Architecture Design | |
| Each model uses: | |
| - **Backbone**: Large pre-trained vision model (frozen initially, fine-tuned) | |
| - **Classifier Head**: | |
| - Linear(feat_dim β 512) + BatchNorm + GELU + Dropout | |
| - Linear(512 β 128) + BatchNorm + GELU + Dropout | |
| - Linear(128 β 1) | |
| ### Training Configuration | |
| - **Loss Function**: Focal Loss with Label Smoothing | |
| - Alpha: 0.25 | |
| - Gamma: 2.5 | |
| - Label Smoothing: 0.12 | |
| - **Optimizer**: AdamW | |
| - Learning Rates: [2e-5, 1.5e-5, 1.8e-5] | |
| - Weight Decay: 3e-4 | |
| - **Scheduler**: CosineAnnealingWarmRestarts (T_0=3, T_mult=2) | |
| - **Epochs**: 10 per model | |
| - **Batch Sizes**: [32, 24, 32] | |
| - **Mixed Precision**: FP16 enabled | |
| - **Gradient Accumulation**: 4 steps | |
| - **Gradient Checkpointing**: Enabled (memory efficiency) | |
| ### Data Augmentation | |
| - Random Horizontal Flip (p=0.5) | |
| - Random Rotation (Β±12Β°) | |
| - Color Jitter (brightness, contrast, saturation: Β±0.15) | |
| - Normalization: ImageNet stats | |
| ## π Performance Analysis | |
| ### Model Comparison | |
| **Model 13.1 (ConvNeXt-Large)** | |
| - β Solid baseline: F1 = 0.8971 | |
| - β CNN-based architecture | |
| - β Good for local feature extraction | |
| **Model 13.2 (ViT-Large)** | |
| - β Strong performance: F1 = 0.9382 | |
| - β Fastest training (52.7 min) | |
| - β Global attention mechanism | |
| **Model 13.3 (Swin-Large)** β **Best Model** | |
| - β Excellent performance: F1 = 0.9586 | |
| - β Hierarchical vision transformer | |
| - β Best balance of accuracy and efficiency | |
| ### Ensemble Benefits | |
| The ensemble approach provides: | |
| - **Improved Robustness**: Different architectures capture different patterns | |
| - **Reduced Variance**: Averaging reduces prediction noise | |
| - **Better Generalization**: Complementary strengths minimize overfitting | |
| - **Higher Accuracy**: Expected ensemble F1 β 0.94-0.96 | |
| ## π§ System Requirements | |
| **Inference (Single Model)** | |
| - GPU: 4GB+ VRAM | |
| - RAM: 8GB+ | |
| - Storage: ~1.2 GB per model | |
| **Inference (Full Ensemble)** | |
| - GPU: 12GB+ VRAM (or run models sequentially on smaller GPU) | |
| - RAM: 16GB+ | |
| - Storage: ~2.7 GB total | |
| **Training** | |
| - GPU: T4 (16GB) or better | |
| - RAM: 12GB+ | |
| - Storage: 8GB+ for checkpoints | |
| ## π Dataset | |
| Trained on: [`ash12321/deepfake-v13-dataset`](https://huggingface.co/datasets/ash12321/deepfake-v13-dataset) | |
| ## π Related Models | |
| - Predecessor: [`ash12321/deepfake-detector-v12`](https://huggingface.co/ash12321/deepfake-detector-v12) | |
| ## π Citation | |
| ```bibtex | |
| @model{v13-deepfake-detector, | |
| title={DeepFake Detector V13: Large-Scale Ensemble}, | |
| author={Ash}, | |
| year={2024}, | |
| publisher={Hugging Face}, | |
| howpublished={\url{https://huggingface.co/ash12321/deepfake-detector-v13}} | |
| } | |
| ``` | |
| ## π License | |
| MIT License - See LICENSE file for details | |
| ## π Acknowledgments | |
| - Built with PyTorch, timm, and Hugging Face | |
| - Trained on Google Colab T4 GPU | |
| - Architectures: ConvNeXt (Meta), ViT (Google), Swin (Microsoft) | |
| --- | |
| **Model Version**: 13.0 | |
| **Last Updated**: November 2024 | |
| **Status**: Production Ready β | |