---
license: mit
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- google/vit-base-patch16-224-in21k
library_name: transformers
tags:
- deepfake-detection
- fake-image-detection
---

# ViT Deepfake Detection Model

## Model Description

This is a fine-tuned Vision Transformer (ViT) model for binary image classification that detects deepfake images. The model is based on `google/vit-base-patch16-224-in21k` and has been fine-tuned on the OpenForensics dataset to distinguish real images from fake (AI-generated or manipulated) ones.

## Model Details

- **Model Type:** Vision Transformer (ViT) for Image Classification
- **Base Model:** google/vit-base-patch16-224-in21k
- **Task:** Binary Image Classification (Real vs. Fake Detection)
- **Language:** N/A (Computer Vision)
- **License:** Apache 2.0

## Intended Use

### Primary Use Cases

- Detecting AI-generated or manipulated images
- Content moderation and verification
- Research in deepfake detection
- Media authenticity verification

### Out-of-Scope Use

- This model should not be used as the sole method for making critical decisions about content authenticity
- Not intended for surveillance or privacy-invasive applications
- May not generalize well to deepfake techniques not present in the training data

## Training Data

The model was trained on the **OpenForensics dataset** with the following distribution:

- **Training Set:** 16,000 images
- **Validation Set:** 2,000 images
- **Test Set:** 2,000 images

Images were preprocessed and transformed with `ViTImageProcessor` using standard normalization.
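As a rough illustration of that normalization step: the processor for this checkpoint defaults to a 224×224 resize and per-channel mean/std of 0.5, which maps pixel intensities from [0, 1] to [-1, 1]. A minimal pure-Python sketch of the mapping (for illustration only; in practice `ViTImageProcessor` handles this):

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map a pixel intensity in [0, 1] to the ViT input range via (x - mean) / std."""
    return (value - mean) / std

# Black (0.0) maps to -1.0, mid-gray (0.5) to 0.0, white (1.0) to 1.0.
print(normalize_pixel(0.0), normalize_pixel(0.5), normalize_pixel(1.0))
```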
## Training Procedure

### Hyperparameters

```
Training Arguments:
- Batch Size: 24 per device
- Gradient Accumulation Steps: 1
- Mixed Precision: FP16
- Number of Epochs: 10
- Learning Rate: 3e-5
- Weight Decay: 0.02
- Warmup Ratio: 0.08
- LR Scheduler: Cosine
- Label Smoothing: 0.05
- Optimizer: AdamW (default)
```

### Training Hardware

- GPU: Tesla T4
- Training Time: ~14 minutes for 10 epochs

### Data Augmentation

Standard ViT preprocessing with normalization applied via `ViTImageProcessor`.

## Performance

### Validation Set Results (Best Epoch: Epoch 5)

| Metric | Score |
|--------|-------|
| Accuracy | 96.22% |
| F1 Score | 96.22% |
| Precision | 96.30% |
| Recall | 96.22% |

### Test Set Results

| Metric | Score |
|--------|-------|
| Accuracy | **96.56%** |

### Training Progress

The model showed consistent improvement across epochs:

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|-------|---------------|-----------------|----------|----------|
| 1 | 0.2259 | 0.2567 | 92.89% | 92.88% |
| 2 | 0.2002 | 0.2360 | 93.44% | 93.43% |
| 3 | 0.1388 | 0.1925 | 96.11% | 96.11% |
| 4 | 0.1322 | 0.2161 | 95.67% | 95.67% |
| 5 | 0.1182 | 0.2208 | **96.22%** | **96.22%** |
| 6-10 | 0.1170-0.1171 | 0.2132-0.2142 | 95.67-95.78% | 95.67-95.78% |

## Usage

### Loading the Model

```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")

# Load and preprocess image (convert to RGB to handle grayscale or RGBA inputs)
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()

# Get label
labels = {0: "real", 1: "fake"}
print(f"Prediction: {labels[predicted_class]}")

# Get confidence scores
probabilities = torch.nn.functional.softmax(logits, dim=-1)
confidence = probabilities[0][predicted_class].item()
print(f"Confidence: {confidence:.2%}")
```

### Batch Prediction

```python
from transformers import pipeline

# Create classification pipeline
classifier = pipeline("image-classification", model="YOUR_USERNAME/vit-deepfake-detector")

# Predict on a single image
result = classifier("path_to_image.jpg")
print(result)

# Predict on multiple images
images = ["image1.jpg", "image2.jpg", "image3.jpg"]
results = classifier(images)
for img, result in zip(images, results):
    print(f"{img}: {result}")
```

## Limitations and Biases

### Known Limitations

- **Dataset Bias:** The model was trained on the OpenForensics dataset, which may not represent all types of deepfakes or manipulation techniques
- **Generalization:** Performance may degrade on deepfake generation methods not present in the training data
- **Adversarial Robustness:** The model has not been explicitly tested against adversarial attacks
- **Resolution Dependency:** Best performance on images around 224×224 pixels (the ViT input size)

### Potential Biases

The model's performance may vary across different:

- Image sources and quality levels
- Demographic representations in images
- Types of manipulation techniques
- Content domains (faces, landscapes, objects, etc.)
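Because of these limitations, one practical safeguard is to abstain on low-confidence predictions and route those images to human review rather than acting on the label directly. A minimal sketch of this pattern (the `softmax` helper, example logits, and the 0.9 threshold are illustrative assumptions, not part of this model card):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_abstain(logits, labels=("real", "fake"), threshold=0.9):
    """Return a label only when confidence clears the threshold; otherwise abstain."""
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    if probs[top] < threshold:
        return "uncertain"  # defer to human review
    return labels[top]

print(classify_with_abstain([4.0, 0.0]))  # clear margin -> "real"
print(classify_with_abstain([0.2, 0.1]))  # near-tie -> "uncertain"
```

The threshold trades coverage for reliability and should be tuned on a held-out set for the target deployment.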
## Ethical Considerations

- This model should be used responsibly and not for harassment or privacy invasion
- Decisions based on this model should involve human oversight, especially in high-stakes scenarios
- Users should be aware that deepfake detection is an evolving field, and no model is perfect
- False positives and false negatives can have real-world consequences

## Citation

If you use this model, please cite:

```bibtex
@misc{vit-deepfake-detector,
  author = {YOUR_NAME},
  title = {ViT Deepfake Detection Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/vit-deepfake-detector}}
}
```

## Authors

- Dr. Lucy Liu
- Muhammad Hamza Sohail
- Ayaan Mohammed
- Shadab Karim
- Kirti Dhir

## Disclaimer

This model is provided for research and educational purposes. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when deploying this model.