ViT Deepfake Detection Model

Model Description

This is a Vision Transformer (ViT) fine-tuned for binary image classification: detecting deepfake images. It is based on google/vit-base-patch16-224-in21k and was fine-tuned on the OpenForensics dataset to distinguish real images from fake (AI-generated or manipulated) ones.

Model Details

  • Model Type: Vision Transformer (ViT) for Image Classification
  • Base Model: google/vit-base-patch16-224-in21k
  • Task: Binary Image Classification (Real vs Fake Detection)
  • Language: N/A (Computer Vision)
  • License: Apache 2.0

Intended Use

Primary Use Cases

  • Detecting AI-generated or manipulated images
  • Content moderation and verification
  • Research in deepfake detection
  • Media authenticity verification

Out-of-Scope Use

  • This model should not be used as the sole method for making critical decisions about content authenticity
  • Not intended for surveillance or privacy-invasive applications
  • May not generalize well to deepfake techniques not present in the training data

Training Data

The model was trained on the OpenForensics dataset with the following distribution:

  • Training Set: 16,000 images
  • Validation Set: 2,000 images
  • Test Set: 2,000 images

Images were preprocessed and transformed using ViTImageProcessor with standard normalization.
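For reference, the preprocessing that ViTImageProcessor applies for this checkpoint can be written out by hand as below. This is a sketch, not the card's code: the 224×224 resize and the per-channel mean/std of 0.5 are the documented defaults for google/vit-base-patch16-224-in21k, and the gray test image is a placeholder.

```python
import numpy as np
from PIL import Image

def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    """Mirror the ViTImageProcessor defaults for vit-base-patch16-224-in21k:
    resize to 224x224, scale pixels to [0, 1], normalize with mean/std 0.5."""
    image = image.convert("RGB").resize((size, size), Image.BILINEAR)
    arr = np.asarray(image).astype(np.float32) / 255.0   # HWC, values in [0, 1]
    arr = (arr - 0.5) / 0.5                              # normalize to [-1, 1]
    return arr.transpose(2, 0, 1)[None]                  # NCHW: (1, 3, 224, 224)

# Placeholder input: a synthetic mid-gray image
dummy = Image.new("RGB", (640, 480), color=(128, 128, 128))
pixels = preprocess(dummy)
print(pixels.shape)  # (1, 3, 224, 224)
```

In practice you would call the processor itself (as in the Usage section below); this sketch only makes the normalization explicit.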

Training Procedure

Hyperparameters

Training Arguments:
- Batch Size: 24 per device
- Gradient Accumulation Steps: 1
- Mixed Precision: FP16
- Number of Epochs: 10
- Learning Rate: 3e-5
- Weight Decay: 0.02
- Warmup Ratio: 0.08
- LR Scheduler: Cosine
- Label Smoothing: 0.05
- Optimizer: AdamW (default)
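Assuming training used the Hugging Face Trainer, the list above maps onto a TrainingArguments object roughly like this. The output_dir is a placeholder, not taken from the card:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
args = TrainingArguments(
    output_dir="vit-deepfake-detector",   # placeholder
    per_device_train_batch_size=24,
    gradient_accumulation_steps=1,
    fp16=True,                            # requires a CUDA GPU (card reports a Tesla T4)
    num_train_epochs=10,
    learning_rate=3e-5,
    weight_decay=0.02,
    warmup_ratio=0.08,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.05,
)
```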

Training Hardware

  • GPU: Tesla T4
  • Training Time: ~14 minutes for 10 epochs

Data Augmentation

Standard ViT preprocessing with normalization applied via ViTImageProcessor.

Performance

Validation Set Results (Best Epoch: 5)

Metric      Score
Accuracy    96.22%
F1 Score    96.22%
Precision   96.30%
Recall      96.22%

Test Set Results

Metric      Score
Accuracy    96.56%
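A compute_metrics function of roughly this shape would reproduce the metrics above; this is a sketch, and the "weighted" averaging mode is an assumption (the card does not state how precision/recall/F1 were averaged):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Trainer-style metrics callback: logits + labels -> metric dict."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Tiny worked example: 3 of 4 predictions correct -> accuracy 0.75
logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.3], [0.1, 0.9]])
labels = np.array([0, 1, 0, 0])
metrics = compute_metrics((logits, labels))
print(metrics["accuracy"])  # 0.75
```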

Training Progress

Validation accuracy improved through epoch 5 (the best checkpoint) and then plateaued:

Epoch   Training Loss   Validation Loss   Accuracy        F1 Score
1       0.2259          0.2567            92.89%          92.88%
2       0.2002          0.2360            93.44%          93.43%
3       0.1388          0.1925            96.11%          96.11%
4       0.1322          0.2161            95.67%          95.67%
5       0.1182          0.2208            96.22%          96.22%
6-10    0.1170-0.1171   0.2132-0.2142     95.67-95.78%    95.67-95.78%

Usage

Loading the Model

from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")
model.eval()

# Load and preprocess image (convert to RGB so grayscale/RGBA inputs also work)
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()

# Map class index to label (check model.config.id2label to confirm the mapping)
labels = {0: "real", 1: "fake"}
print(f"Prediction: {labels[predicted_class]}")

# Get confidence score
probabilities = torch.nn.functional.softmax(logits, dim=-1)
confidence = probabilities[0, predicted_class].item()
print(f"Confidence: {confidence:.2%}")

Batch Prediction

from transformers import pipeline

# Create classification pipeline
classifier = pipeline("image-classification", model="YOUR_USERNAME/vit-deepfake-detector")

# Predict on single image
result = classifier("path_to_image.jpg")
print(result)

# Predict on multiple images
images = ["image1.jpg", "image2.jpg", "image3.jpg"]
results = classifier(images)
for img, result in zip(images, results):
    print(f"{img}: {result}")

Limitations and Biases

Known Limitations

  • Dataset Bias: The model was trained on the OpenForensics dataset, which may not represent all types of deepfakes or manipulation techniques
  • Generalization: Performance may degrade on deepfake generation methods not present in the training data
  • Adversarial Robustness: The model has not been explicitly tested against adversarial attacks
  • Resolution Dependency: Best performance on images around 224x224 pixels (ViT input size)

Potential Biases

  • The model's performance may vary across different:
    • Image sources and quality levels
    • Demographic representations in images
    • Types of manipulation techniques
    • Content domains (faces, landscapes, objects, etc.)

Ethical Considerations

  • This model should be used responsibly and not for harassment or privacy invasion
  • Decisions based on this model should involve human oversight, especially in high-stakes scenarios
  • Users should be aware that deepfake detection is an evolving field, and no model is perfect
  • False positives and false negatives can have real-world consequences

Citation

If you use this model, please cite:

@misc{vit-deepfake-detector,
  author = {YOUR_NAME},
  title = {ViT Deepfake Detection Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/vit-deepfake-detector}}
}

Author

  • Dr. Lucy Liu
  • Muhammad Hamza Sohail
  • Ayaan Mohammed
  • Shadab Karim
  • Kirti Dhir

Disclaimer:

This model is provided for research and educational purposes. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when deploying this model.
