ViT Deepfake Detection Model

Model Description

This is a Vision Transformer (ViT) fine-tuned for binary image classification: detecting deepfake images. It is based on google/vit-base-patch16-224-in21k and was fine-tuned on the OpenForensics dataset to distinguish real images from fake (AI-generated or manipulated) ones.

Model Details

  • Model Type: Vision Transformer (ViT) for Image Classification
  • Base Model: google/vit-base-patch16-224-in21k
  • Task: Binary Image Classification (Real vs Fake Detection)
  • Language: N/A (Computer Vision)
  • License: Apache 2.0

Intended Use

Primary Use Cases

  • Detecting AI-generated or manipulated images
  • Content moderation and verification
  • Research in deepfake detection
  • Media authenticity verification

Out-of-Scope Use

  • This model should not be used as the sole method for making critical decisions about content authenticity
  • Not intended for surveillance or privacy-invasive applications
  • May not generalize well to deepfake techniques not present in the training data

Training Data

The model was trained on the OpenForensics dataset with the following distribution:

  • Training Set: 16,000 images
  • Validation Set: 2,000 images
  • Test Set: 2,000 images

Images were preprocessed and transformed using ViTImageProcessor with standard normalization.
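For reference, the preprocessing that ViTImageProcessor applies for this checkpoint can be written out by hand as below. This is a sketch, not the card's code: the 224×224 resize and the per-channel mean/std of 0.5 are the documented defaults for google/vit-base-patch16-224-in21k, and the gray test image is a placeholder.

```python
import numpy as np
from PIL import Image

def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    """Mirror the ViTImageProcessor defaults for vit-base-patch16-224-in21k:
    resize to 224x224, scale pixels to [0, 1], normalize with mean/std 0.5."""
    image = image.convert("RGB").resize((size, size), Image.BILINEAR)
    arr = np.asarray(image).astype(np.float32) / 255.0   # HWC, values in [0, 1]
    arr = (arr - 0.5) / 0.5                              # normalize to [-1, 1]
    return arr.transpose(2, 0, 1)[None]                  # NCHW: (1, 3, 224, 224)

# Placeholder input: a synthetic mid-gray image
dummy = Image.new("RGB", (640, 480), color=(128, 128, 128))
pixels = preprocess(dummy)
print(pixels.shape)  # (1, 3, 224, 224)
```

In practice you would call the processor itself (as in the Usage section below); this sketch only makes the normalization explicit.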

Training Procedure

Hyperparameters

Training Arguments:
- Batch Size: 24 per device
- Gradient Accumulation Steps: 1
- Mixed Precision: FP16
- Number of Epochs: 10
- Learning Rate: 3e-5
- Weight Decay: 0.02
- Warmup Ratio: 0.08
- LR Scheduler: Cosine
- Label Smoothing: 0.05
- Optimizer: AdamW (default)
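Assuming training used the Hugging Face Trainer, the list above maps onto a TrainingArguments object roughly like this. The output_dir is a placeholder, not taken from the card:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
args = TrainingArguments(
    output_dir="vit-deepfake-detector",   # placeholder
    per_device_train_batch_size=24,
    gradient_accumulation_steps=1,
    fp16=True,                            # requires a CUDA GPU (card reports a Tesla T4)
    num_train_epochs=10,
    learning_rate=3e-5,
    weight_decay=0.02,
    warmup_ratio=0.08,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.05,
)
```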

Training Hardware

  • GPU: Tesla T4
  • Training Time: ~14 minutes for 10 epochs

Data Augmentation

Standard ViT preprocessing with normalization applied via ViTImageProcessor.

Performance

Validation Set Results (Best Epoch: 5)

Metric      Score
Accuracy    96.22%
F1 Score    96.22%
Precision   96.30%
Recall      96.22%

Test Set Results

Metric      Score
Accuracy    96.56%
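A compute_metrics function of roughly this shape would reproduce the metrics above; this is a sketch, and the "weighted" averaging mode is an assumption (the card does not state how precision/recall/F1 were averaged):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Trainer-style metrics callback: logits + labels -> metric dict."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Tiny worked example: 3 of 4 predictions correct -> accuracy 0.75
logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.3], [0.1, 0.9]])
labels = np.array([0, 1, 0, 0])
metrics = compute_metrics((logits, labels))
print(metrics["accuracy"])  # 0.75
```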

Training Progress

Validation accuracy improved through epoch 5 (the best checkpoint) and then plateaued:

Epoch   Training Loss   Validation Loss   Accuracy        F1 Score
1       0.2259          0.2567            92.89%          92.88%
2       0.2002          0.2360            93.44%          93.43%
3       0.1388          0.1925            96.11%          96.11%
4       0.1322          0.2161            95.67%          95.67%
5       0.1182          0.2208            96.22%          96.22%
6-10    0.1170-0.1171   0.2132-0.2142     95.67-95.78%    95.67-95.78%

Usage

Loading the Model

from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")
model.eval()

# Load and preprocess image (convert to RGB so grayscale/RGBA inputs also work)
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()

# Map class index to label (check model.config.id2label to confirm the mapping)
labels = {0: "real", 1: "fake"}
print(f"Prediction: {labels[predicted_class]}")

# Get confidence score
probabilities = torch.nn.functional.softmax(logits, dim=-1)
confidence = probabilities[0, predicted_class].item()
print(f"Confidence: {confidence:.2%}")

Batch Prediction

from transformers import pipeline

# Create classification pipeline
classifier = pipeline("image-classification", model="YOUR_USERNAME/vit-deepfake-detector")

# Predict on single image
result = classifier("path_to_image.jpg")
print(result)

# Predict on multiple images
images = ["image1.jpg", "image2.jpg", "image3.jpg"]
results = classifier(images)
for img, result in zip(images, results):
    print(f"{img}: {result}")

Limitations and Biases

Known Limitations

  • Dataset Bias: The model was trained on the OpenForensics dataset, which may not represent all types of deepfakes or manipulation techniques
  • Generalization: Performance may degrade on deepfake generation methods not present in the training data
  • Adversarial Robustness: The model has not been explicitly tested against adversarial attacks
  • Resolution Dependency: Best performance on images around 224x224 pixels (ViT input size)

Potential Biases

  • The model's performance may vary across different:
    • Image sources and quality levels
    • Demographic representations in images
    • Types of manipulation techniques
    • Content domains (faces, landscapes, objects, etc.)

Ethical Considerations

  • This model should be used responsibly and not for harassment or privacy invasion
  • Decisions based on this model should involve human oversight, especially in high-stakes scenarios
  • Users should be aware that deepfake detection is an evolving field, and no model is perfect
  • False positives and false negatives can have real-world consequences

Citation

If you use this model, please cite:

@misc{vit-deepfake-detector,
  author = {YOUR_NAME},
  title = {ViT Deepfake Detection Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/vit-deepfake-detector}}
}

Author

  • Dr. Lucy Liu
  • Muhammad Hamza Sohail
  • Ayaan Mohammed
  • Shadab Karim
  • Kirti Dhir

Disclaimer:

This model is provided for research and educational purposes. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when deploying this model.
