---
license: mit
language:
  - en
metrics:
  - accuracy
  - f1
  - precision
  - recall
base_model:
  - google/vit-base-patch16-224-in21k
library_name: transformers
tags:
  - deepfake detection
  - fake-image detection
---

# ViT Deepfake Detection Model

## Model Description

This is a fine-tuned Vision Transformer (ViT) model for binary image classification to detect deepfake images. The model is based on google/vit-base-patch16-224-in21k and has been fine-tuned on the OpenForensics dataset to distinguish between real and fake (AI-generated/manipulated) images.

## Model Details

- **Model Type:** Vision Transformer (ViT) for image classification
- **Base Model:** google/vit-base-patch16-224-in21k
- **Task:** Binary image classification (real vs. fake detection)
- **Language:** N/A (computer vision)
- **License:** MIT

## Intended Use

### Primary Use Cases

- Detecting AI-generated or manipulated images
- Content moderation and verification
- Research in deepfake detection
- Media authenticity verification

### Out-of-Scope Use

- This model should not be used as the sole basis for critical decisions about content authenticity
- Not intended for surveillance or privacy-invasive applications
- May not generalize well to deepfake techniques absent from the training data

## Training Data

The model was trained on the OpenForensics dataset with the following split:

- **Training Set:** 16,000 images
- **Validation Set:** 2,000 images
- **Test Set:** 2,000 images

Images were preprocessed and transformed using ViTImageProcessor with standard normalization.
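
For reference, the default `ViTImageProcessor` configuration matches the base checkpoint used here: it resizes to 224×224, rescales to [0, 1], and normalizes each channel with mean and standard deviation 0.5. A minimal sketch of the preprocessing step (using a random image and the processor's defaults rather than the fine-tuned repo, to stay self-contained):

```python
from transformers import ViTImageProcessor
from PIL import Image
import numpy as np

# Default ViTImageProcessor: resize to 224x224, rescale to [0, 1],
# then normalize with per-channel mean/std of 0.5 (values end up in [-1, 1])
processor = ViTImageProcessor()

# A random stand-in image; in practice this would be loaded from disk
image = Image.fromarray(np.uint8(np.random.rand(480, 640, 3) * 255))
inputs = processor(images=image, return_tensors="pt")

print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```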

## Training Procedure

### Hyperparameters

Training Arguments:
- Batch Size: 24 per device
- Gradient Accumulation Steps: 1
- Mixed Precision: FP16
- Number of Epochs: 10
- Learning Rate: 3e-5
- Weight Decay: 0.02
- Warmup Ratio: 0.08
- LR Scheduler: Cosine
- Label Smoothing: 0.05
- Optimizer: AdamW (default)
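
The settings above map onto the standard `transformers` `TrainingArguments`; a hypothetical reconstruction (the `output_dir` and any arguments not listed above are assumptions, not confirmed details of the original training script):

```python
from transformers import TrainingArguments

# Sketch of the training configuration described above;
# output_dir is an assumption, and the optimizer defaults to AdamW
training_args = TrainingArguments(
    output_dir="vit-deepfake-detector",  # assumed
    per_device_train_batch_size=24,
    gradient_accumulation_steps=1,
    fp16=True,
    num_train_epochs=10,
    learning_rate=3e-5,
    weight_decay=0.02,
    warmup_ratio=0.08,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.05,
)
```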

### Training Hardware

- **GPU:** Tesla T4
- **Training Time:** ~14 minutes for 10 epochs

### Data Augmentation

Standard ViT preprocessing with normalization, applied via ViTImageProcessor.

## Performance

### Validation Set Results (Best Epoch: 5)

| Metric    | Score  |
|-----------|--------|
| Accuracy  | 96.22% |
| F1 Score  | 96.22% |
| Precision | 96.30% |
| Recall    | 96.22% |
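
Metrics like these are typically produced by a `compute_metrics` callback passed to the `Trainer`. A minimal sketch using scikit-learn (the function body and the weighted averaging are illustrative assumptions, not confirmed details of this model's evaluation code):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute accuracy/precision/recall/F1 from Trainer (logits, labels) pairs."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy check with made-up logits: predictions are [0, 1, 0, 1] vs labels [0, 1, 0, 0]
logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.3], [0.1, 0.9]])
labels = np.array([0, 1, 0, 0])
print(compute_metrics((logits, labels)))  # accuracy: 0.75
```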

### Test Set Results

| Metric   | Score  |
|----------|--------|
| Accuracy | 96.56% |

### Training Progress

The model showed consistent improvement across epochs:

| Epoch | Training Loss | Validation Loss | Accuracy     | F1 Score     |
|-------|---------------|-----------------|--------------|--------------|
| 1     | 0.2259        | 0.2567          | 92.89%       | 92.88%       |
| 2     | 0.2002        | 0.2360          | 93.44%       | 93.43%       |
| 3     | 0.1388        | 0.1925          | 96.11%       | 96.11%       |
| 4     | 0.1322        | 0.2161          | 95.67%       | 95.67%       |
| 5     | 0.1182        | 0.2208          | 96.22%       | 96.22%       |
| 6-10  | 0.1170-0.1171 | 0.2132-0.2142   | 95.67-95.78% | 95.67-95.78% |

## Usage

### Loading the Model

```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")

# Load and preprocess the image; convert to RGB to handle grayscale or RGBA inputs
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()

# Map the class index to a label
labels = {0: "real", 1: "fake"}
print(f"Prediction: {labels[predicted_class]}")

# Get confidence scores
probabilities = torch.nn.functional.softmax(logits, dim=-1)
confidence = probabilities[0][predicted_class].item()
print(f"Confidence: {confidence:.2%}")
```

### Batch Prediction

```python
from transformers import pipeline

# Create an image-classification pipeline
classifier = pipeline("image-classification", model="YOUR_USERNAME/vit-deepfake-detector")

# Predict on a single image
result = classifier("path_to_image.jpg")
print(result)

# Predict on multiple images
images = ["image1.jpg", "image2.jpg", "image3.jpg"]
results = classifier(images)
for img, result in zip(images, results):
    print(f"{img}: {result}")
```

## Limitations and Biases

### Known Limitations

- **Dataset Bias:** The model was trained on the OpenForensics dataset, which may not represent all deepfake or manipulation techniques
- **Generalization:** Performance may degrade on generation methods not present in the training data
- **Adversarial Robustness:** The model has not been explicitly tested against adversarial attacks
- **Resolution Dependency:** Best performance on images around 224x224 pixels (the ViT input size)

### Potential Biases

The model's performance may vary across:

- Image sources and quality levels
- Demographic representations in images
- Types of manipulation techniques
- Content domains (faces, landscapes, objects, etc.)

## Ethical Considerations

- Use this model responsibly; do not use it for harassment or privacy invasion
- Decisions based on its output should involve human oversight, especially in high-stakes scenarios
- Deepfake detection is an evolving field, and no model is perfect
- False positives and false negatives can have real-world consequences

## Citation

If you use this model, please cite:

```bibtex
@misc{vit-deepfake-detector,
  author = {YOUR_NAME},
  title = {ViT Deepfake Detection Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/vit-deepfake-detector}}
}
```

## Authors

- Dr. Lucy Liu
- Muhammad Hamza Sohail
- Ayaan Mohammed
- Shadab Karim
- Kirti Dhir

## Disclaimer

This model is provided for research and educational purposes. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when deploying it.