---
license: mit
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- google/vit-base-patch16-224-in21k
library_name: transformers
tags:
- deepfake detection
- fake-image detection
---

# ViT Deepfake Detection Model

## Model Description

This is a Vision Transformer (ViT) fine-tuned for binary image classification: it labels an image as real or fake (AI-generated or manipulated). The model is based on `google/vit-base-patch16-224-in21k` and was fine-tuned on the OpenForensics dataset.

## Model Details

- **Model Type:** Vision Transformer (ViT) for Image Classification
- **Base Model:** google/vit-base-patch16-224-in21k
- **Task:** Binary Image Classification (Real vs Fake Detection)
- **Language:** N/A (Computer Vision)
- **License:** Apache 2.0

## Intended Use

### Primary Use Cases

- Detecting AI-generated or manipulated images
- Content moderation and verification
- Research in deepfake detection
- Media authenticity verification

### Out-of-Scope Use

- This model should not be used as the sole method for making critical decisions about content authenticity
- Not intended for surveillance or privacy-invasive applications
- May not generalize well to deepfake techniques not present in the training data

## Training Data

The model was trained on the **OpenForensics dataset** with the following split:

- **Training Set:** 16,000 images
- **Validation Set:** 2,000 images
- **Test Set:** 2,000 images

Images were preprocessed with `ViTImageProcessor` using its standard normalization.

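
As a rough illustration of what "standard normalization" means here, the sketch below reproduces the per-pixel arithmetic in plain Python. It assumes the processor's usual defaults for this base checkpoint (rescale by 1/255, then a mean of 0.5 and a standard deviation of 0.5 per channel). These values are an assumption, not taken from this card, and can be checked in the saved `preprocessor_config.json`.

```python
# Assumed defaults for ViTImageProcessor on this base checkpoint
# (verify against preprocessor_config.json before relying on them).
RESCALE_FACTOR = 1 / 255
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def normalize_pixel(value: int) -> float:
    """Map an 8-bit channel value (0-255) to the range the model sees."""
    rescaled = value * RESCALE_FACTOR           # 0..255 -> 0.0..1.0
    return (rescaled - IMAGE_MEAN) / IMAGE_STD  # 0.0..1.0 -> -1.0..1.0


for raw in (0, 128, 255):
    print(raw, round(normalize_pixel(raw), 4))
```

Under these assumptions, black pixels map to roughly -1, white pixels to roughly +1, and mid-gray lands near 0.
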
## Training Procedure

### Hyperparameters

```python
from transformers import TrainingArguments

# A sketch of the reported settings written as TrainingArguments.
# output_dir is a placeholder; the evaluation batch size was not reported.
training_args = TrainingArguments(
    output_dir="./vit-deepfake-detector",  # placeholder path
    per_device_train_batch_size=24,        # Batch Size: 24 per device
    gradient_accumulation_steps=1,
    fp16=True,                             # Mixed Precision: FP16
    num_train_epochs=10,
    learning_rate=3e-5,
    weight_decay=0.02,
    warmup_ratio=0.08,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.05,
    # Optimizer: AdamW (the Trainer default)
)
```

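
To show how the warmup ratio and the cosine scheduler interact, the sketch below computes the learning rate at a few steps in plain Python, using linear warmup followed by a half-cosine decay to zero. The step counts are estimates that assume a single GPU and the 16,000-image training split; they are not logged values from the actual run.

```python
import math

# Values from the hyperparameters above.
BASE_LR = 3e-5
WARMUP_RATIO = 0.08
EPOCHS = 10
BATCH_SIZE = 24

# Assumption: one GPU and the 16,000-image training split, so the
# step counts below are estimates rather than logged values.
TRAIN_IMAGES = 16_000
STEPS_PER_EPOCH = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then half-cosine decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(STEPS_PER_EPOCH, TOTAL_STEPS, WARMUP_STEPS)
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate climbs to 3e-5 during the first 8% of steps and then decays smoothly toward zero by the final step.
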
### Training Hardware

- GPU: Tesla T4
- Training Time: ~14 minutes for 10 epochs

### Data Augmentation

Standard ViT preprocessing with normalization applied via `ViTImageProcessor`.

## Performance

### Validation Set Results (Best Epoch: Epoch 5)

| Metric | Score |
|--------|-------|
| Accuracy | 96.22% |
| F1 Score | 96.22% |
| Precision | 96.30% |
| Recall | 96.22% |

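
The card does not say how precision, recall, and F1 were averaged across the two classes. The identical accuracy and recall figures would be consistent with support-weighted averaging, where weighted recall always equals accuracy, but that reading is an assumption. The sketch below computes weighted metrics from a confusion matrix in plain Python; the counts are made up for illustration and are not this model's results.

```python
# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class.
# Class 0 = real, class 1 = fake. Counts are illustrative only.
confusion = [
    [480, 20],   # true real: 480 predicted real, 20 predicted fake
    [30, 470],   # true fake: 30 predicted real, 470 predicted fake
]


def weighted_metrics(cm):
    """Return accuracy and support-weighted precision, recall, and F1."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for c in range(n_classes):
        support = sum(cm[c])                                 # true samples of class c
        predicted = sum(cm[r][c] for r in range(n_classes))  # samples predicted as c
        p = cm[c][c] / predicted if predicted else 0.0
        r = cm[c][c] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


acc, prec, rec, f1 = weighted_metrics(confusion)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

With weighted averaging, each class's recall is multiplied by its share of the samples, so the sum reduces to correct predictions divided by total samples, which is the accuracy.
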
### Test Set Results

| Metric | Score |
|--------|-------|
| Accuracy | **96.56%** |

### Training Progress

Training loss fell steadily through epoch 5 and then flattened. Validation accuracy rose to 96.11% by epoch 3, dipped at epoch 4, peaked at 96.22% at epoch 5, and plateaued from epoch 6 onward. Validation loss was lowest at epoch 3 (0.1925).

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|-------|---------------|-----------------|----------|----------|
| 1 | 0.2259 | 0.2567 | 92.89% | 92.88% |
| 2 | 0.2002 | 0.2360 | 93.44% | 93.43% |
| 3 | 0.1388 | 0.1925 | 96.11% | 96.11% |
| 4 | 0.1322 | 0.2161 | 95.67% | 95.67% |
| 5 | 0.1182 | 0.2208 | **96.22%** | **96.22%** |
| 6-10 | 0.1170-0.1171 | 0.2132-0.2142 | 95.67-95.78% | 95.67-95.78% |

## Usage

### Loading the Model

```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")

# Load and preprocess image (convert to RGB so grayscale or RGBA files also work)
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()

# Get label
labels = {0: "real", 1: "fake"}
print(f"Prediction: {labels[predicted_class]}")

# Get confidence scores
probabilities = torch.nn.functional.softmax(logits, dim=-1)
confidence = probabilities[0][predicted_class].item()
print(f"Confidence: {confidence:.2%}")
```

### Batch Prediction

```python
from transformers import pipeline

# Create classification pipeline
classifier = pipeline("image-classification", model="YOUR_USERNAME/vit-deepfake-detector")

# Predict on single image
result = classifier("path_to_image.jpg")
print(result)

# Predict on multiple images
images = ["image1.jpg", "image2.jpg", "image3.jpg"]
results = classifier(images)
for img, result in zip(images, results):
    print(f"{img}: {result}")
```

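
For each image, the image-classification pipeline returns a list of `{"label", "score"}` dictionaries. A minimal sketch of picking the top prediction from such a result follows; `top_prediction` is a hypothetical helper, and the sample data is invented so the snippet runs without downloading the model.

```python
def top_prediction(result):
    """Return (label, score) for the highest-scoring entry of one image's result."""
    best = max(result, key=lambda entry: entry["score"])
    return best["label"], best["score"]


# Hypothetical pipeline output for two images; the label strings depend on
# the model's id2label config and may be "LABEL_0"/"LABEL_1" instead.
sample_results = [
    [{"label": "fake", "score": 0.97}, {"label": "real", "score": 0.03}],
    [{"label": "real", "score": 0.88}, {"label": "fake", "score": 0.12}],
]

for index, result in enumerate(sample_results):
    label, score = top_prediction(result)
    print(f"image {index}: {label} ({score:.2%})")
```
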
## Limitations and Biases

### Known Limitations

- **Dataset Bias:** The model was trained on the OpenForensics dataset, which may not represent all types of deepfakes or manipulation techniques
- **Generalization:** Performance may degrade on deepfake generation methods not present in the training data
- **Adversarial Robustness:** The model has not been explicitly tested against adversarial attacks
- **Resolution Dependency:** Best performance on images around 224x224 pixels (ViT input size)

### Potential Biases

- The model's performance may vary across different:
  - Image sources and quality levels
  - Demographic representations in images
  - Types of manipulation techniques
  - Content domains (faces, landscapes, objects, etc.)

## Ethical Considerations

- This model should be used responsibly and not for harassment or privacy invasion
- Decisions based on this model should involve human oversight, especially in high-stakes scenarios
- Users should be aware that deepfake detection is an evolving field, and no model is perfect
- False positives and false negatives can have real-world consequences

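
One way to keep a human in the loop is to act automatically only on confident predictions and route everything else to a reviewer. The following is a minimal sketch, assuming a softmax probability for the fake class is available. The function name and the 0.90 threshold are hypothetical, and the threshold would need tuning on validation data.

```python
def triage(fake_probability: float, threshold: float = 0.90) -> str:
    """Map the model's fake-class probability to an action.

    Confident predictions get a provisional label; anything in between
    is sent to a human reviewer. The threshold is a placeholder.
    """
    if not 0.0 <= fake_probability <= 1.0:
        raise ValueError("fake_probability must be between 0 and 1")
    if fake_probability >= threshold:
        return "likely fake"
    if fake_probability <= 1.0 - threshold:
        return "likely real"
    return "needs human review"


for p in (0.99, 0.60, 0.02):
    print(p, triage(p))
```
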
## Citation

If you use this model, please cite:

```bibtex
@misc{vit-deepfake-detector,
  author = {YOUR_NAME},
  title = {ViT Deepfake Detection Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/vit-deepfake-detector}}
}
```

## Authors

- Dr. Lucy Liu
- Muhammad Hamza Sohail
- Ayaan Mohammed
- Shadab Karim
- Kirti Dhir

## Disclaimer

This model is provided for research and educational purposes. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when deploying this model.