---
license: mit
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- google/vit-base-patch16-224-in21k
library_name: transformers
tags:
- deepfake detection
- fake-image detection
---
# ViT Deepfake Detection Model

## Model Description

This is a fine-tuned Vision Transformer (ViT) model for binary image classification to detect deepfake images. The model is based on `google/vit-base-patch16-224-in21k` and has been fine-tuned on the OpenForensics dataset to distinguish between real and fake (AI-generated/manipulated) images.

## Model Details

- **Model Type:** Vision Transformer (ViT) for Image Classification
- **Base Model:** google/vit-base-patch16-224-in21k
- **Task:** Binary Image Classification (Real vs Fake Detection)
- **Language:** N/A (Computer Vision)
- **License:** MIT

## Intended Use

### Primary Use Cases
- Detecting AI-generated or manipulated images
- Content moderation and verification
- Research in deepfake detection
- Media authenticity verification

### Out-of-Scope Use
- This model should not be used as the sole method for making critical decisions about content authenticity
- Not intended for surveillance or privacy-invasive applications
- May not generalize well to deepfake techniques not present in the training data

## Training Data

The model was trained on the **OpenForensics dataset** with the following distribution:

- **Training Set:** 16,000 images
- **Validation Set:** 2,000 images
- **Test Set:** 2,000 images

Images were preprocessed and transformed using ViTImageProcessor with standard normalization.
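
As a rough illustration of what "standard normalization" means here, the sketch below applies the rescale-then-normalize step to a single RGB pixel in plain Python. The mean and standard deviation of 0.5 per channel are an assumption based on the base checkpoint's usual processor configuration; the card itself does not list them, and the resize to 224x224 that happens beforehand is omitted.

```python
# Assumed values (not stated in this card): rescale by 1/255, then
# normalize each channel with mean 0.5 and std 0.5.
RESCALE = 1 / 255
MEAN = 0.5
STD = 0.5


def normalize_pixel(value: int) -> float:
    """Map an 8-bit channel value (0-255) to the normalized range."""
    return (value * RESCALE - MEAN) / STD


def normalize_rgb(pixel):
    """Normalize an (R, G, B) tuple channel by channel."""
    return tuple(normalize_pixel(channel) for channel in pixel)


# Black maps to about -1, mid-gray to about 0, white to about +1.
print(normalize_rgb((0, 128, 255)))
```

Under these assumed values, pixel intensities end up roughly in the range [-1, 1] before they reach the model.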

## Training Procedure

### Hyperparameters

```text
Training Arguments:
- Batch Size: 24 per device
- Gradient Accumulation Steps: 1
- Mixed Precision: FP16
- Number of Epochs: 10
- Learning Rate: 3e-5
- Weight Decay: 0.02
- Warmup Ratio: 0.08
- LR Scheduler: Cosine
- Label Smoothing: 0.05
- Optimizer: AdamW (default)
```
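
To make the warmup ratio and cosine scheduler concrete, here is a minimal sketch of the learning-rate curve these settings would produce. It assumes a single GPU, that the last partial batch of each epoch is kept, and that warmup steps are rounded up; none of these details are stated in the card, so the step counts are illustrative only.

```python
import math

BASE_LR = 3e-5
EPOCHS = 10
BATCH_SIZE = 24
TRAIN_IMAGES = 16_000
WARMUP_RATIO = 0.08

# Assumptions: one device, gradient accumulation of 1, partial batch kept.
steps_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup from 0 to BASE_LR, then cosine decay back to 0."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points along the schedule.
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

With these assumptions the learning rate climbs for roughly the first 8% of optimizer steps, reaches 3e-5, and then decays smoothly toward zero by the end of epoch 10.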

### Training Hardware
- GPU: Tesla T4
- Training Time: ~14 minutes for 10 epochs

### Data Augmentation
Standard ViT preprocessing with normalization applied via `ViTImageProcessor`.

## Performance

### Validation Set Results (Best Epoch - Epoch 5)

| Metric | Score |
|--------|-------|
| Accuracy | 96.22% |
| F1 Score | 96.22% |
| Precision | 96.30% |
| Recall | 96.22% |
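
The card does not say how precision, recall, and F1 were averaged across the two classes. Recall matching accuracy to two decimals is what support-weighted averaging would give, since weighted recall always equals accuracy. The sketch below computes weighted metrics from a confusion matrix under that assumption; the counts are hypothetical and are not the model's actual evaluation results.

```python
def weighted_metrics(confusion):
    """Support-weighted precision, recall, and F1.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])  # true samples of class c
        predicted = sum(confusion[r][c] for r in range(n_classes))
        p = confusion[c][c] / predicted if predicted else 0.0
        r = confusion[c][c] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
example = [
    [480, 20],  # true "real": predicted real, predicted fake
    [15, 485],  # true "fake": predicted real, predicted fake
]
metrics = weighted_metrics(example)
print({name: round(value, 4) for name, value in metrics.items()})
```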

### Test Set Results

| Metric | Score |
|--------|-------|
| Accuracy | **96.56%** |

### Training Progress

Training loss fell steadily through epoch 5 and then flattened. Validation accuracy peaked at epoch 5 and settled at 95.67-95.78% for epochs 6-10:

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|-------|---------------|-----------------|----------|----------|
| 1 | 0.2259 | 0.2567 | 92.89% | 92.88% |
| 2 | 0.2002 | 0.2360 | 93.44% | 93.43% |
| 3 | 0.1388 | 0.1925 | 96.11% | 96.11% |
| 4 | 0.1322 | 0.2161 | 95.67% | 95.67% |
| 5 | 0.1182 | 0.2208 | **96.22%** | **96.22%** |
| 6-10 | 0.1170-0.1171 | 0.2132-0.2142 | 95.67-95.78% | 95.67-95.78% |
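
Which epoch counts as "best" depends on the selection metric. The short sketch below picks the best epoch from the table by validation accuracy and by validation loss. Only epochs 1-5 are included because epochs 6-10 are reported as ranges; the `history` structure is a hypothetical container for the table values, not something produced by the training run.

```python
# Validation results copied from the table above (epochs 1-5 only).
history = [
    {"epoch": 1, "val_loss": 0.2567, "accuracy": 92.89},
    {"epoch": 2, "val_loss": 0.2360, "accuracy": 93.44},
    {"epoch": 3, "val_loss": 0.1925, "accuracy": 96.11},
    {"epoch": 4, "val_loss": 0.2161, "accuracy": 95.67},
    {"epoch": 5, "val_loss": 0.2208, "accuracy": 96.22},
]

# Two common selection criteria can disagree on the best checkpoint.
best_by_accuracy = max(history, key=lambda row: row["accuracy"])
best_by_loss = min(history, key=lambda row: row["val_loss"])

print("Best by accuracy: epoch", best_by_accuracy["epoch"])
print("Best by validation loss: epoch", best_by_loss["epoch"])
```

Selecting on accuracy gives epoch 5, the checkpoint reported above, while selecting on validation loss would give epoch 3.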

## Usage

### Loading the Model

```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/vit-deepfake-detector")

# Load the image as RGB so grayscale or RGBA files yield the 3 channels the model expects, then preprocess it
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()

# Get label
labels = {0: "real", 1: "fake"}
print(f"Prediction: {labels[predicted_class]}")

# Get confidence scores
probabilities = torch.nn.functional.softmax(logits, dim=-1)
confidence = probabilities[0][predicted_class].item()
print(f"Confidence: {confidence:.2%}")
```

### Batch Prediction

```python
from transformers import pipeline

# Create classification pipeline
classifier = pipeline("image-classification", model="YOUR_USERNAME/vit-deepfake-detector")

# Predict on single image
result = classifier("path_to_image.jpg")
print(result)

# Predict on multiple images
images = ["image1.jpg", "image2.jpg", "image3.jpg"]
results = classifier(images)
for img, result in zip(images, results):
    print(f"{img}: {result}")
```
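
For a list of images, the pipeline typically returns one list of `{"label", "score"}` dictionaries per image. The sketch below shows one way to reduce that to a single top prediction per image. The `sample_results` values are made up for illustration, and the label strings are an assumption: the actual names depend on the `id2label` mapping stored in the model's configuration.

```python
def top_prediction(predictions):
    """Return the highest-scoring entry from one image's predictions."""
    return max(predictions, key=lambda entry: entry["score"])


# Hypothetical pipeline output for two images (not real model results).
sample_results = [
    [{"label": "fake", "score": 0.97}, {"label": "real", "score": 0.03}],
    [{"label": "real", "score": 0.88}, {"label": "fake", "score": 0.12}],
]

for name, predictions in zip(["image1.jpg", "image2.jpg"], sample_results):
    best = top_prediction(predictions)
    print(f"{name}: {best['label']} ({best['score']:.2%})")
```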

## Limitations and Biases

### Known Limitations
- **Dataset Bias:** The model was trained on the OpenForensics dataset, which may not represent all types of deepfakes or manipulation techniques
- **Generalization:** Performance may degrade on deepfake generation methods not present in the training data
- **Adversarial Robustness:** The model has not been explicitly tested against adversarial attacks
- **Resolution Dependency:** Inputs are resized to 224x224 pixels (the ViT input size), so performance is expected to be best on images near that resolution

### Potential Biases
- The model's performance may vary across different:
  - Image sources and quality levels
  - Demographic representations in images
  - Types of manipulation techniques
  - Content domains (faces, landscapes, objects, etc.)

## Ethical Considerations

- This model should be used responsibly and not for harassment or privacy invasion
- Decisions based on this model should involve human oversight, especially in high-stakes scenarios
- Users should be aware that deepfake detection is an evolving field, and no model is perfect
- False positives and false negatives can have real-world consequences
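
One way to build human oversight into a workflow is to route low-confidence predictions to a reviewer instead of acting on them automatically. The sketch below illustrates the idea on raw logits. The 0.90 threshold, the `triage` helper, and the `needs_human_review` label are all hypothetical choices for illustration; a real threshold should be calibrated on held-out data.

```python
import math

REVIEW_THRESHOLD = 0.90  # hypothetical cutoff, not tuned on any data


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def triage(logits, labels=("real", "fake"), threshold=REVIEW_THRESHOLD):
    """Return (decision, confidence); defer to a human below the threshold."""
    probabilities = softmax(logits)
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = probabilities[index]
    decision = labels[index] if confidence >= threshold else "needs_human_review"
    return decision, confidence


print(triage([2.0, -2.0]))  # confident prediction is accepted
print(triage([0.3, 0.1]))   # near-tie is deferred to a reviewer
```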

## Citation

If you use this model, please cite:

```bibtex
@misc{vit-deepfake-detector,
  author = {YOUR_NAME},
  title = {ViT Deepfake Detection Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/vit-deepfake-detector}}
}
```

## Authors

- Dr. Lucy Liu
- Muhammad Hamza Sohail
- Ayaan Mohammed
- Shadab Karim
- Kirti Dhir

## Disclaimer

This model is provided for research and educational purposes. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when deploying this model.