Deepfake Detection with SigLIP and LoRA

Overview

This project fine-tunes a pre-trained SigLIP (Sigmoid Loss for Language-Image Pre-training) model for deepfake detection. Using transfer learning and Low-Rank Adaptation (LoRA), we adapt a google/siglip-base-patch16-512 model (initialized from prithivMLmods/deepfake-detector-model-v1) to classify images as either "Real" or "Fake".

The model is trained on a balanced subset of the OpenRL/DeepFakeFace dataset, containing 12,000 images. Due to hardware constraints, this subset was carefully selected to ensure diverse representation of various generative techniques (Stable Diffusion, Inpainting, InsightFace).

Key Features

  • State-of-the-Art Architecture: Utilizes SigLIP, which offers better performance and stability compared to standard ViT models by using a pairwise sigmoid loss for image-text pre-training.
  • Parameter-Efficient Fine-Tuning: Uses LoRA (Low-Rank Adaptation) to fine-tune the model with significantly fewer trainable parameters (r=16), making training feasible on consumer GPUs.
  • Robust Data Augmentation: Implements an extensive augmentation pipeline including ColorJitter, RandomResizedCrop, RandomRotation, RandomAdjustSharpness, and GaussianBlur to improve model generalization.
  • High Accuracy: Achieves ~89.57% accuracy (90.55% precision) on the test set, effectively distinguishing real faces from AI-generated content.

Model Architecture

The backbone is google/siglip-base-patch16-512, a SigLIP vision encoder (~92.9M parameters) topped with a two-class head ("Real" / "Fake"). LoRA adapters (rank r=16) are added for fine-tuning, so only a small fraction of the weights is updated.

Dataset

The project uses the DeepFakeFace (DFF) dataset OpenRL/DeepFakeFace. A balanced subset of 12,000 images was curated using a custom selection script.

Data Distribution (12,000 Images Total)

| Class | Count | Source / Generator | Description                          |
|-------|-------|--------------------|--------------------------------------|
| Real  | 6,000 | wiki dataset       | Real human faces from Wikipedia      |
| Fake  | 2,000 | text2img           | Generated via Stable Diffusion v1.5  |
| Fake  | 2,000 | inpainting         | Generated via SD Inpainting          |
| Fake  | 2,000 | insight            | Generated via InsightFace            |

Data Splits

  • Train: 70% (8,400 images)
  • Validation: 20% (2,401 images)
  • Test: 10% (1,199 images)
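
A 70/20/10 split like the one above can be produced with a simple shuffled index cut. This is a pure-Python sketch, not the project's actual selection script; exact split sizes can differ by an image or two depending on rounding:

```python
import random

def split_indices(n, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/val/test partitions."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (indices[:n_train],                    # 70% train
            indices[n_train:n_train + n_val],     # 20% validation
            indices[n_train + n_val:])            # remaining 10% test

train_idx, val_idx, test_idx = split_indices(12_000)
# With this rounding scheme: 8,400 / 2,400 / 1,200 images
```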

Training Details

The model was trained using the Hugging Face Trainer API with the following configuration:

Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • Scheduler: Cosine with Warmup (ratio 0.1)
  • Batch Size: 16 (Train) / 32 (Eval)
  • Epochs: 10
  • Weight Decay: 0.01
  • Precision: FP16 (Mixed Precision)
  • Loss Function: CrossEntropyLoss (the Trainer default for classification)
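
The hyperparameters above map onto the Hugging Face TrainingArguments roughly as follows. The output directory and evaluation/saving cadence are illustrative assumptions, and fp16 requires a CUDA device:

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; output_dir and the
# epoch-level eval/save cadence are assumptions, not taken from the card.
training_args = TrainingArguments(
    output_dir="siglip-deepfake-detector",  # assumed
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    fp16=True,  # mixed precision; needs a GPU
)
```

AdamW is the Trainer's default optimizer, so it does not need to be configured explicitly.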

Data Augmentation

To prevent overfitting, the following transformations are applied during training:

  1. RandomResizedCrop: Crop covers 80-100% of the original image area, resized to the model's input size.
  2. RandomHorizontalFlip: Probability 0.5.
  3. RandomRotation: Up to 15 degrees.
  4. ColorJitter: Brightness (±10%), Contrast (±10%), Saturation (±10%), Hue (±5%).
  5. RandomAdjustSharpness: Factor 2, Probability 0.3.
  6. GaussianBlur: Kernel size 3.
  7. Normalization: Standard Mean/Std.

Performance (Test Set)

  • Training & validation loss curves: siglip-training_curves.png
  • Confusion matrix: siglip-confusion_matrix.png
  • Test Accuracy: 89.57%
  • Test F1-Score: 89.51%
  • Test Precision: 90.55%
  • Test Recall: 89.57%
  • Test Loss: 0.2661
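
Metrics like these can be produced by a compute_metrics callback passed to the Trainer. This sketch uses scikit-learn; the weighted averaging is an assumption about how the scores were aggregated:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Turn the Trainer's (logits, labels) pair into the reported metrics."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```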

Inference

1. Using Hugging Face Pipeline

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline("image-classification", model="shunda012/siglip-deepfake-detector")

# Predict on an image
image_path = "path_to_image.jpg"
result = pipe(image_path)
print(result)
```

2. Using PyTorch

```python
from transformers import SiglipForImageClassification, AutoImageProcessor
import torch
from PIL import Image

# Load model & processor
model = SiglipForImageClassification.from_pretrained("shunda012/siglip-deepfake-detector")
processor = AutoImageProcessor.from_pretrained("shunda012/siglip-deepfake-detector")

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load and preprocess the image
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.softmax(logits, dim=-1)
    predicted_class_idx = torch.argmax(probs, dim=-1).item()

# Map the predicted index back to its label
id2label = model.config.id2label
predicted_label = id2label[predicted_class_idx]
confidence = probs[0][predicted_class_idx].item()

print(f"Prediction: {predicted_label} ({confidence:.2%})")
print(f"Probabilities: {probs[0].tolist()}")
```

Limitations

  • Hardware and Data Limits: The model was trained on a 12,000-image subset due to hardware constraints; training on the full ~120k-image dataset would likely improve accuracy further.
  • Generalization: While the dataset includes Stable Diffusion, Inpainting, and InsightFace, the model may struggle with newer generation methods (e.g., Flux, Midjourney v6) that were not present in the training set.
  • Resolution: The model operates at a fixed resolution (typically 512x512 for this SigLIP variant). Resize artifacts or extremely high-resolution inputs might affect detection performance.

Future Work

  • Video Analysis: Extend the model to process video inputs frame-by-frame to utilize temporal consistency for better detection.
  • Multimodal Detection: Incorporate audio analysis to detect lip-sync inconsistencies in deepfake videos.
  • Adversarial Training: Train against adversarial examples to make the model more robust against evasion attacks.
  • Expand Dataset: Incorporate newer generative models into the training set to keep up with the rapidly evolving generative AI landscape.