Deepfake Detection via LoRA Fine-Tuned ViT
Binary classifier distinguishing real portrait photos from AI-generated faces. Fine-tunes a pre-trained ViT-B/16 using LoRA adapters (PEFT), keeping more than 99% of parameters frozen while adapting only the attention query and value projections. LoRA adapters are merged into the backbone weights before export, so there is no PEFT dependency at inference time.
Primary dataset: 140K Real and Fake Faces (140,000 images, perfectly balanced, predefined train/valid/test split). Real faces come from Flickr; fake faces were generated with StyleGAN2.
Model Details
| Property | Value |
|---|---|
| Backbone | ViT-B/16 (google/vit-base-patch16-224) |
| Adapter | LoRA r=16, target: query + value |
| Trained params | 590k / 86M (0.68%) |
| Input | RGB image, 224×224, ImageNet normalisation |
| Output | Single logit (sigmoid → fake probability) |
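The "590k" figure in the table can be reproduced with a short calculation. This is a sketch that assumes the standard ViT-B/16 dimensions (hidden width 768, 12 encoder layers), which are not stated in this README, and it counts adapter weights only, excluding any classification head.

```python
# Back-of-the-envelope LoRA parameter count for ViT-B/16.
# HIDDEN_DIM and NUM_LAYERS are the standard ViT-B/16 values (assumed here).
HIDDEN_DIM = 768        # embedding width of each attention projection
NUM_LAYERS = 12         # transformer encoder blocks
RANK = 16               # LoRA rank r, from the table above
TARGETS_PER_LAYER = 2   # query and value projections


def lora_param_count(hidden_dim: int, num_layers: int, rank: int, targets_per_layer: int) -> int:
    """Count trainable LoRA weights: each adapted projection adds A (r x d) and B (d x r)."""
    per_projection = rank * hidden_dim + hidden_dim * rank
    return num_layers * targets_per_layer * per_projection


adapter_params = lora_param_count(HIDDEN_DIM, NUM_LAYERS, RANK, TARGETS_PER_LAYER)
print(adapter_params)  # 589824
```

The result, 589,824, rounds to the 590k reported above.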
Usage
import numpy as np
import onnxruntime as ort
from PIL import Image
from torchvision.transforms import CenterCrop, Compose, Normalize, Resize, ToTensor
transform = Compose([
    Resize(256),
    CenterCrop(224),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])
session = ort.InferenceSession("model_quint8.onnx")
img = Image.open("face.jpg").convert("RGB")
x = transform(img).unsqueeze(0).numpy()  # add batch dimension: shape (1, 3, 224, 224)
logit = session.run(None, {"input": x})[0][0, 0]  # "input" must match the exported graph's input name
prob_fake = float(1 / (1 + np.exp(-logit)))  # sigmoid maps the logit to a fake probability
print(f"Fake probability: {prob_fake:.3f}")
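To turn the logit into a hard label, the sigmoid can be paired with a decision threshold. The sketch below is illustrative only: the helper names `sigmoid` and `classify` and the 0.5 threshold are assumptions, not part of the released model. It also uses a branch-on-sign form of the sigmoid so that extreme logits do not overflow.

```python
import math


def sigmoid(logit: float) -> float:
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def classify(logit: float, threshold: float = 0.5) -> str:
    """Map a raw logit to 'fake' or 'real'. The 0.5 threshold is an assumed default."""
    return "fake" if sigmoid(logit) >= threshold else "real"


print(classify(3.2))      # fake
print(classify(-1000.0))  # real
```

A different threshold may be preferable when false positives and false negatives carry different costs.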
Results
Dataset: 140K Real and Fake Faces, 100k train / 20k val / 20k test, perfectly balanced.
Model: ViT-B/16 + LoRA (r=16, target: query + value projections).
Training: 10 epochs, AdamW, cosine LR with warmup, batch size 128.
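The training setup names a cosine learning-rate schedule with warmup but does not give its hyperparameters. The following is a minimal sketch of one common form (linear warmup, then cosine decay). `base_lr`, `warmup_steps`, and `min_lr` are hypothetical values chosen for illustration, and the step count assumes 100k training images at batch size 128 with the final partial batch kept.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               base_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# 100k images / batch size 128 -> 782 steps per epoch (assumed), 10 epochs.
steps_per_epoch = math.ceil(100_000 / 128)
total_steps = steps_per_epoch * 10

# Hypothetical hyperparameters, not taken from the actual training run.
base_lr = 1e-4
warmup_steps = 500

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at_step(step, total_steps, warmup_steps, base_lr))
```

The rate peaks at `base_lr` when warmup ends and falls to `min_lr` at the final step.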
Classification (test set):
| Model | Accuracy | AUROC | F1 |
|---|---|---|---|
| PyTorch FP32 | 99.29% | 99.98% | 99.28% |
| ONNX FP32 | 99.29% | 99.98% | 99.28% |
| ONNX INT8 | 99.13% | 99.97% | 99.14% |
| ONNX UINT8 | 99.18% | 99.97% | 99.17% |
Quantization benchmark (CPU, 100 inference runs, batch size 1):
| Model | Size (MB) | Latency mean (ms) | Latency std (ms) | Size Δ | Latency Δ |
|---|---|---|---|---|---|
| ONNX FP32 | 327.5 | 136.3 | 36.8 | - | - |
| ONNX INT8 | 82.9 | 46.9 | 10.2 | −74.7% | −65.6% |
The model converges rapidly: 96.8% accuracy is already reached after epoch 2, with diminishing gains thereafter. LoRA trains only about 0.68% of all parameters (590k adapter parameters alongside the 86M-parameter ViT-B/16 backbone) and leaves the rest frozen throughout.
Dynamic INT8 quantization reduces model size by about 4× (327.5 MB to 82.9 MB) and mean CPU latency by about 3× (136.3 ms to 46.9 ms), at a cost of 0.16 percentage points of test accuracy (99.29% to 99.13%).
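The size and latency figures can be checked directly against the tables. This short sketch recomputes the relative changes, the compression and speed-up ratios, and the accuracy drop from the reported numbers; the dictionary layout and helper name are illustrative.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline (negative means a reduction)."""
    return (value - baseline) / baseline * 100.0


# Figures copied from the results tables.
fp32 = {"size_mb": 327.5, "latency_ms": 136.3, "accuracy": 99.29}
int8 = {"size_mb": 82.9, "latency_ms": 46.9, "accuracy": 99.13}

size_delta = relative_change(fp32["size_mb"], int8["size_mb"])
latency_delta = relative_change(fp32["latency_ms"], int8["latency_ms"])
size_ratio = fp32["size_mb"] / int8["size_mb"]
latency_ratio = fp32["latency_ms"] / int8["latency_ms"]
accuracy_drop = fp32["accuracy"] - int8["accuracy"]

print(f"Size delta:    {size_delta:.1f}%  ({size_ratio:.2f}x smaller)")       # -74.7%, 3.95x
print(f"Latency delta: {latency_delta:.1f}%  ({latency_ratio:.2f}x faster)")  # -65.6%, 2.91x
print(f"Accuracy drop: {accuracy_drop:.2f} percentage points")                # 0.16
```

The exact ratios are 3.95× for size and 2.91× for latency.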