Deepfake Detection via LoRA Fine-Tuned ViT

Binary classifier that distinguishes real portrait photos from AI-generated faces. It fine-tunes a pre-trained ViT-B/16 with LoRA adapters (PEFT), keeping more than 99% of the backbone frozen and adapting only the query and value attention projections. The adapters are merged into the base weights before export, so inference has no PEFT dependency.

Primary dataset: 140K Real and Fake Faces — 140 000 images, perfectly balanced, predefined train/valid/test split. Real faces from Flickr, fake faces generated with StyleGAN2.

Model Details

Property	Value
Backbone	ViT-B/16 (`google/vit-base-patch16-224`)
Adapter	LoRA r=16, target: query + value
Trained params	590k / 86M (0.68%)
Input	RGB image, 224×224, ImageNet normalisation
Output	Single logit (sigmoid → fake probability)
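
The trained-parameter figure in the table can be sanity-checked with a little arithmetic. The sketch below assumes the standard ViT-B/16 dimensions (12 encoder layers, hidden size 768) and that LoRA adds two low-rank matrices per adapted projection. It counts adapter weights only and ignores the classification head, so it is an approximation rather than the exact training-time count.

```python
# Rough count of LoRA adapter parameters for ViT-B/16.
# Assumptions: 12 encoder layers, hidden size 768, adapters on query and value only.
HIDDEN_SIZE = 768
NUM_LAYERS = 12
RANK = 16
ADAPTED_PROJECTIONS = 2  # query + value

# Each adapted projection gets A (rank x hidden) and B (hidden x rank).
params_per_projection = RANK * HIDDEN_SIZE + HIDDEN_SIZE * RANK
lora_params = NUM_LAYERS * ADAPTED_PROJECTIONS * params_per_projection

BACKBONE_PARAMS = 86_000_000  # approximate figure quoted in the table
fraction = lora_params / BACKBONE_PARAMS

print(f"LoRA adapter params: {lora_params:,}")
print(f"Share of backbone:   {fraction:.2%}")
```

The adapter count comes out at 589,824, which matches the rounded 590k in the table; the exact percentage depends on how precisely the 86M backbone figure is taken.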

Usage

import numpy as np
import onnxruntime as ort
from PIL import Image
from torchvision.transforms import CenterCrop, Compose, Normalize, Resize, ToTensor

# Preprocessing: resize, center-crop to 224×224, then apply ImageNet normalisation.
transform = Compose([
    Resize(256),
    CenterCrop(224),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load the quantized ONNX export.
session = ort.InferenceSession("model_quint8.onnx")

# Build a batch of one image with shape (1, 3, 224, 224).
img = Image.open("face.jpg").convert("RGB")
x = transform(img).unsqueeze(0).numpy()

# The model returns a single logit; the sigmoid converts it to a fake probability.
logit = session.run(None, {"input": x})[0][0, 0]
prob_fake = float(1 / (1 + np.exp(-logit)))
print(f"Fake probability: {prob_fake:.3f}")
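
The sigmoid in the snippet above can overflow for logits of very large magnitude. As a minimal sketch, the helper below computes the same probability in a numerically stable way and turns it into a label. The function names and the 0.5 decision threshold are assumptions for illustration; the source does not state which threshold was used for the reported metrics.

```python
import math


def stable_sigmoid(logit: float) -> float:
    """Sigmoid that avoids overflow for large-magnitude logits."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # logit < 0, so z lies in [0, 1)
    return z / (1.0 + z)


def label_from_logit(logit: float, threshold: float = 0.5) -> str:
    """Map a logit to 'fake' or 'real'. The 0.5 threshold is an assumption."""
    return "fake" if stable_sigmoid(logit) >= threshold else "real"


print(label_from_logit(3.2))     # confident fake
print(label_from_logit(-800.0))  # extreme logit, handled without overflow
```

Raising the threshold trades recall on fakes for fewer false alarms on real photos, which may matter more than raw accuracy in some deployments.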

Results

Dataset: 140K Real and Fake Faces (100k train / 20k val / 20k test, perfectly balanced).

Model: ViT-B/16 + LoRA (r=16, target: query + value projections).

Training: 10 epochs, AdamW, cosine LR with warmup, batch size 128.
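
To illustrate what a cosine learning-rate schedule with warmup looks like for this setup, here is a small sketch. The step count follows from the stated 100k training images, batch size 128, and 10 epochs. The peak learning rate, the warmup length, and the use of linear warmup are hypothetical values chosen for illustration, not settings taken from the training run.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# From the stated setup: 100k training images, batch size 128, 10 epochs.
steps_per_epoch = math.ceil(100_000 / 128)
total_steps = steps_per_epoch * 10

WARMUP_STEPS = 500  # hypothetical
BASE_LR = 1e-4      # hypothetical

for step in (0, WARMUP_STEPS - 1, total_steps // 2, total_steps - 1):
    lr = lr_at_step(step, total_steps, WARMUP_STEPS, BASE_LR)
    print(f"step {step:5d}: lr = {lr:.2e}")
```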

Classification (test set):

Model	Accuracy	AUROC	F1
PyTorch FP32	99.29%	99.98%	99.28%
ONNX FP32	99.29%	99.98%	99.28%
ONNX INT8	99.13%	99.97%	99.14%
ONNX UINT8	99.18%	99.97%	99.17%

Quantization benchmark (CPU, 100 inference runs, batch size 1):

Model	Size (MB)	Latency mean (ms)	Latency std (ms)	Size Δ	Latency Δ
ONNX FP32	327.5	136.3	36.8	—	—
ONNX INT8	82.9	46.9	10.2	−74.7%	−65.6%
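
A latency benchmark of this kind can be reproduced with a small timing harness. The sketch below times any zero-argument callable over 100 runs and reports mean and standard deviation in milliseconds. The `benchmark` helper and the warmup count are assumptions, not the script behind the table; in practice the callable would wrap a `session.run(...)` call on a single image.

```python
import statistics
import time


def benchmark(fn, runs: int = 100, warmup: int = 5):
    """Return (mean_ms, std_ms) over `runs` timed calls of fn()."""
    # Warmup calls are not timed; the count is an assumption.
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings_ms), statistics.stdev(timings_ms)


# Stand-in workload; replace with e.g. lambda: session.run(None, {"input": x}).
mean_ms, std_ms = benchmark(lambda: sum(range(10_000)))
print(f"mean = {mean_ms:.3f} ms, std = {std_ms:.3f} ms")
```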

The model converges rapidly: it already reaches 96.8% accuracy after epoch 2, with diminishing gains thereafter. LoRA keeps more than 99% of the backbone parameters frozen throughout, training only about 0.68% of the total (590k adapter parameters against the 86M-parameter ViT-B/16 backbone).

Dynamic INT8 quantization shrinks the model roughly 4× (327.5 MB to 82.9 MB) and cuts mean CPU latency roughly 3× (136.3 ms to 46.9 ms), at the cost of a 0.16 percentage point drop in test accuracy (99.29% to 99.13%).
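
The headline figures follow directly from the two tables above. As a quick check, this sketch recomputes the relative changes and ratios from the reported FP32 and INT8 numbers; the `relative_change` helper is introduced here for illustration only.

```python
# Reported values from the classification and quantization tables.
fp32 = {"size_mb": 327.5, "latency_ms": 136.3, "accuracy": 99.29}
int8 = {"size_mb": 82.9, "latency_ms": 46.9, "accuracy": 99.13}


def relative_change(new: float, old: float) -> float:
    """Percentage change from old to new (negative means a reduction)."""
    return (new - old) / old * 100.0


size_delta = relative_change(int8["size_mb"], fp32["size_mb"])
latency_delta = relative_change(int8["latency_ms"], fp32["latency_ms"])
accuracy_drop = fp32["accuracy"] - int8["accuracy"]  # in percentage points

print(f"Size change:    {size_delta:.1f}%")
print(f"Latency change: {latency_delta:.1f}%")
print(f"Size ratio:     {fp32['size_mb'] / int8['size_mb']:.2f}x")
print(f"Latency ratio:  {fp32['latency_ms'] / int8['latency_ms']:.2f}x")
print(f"Accuracy drop:  {accuracy_drop:.2f} pp")
```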
